The Apple MLX path to a working assistant

I spent a few weekends pushing a local-first assistant onto core-01, my M4 Max Mac Studio with 64GB of unified memory. The path that works is Apple MLX. Here is why I picked it over llama.cpp, Ollama, and vLLM, and what the tooling actually feels like in 2026.

I have been chasing the same thing for about a year now: a local-first assistant that runs on hardware I own, holds enough context to actually be useful, and doesn't phone home for every token. The hardware question I mostly settled last fall: core-01, my Mac Studio M4 Max with 64GB of unified memory, has turned out to be the right call for the small-shop inference work I do. The framework question took longer, and the answer I keep landing on after running it for real is Apple MLX.

Here's the writeup of why. Not the marketing version. The version I'd give to another engineer who is staring at the same matrix of options and trying to decide where to put their weekend.

The candidate set

Four serious options if you want to run modern open-weights models on a Mac in 2026:

  1. llama.cpp, the long-standing C++ inference engine with Metal support. The thing that made local LLMs viable in the first place, and still the default for a lot of people.
  2. Ollama, the user-friendly wrapper around llama.cpp. Great onboarding, easy model pulls, clean API.
  3. vLLM, the production-grade inference server everyone uses on NVIDIA. Has some Apple Silicon support now, but the support is best described as "experimental" and the optimization story is entirely CUDA-shaped.
  4. MLX. Apple's array framework, purpose-built for Apple Silicon's unified memory architecture. Open source, Python and Swift bindings, with mlx-lm as the de facto LLM frontend.

I tried all four on core-01 with the same workload, a 30B-class model handling a mix of code-completion-style requests and longer-context document work. What follows is what shook out.

Why MLX won for me

The honest answer is that MLX is the only one of the four that was designed for the machine I'm running it on. Everything else is a port.

llama.cpp does a respectable job of pretending Metal is just-another-backend, but you can feel the CUDA-shaped grooves underneath. Kernel choices, memory layout assumptions, the way the KV cache gets managed: all of it was built for discrete GPUs with separate VRAM and adapted to Apple Silicon as a second act. Performance is fine. Tokens per second on a 30B-class model at 4-bit quantization is in the same general range as MLX, if measurably behind it. But the ceiling is lower, and you hit it.

MLX, by contrast, was written by Apple ML researchers who wanted to use the actual architectural advantages of the M-series chips. Unified memory means CPU and GPU share the same physical RAM with no copies; MLX's lazy evaluation and computation graph were designed around that fact rather than in spite of it. The throughput numbers I see on the same model and the same quantization are 25-40% higher than llama.cpp on identical inputs, and the long-context behavior is markedly better: the KV cache doesn't blow up the way it does on the llama.cpp Metal path past 32K tokens.
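
The shape of the framework is easy to show. A toy sketch of the lazy-evaluation and unified-memory point, using only mlx.core (nothing here is specific to my setup, and the array sizes are arbitrary):

  import mlx.core as mx

  # Two largish arrays. They live in the one unified memory pool;
  # there is no .to("gpu") step and no host/device copy.
  a = mx.random.normal((4096, 4096))
  b = mx.random.normal((4096, 4096))

  # Operations are recorded lazily into a graph, not executed yet.
  c = (a @ b).sum()

  # mx.eval forces the computation (on the GPU by default); the data never
  # moves, because there is only the one pool.
  mx.eval(c)
  print(c.item())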

The Ollama story is shorter. Ollama is great if you want to ship a demo to a non-technical person in five minutes. For my use case, building a real assistant that I want to extend, fine-tune, and eventually wire into a more complex agent stack, Ollama is the wrong abstraction layer. It's a wrapper that hides the parts I need access to.

vLLM I want to like and can't, yet. The CUDA assumption is too deep. The Apple Silicon backend works but the project's center of gravity is still NVIDIA H100s in datacenters, and that shows up in every benchmark, every issue thread, every release note. Maybe in 18 months. Not now.

Performance, concretely

Some numbers from core-01, on a 30B-class instruction-tuned model at 4-bit quantization, short-context chat workload:

  • MLX (mlx-lm with the standard generation loop): 42-48 tokens/sec
  • llama.cpp (Metal backend, same quant): 31-36 tokens/sec
  • Ollama (which is llama.cpp underneath): 30-35 tokens/sec
  • vLLM (Apple Silicon path, experimental): 18-22 tokens/sec, with frequent backend warnings

Long-context (32K input, generating a few hundred tokens):

  • MLX: stays around 28-32 tokens/sec
  • llama.cpp: degrades to 12-15 tokens/sec
  • vLLM: I gave up at the 16K mark; the cache management isn't there

The MLX advantage compounds when the context gets longer because the framework is doing the unified-memory thing properly: there's no shuffling of activations between memory regions, because there are no separate memory regions, just one pool the model can stretch out in. On a 64GB machine that means I can keep a 30B model resident, hold a meaningful KV cache, and still leave room for the rest of what I'm doing.
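
For what it's worth, the harness behind these numbers is nothing exotic. A rough sketch of it, assuming mlx-lm's load and generate keep their current signatures (the repo id below is a placeholder, and argument names have shifted between mlx-lm releases, so check the docs for yours):

  import time
  from mlx_lm import load, generate

  # Placeholder repo id; substitute whichever 4-bit MLX conversion you're testing.
  model, tokenizer = load("mlx-community/SOME-30B-INSTRUCT-4bit")

  prompt = "Summarize the tradeoffs between unified and discrete GPU memory."

  start = time.perf_counter()
  completion = generate(model, tokenizer, prompt=prompt, max_tokens=256)
  elapsed = time.perf_counter() - start

  # Rough tokens/sec over the generated text only. Prompt processing is excluded
  # from the token count but not from the wall clock, so this understates slightly.
  generated = len(tokenizer.encode(completion))
  print(f"{generated / elapsed:.1f} tokens/sec")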

Tools, and the part that surprised me

I expected the tools to be the weak point. They aren't.

mlx-lm is good. The model loader handles HuggingFace checkpoints directly, the generation loop is hackable Python rather than a black-box binary, and the quantization tools (mlx_lm.convert with --quantize) work cleanly across the precision options I've tried (4-bit and 8-bit, mostly). When I want to drop down a layer to write custom kernels or experiment with a different attention implementation, MLX's array API is close enough to NumPy that I can be productive within an hour rather than a week.
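
As a concrete sketch of that workflow, here is roughly what convert-then-load looks like from Python, assuming mlx-lm's convert and load keep their current argument names (the checkpoint name is a placeholder):

  from mlx_lm import convert, load, generate

  # Pull a HuggingFace checkpoint and write a quantized MLX copy to ./mlx_model.
  # This is the same thing the mlx_lm.convert CLI with --quantize does.
  convert("some-org/some-30b-instruct", mlx_path="mlx_model", quantize=True, q_bits=4)

  # Load the converted weights; the tokenizer comes along with them.
  model, tokenizer = load("mlx_model")
  print(generate(model, tokenizer, prompt="Write a docstring for a retry decorator.",
                 max_tokens=128))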

The model availability question, which was the real blocker a year ago, has mostly resolved. The MLX community has been converting and uploading models at a steady clip; for any model that matters to me in 2026 (Llama, Qwen, Mistral, the major code-tuned variants, the smaller multimodal ones) there is a working MLX conversion within days of release. Not as instant as the GGUF tools, but close, and the quality of the conversions has been consistently good.

The other thing I didn't expect: the Swift bindings are real and they matter. If I want to wrap the inference loop into a native macOS app (which I do, eventually), MLX-Swift is a first-class path rather than an afterthought. llama.cpp has Swift bindings too, but again, they feel like a port. MLX-Swift feels like the framework Apple wishes UIKit-era developers had had for ML the whole time.

Where the friction still is

Three things to be honest about.

First, MLX is younger than llama.cpp and the rough edges show. Error messages can be cryptic. Some optimization techniques that are standard on the CUDA side (speculative decoding builds, advanced batching strategies) are present but less mature. The community is smaller, so the StackOverflow-equivalent answer pool is shallower; you end up reading source code more often than you would on a more established stack.

Second, fine-tuning on MLX works, but it's a different operating model from the one the mainstream LoRA tooling assumes. The MLX team has published good fine-tuning examples and the LoRA path on mlx-lm is solid, but if you're used to the peft library and the broader HuggingFace fine-tuning toolchain, you're going to be doing some translation. I wrote more about the training feasibility question separately; the short version is that for my use case (small fine-tunes on personal data, not from-scratch training) MLX is workable.
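
Concretely, the LoRA path I mean runs through the mlx_lm.lora script, with an adapter_path argument at load time afterward. A hedged sketch, since the flag and argument names have moved around between mlx-lm releases (check --help on yours; the model and data paths are placeholders):

  # Fine-tune with LoRA via the bundled script (run from the shell), roughly:
  #   python -m mlx_lm.lora --model mlx_model --train --data ./my_data --iters 600
  #
  # Then load the base model plus the trained adapter for generation:
  from mlx_lm import load, generate

  model, tokenizer = load("mlx_model", adapter_path="adapters")
  print(generate(model, tokenizer, prompt="Draft a commit message for this diff: ...",
                 max_tokens=120))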

Third, the deployment story for production-shaped workloads is not as polished as vLLM on NVIDIA. If I were running a multi-tenant API serving thousands of users, vLLM on a real GPU would still be the answer. But I'm not. I'm running an assistant for me, and the production-shaped concerns don't apply.

The bigger picture

The reason this matters, for me, is that the local-first assistant question is the load-bearing piece of the self-hosted stack I'm building toward. If the model can run on hardware I own, against data that never leaves my machine, with response times that are good enough for real interactive work, then the case for routing every query through somebody else's API gets a lot weaker. I have written before about Apple's neural engines and the PC industry's attempt to catch up; MLX is the software side of that hardware advantage actually paying off.

The state of the art is still in datacenters. I am not pretending otherwise. The frontier models that ship from the big labs are not going to fit on core-01, and for the work that needs the frontier I'll still use the APIs. But the gap between "frontier" and "good enough for 80% of what I want a local assistant to do" has closed faster than I expected. A 30B-class model on MLX is genuinely useful, not as a toy but as a working tool that drafts code, summarizes documents, holds a conversation across a long context window, and does it on hardware that lives on my desk.

That's the path. MLX is what makes it real.

Where I'd push back on myself

If you are starting from zero and just want a working local model on your Mac this weekend, install Ollama and call it done. The MLX argument I'm making here is for the case where you want to extend, customize, and build on top of the inference stack rather than just consume it. For the consume-only case, Ollama is fine and arguably better, because the path-to-first-token is shorter.

But if you're building something (and I am), MLX is where the leverage is. The framework rewards the time you put into understanding it. llama.cpp gives you a fast plateau; MLX gives you a steeper learning curve and a higher ceiling. For a project that I expect to be running and evolving for years, the higher ceiling is the right trade.

That's where I am. core-01 is running MLX. The assistant is taking shape. The path is working.