The model that works on your laptop: March 2026 edition
The recurring laptop-and-Studio model rundown, March 2026 edition. Llama 4, Qwen 3, Mistral 3, Kimi K2 Thinking, and the small distills: what runs where, what it's actually good for, and what I'd reach for first.
Time for the recurring "what's the best local model right now" piece, March 2026 edition. The model release cadence keeps moving fast enough that any answer I gave six months ago is now wrong in interesting ways. Worth doing the rundown again on the same hardware I actually use day-to-day, with the same workloads I actually run, instead of pretending I have a benchmark lab.
The setup, for context. The laptop is a MacBook Pro M4 Pro 48GB, laptop-01 in my notation, the machine I take to coffee shops and the one that does most of my interactive work. The Studio is a Mac Studio M4 Max 64GB, core-01, the always-on machine that sits in the office and runs the heavier jobs. There's also a Mac mini M4 16GB (node-01) that does sequential utility work (TTS, STT, OCR, embeddings) and isn't really part of this conversation. That's the rig.
What's actually shipping right now
The model menu changed shape since the small-models tour I did in November. Quick state of the union:
Llama 4 family. Meta shipped 8B, 70B, and a 400B sparse-MoE flagship in late January. The 8B is the new go-to small model for most general tasks. The 70B is the new mid-tier for anyone with the RAM. The 400B is a data-center model dressed up in open-weights clothing: interesting, but not relevant for laptop work.
Qwen 3. The 7B, 14B, 32B, and 72B all landed by February. Qwen 3 is genuinely the strongest open-weights coding model right now in the small-to-mid range. The 14B in particular punches above its weight on code tasks and runs comfortably on the laptop.
Mistral 3. Smaller release than the headline cycle suggests. The 22B Small and the 8x22B mixture model are both worth running. Mistral's strength continues to be reasoning density per parameter: they make small models that think well. The 22B is my new daily driver for general writing-adjacent work.
Kimi K2 Thinking. Moonshot's reasoning-focused beast. The full version is a ~1T-parameter MoE that's not running on anything I own; the 32B distill that came out in February is a genuinely interesting reasoning model that fits on the Studio with quantization. Worth a look if you're doing anything that benefits from chain-of-thought style work.
Small distills, broadly. The distillation cycle from Kimi, DeepSeek, and the Llama 4 family has produced a wave of 3B-to-8B models that are weirdly capable for their size. The Qwen3-3B distill in particular is the model I'm reaching for on the laptop when I want fast iteration and low latency over peak quality.
What runs on the laptop (M4 Pro 48GB)
The 48GB ceiling on laptop-01 is real but generous. With OS overhead and whatever else I'm running, I have ~36GB of usable model RAM in practice. That puts a hard cap somewhere around a 32B model at 4-bit quantization, with breathing room for context.
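For the back-of-envelope version: 4-bit weights are roughly half a byte per parameter, plus headroom for the KV cache and runtime. A minimal sketch of the arithmetic, where the 1.2x overhead factor is my own rule of thumb rather than anything measured:

```python
def model_ram_gb(params_billion: float, bits: int = 4, overhead: float = 1.2) -> float:
    """Rough footprint: quantized weights padded ~20% for KV cache,
    activations, and runtime overhead (the 1.2x is an assumption)."""
    return params_billion * (bits / 8) * overhead  # 1B params at 4-bit ~= 0.5 GB

for size in (8, 14, 22, 32, 70):
    print(f"{size}B @ 4-bit: ~{model_ram_gb(size):.0f} GB")
# 32B comes out around 19 GB -- fits in ~36 GB with context room.
# 70B comes out around 42 GB -- doesn't.
```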
What I actually run, in rough order of frequency:
Mistral 3 22B at 4-bit (~14GB). The default. General writing, drafting, reformatting, summarization, conversational. Fast enough to feel responsive on MLX, smart enough to not embarrass itself. This is the model the laptop opens to; the mlx-lm sketch after this list shows what that looks like.
Qwen 3 14B at 4-bit (~9GB). Code work. Faster than the 32B, smart enough for ~80% of the editor-adjacent code tasks I throw at a local model. The 32B is better; the 14B is faster and the latency matters more on a laptop.
Llama 4 8B at 4-bit (~5GB). The fast utility model. Quick reformulations, short rewrites, anything where I want sub-second time-to-first-token. Pairs well with the local DSPy work I've been doing: small model for the optimizer, larger model for the final pass (sketched below).
Qwen 3 32B at 4-bit (~19GB). When I need the laptop to do actual heavy lifting and I don't mind it sounding like a hairdryer. Coding agents that need real reasoning, longer-form drafting, anything I'd otherwise wait to get to the Studio for.
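For anyone setting this up fresh, the laptop stack is just mlx-lm. A minimal sketch of what "the model the laptop opens to" means in practice; the repo path is a placeholder for whichever 4-bit MLX conversion you actually pull, not a specific published checkpoint:

```python
# pip install mlx-lm
from mlx_lm import load, generate

# Hypothetical repo path -- substitute the 4-bit conversion you actually use.
model, tokenizer = load("mlx-community/Mistral-3-Small-22B-4bit")

messages = [{"role": "user", "content": "Tighten this paragraph without losing the argument: ..."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

print(generate(model, tokenizer, prompt=prompt, max_tokens=512))
```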
What I don't bother with on the laptop: anything 70B+. The quantization compromises required to fit a 70B at 48GB make it worse than a properly-quantized 32B in practice. The Studio handles those.
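On the DSPy pairing mentioned in the Llama 4 8B entry: the split is a cheap model inside the optimization loop, a bigger model as the one the compiled program actually targets. A sketch under my setup's assumptions, with hypothetical ports and server-side model names; this is the shape, not my exact harness:

```python
import dspy

# Two local OpenAI-compatible endpoints (ports and names are placeholders).
small = dspy.LM("openai/llama-4-8b", api_base="http://localhost:8080/v1", api_key="local")
large = dspy.LM("openai/qwen3-32b", api_base="http://localhost:8081/v1", api_key="local")

dspy.configure(lm=large)  # the model that serves the final, compiled program

program = dspy.ChainOfThought("question -> answer")

def exact_match(example, pred, trace=None):
    return example.answer.strip().lower() == pred.answer.strip().lower()

# The pairing: the 8B proposes and scores candidate prompts cheaply,
# the 32B is the model the optimized program runs on.
optimizer = dspy.MIPROv2(metric=exact_match, prompt_model=small,
                         task_model=large, auto="light")
# compiled = optimizer.compile(program, trainset=trainset)  # trainset: list of dspy.Example
```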
What runs on the Studio (M4 Max 64GB)
core-01 has more headroom but it's also doing other things: RAG indexes, the agent fleet, whatever's running in the background. Usable model RAM is around ~50GB on a good day. That puts a practical ceiling at 70B at 4-bit, which is the regime worth living in.
Llama 4 70B at 4-bit (~40GB). The new heavyweight default. This is what the Studio runs when I want a model that's actually competitive with hosted frontier models on reasoning-adjacent tasks. Slow (8-12 tokens/sec depending on context) but the quality is the highest I get locally.
Qwen 3 72B at 4-bit (~42GB). The coding-specific heavy. When I'm asking a local model to do work that I'd otherwise send to a hosted Claude or GPT, this is the one. Genuinely capable on long-context code work; this is the local-only path to non-trivial coding agents.
Kimi K2 Thinking 32B distill at 4-bit (~19GB). The reasoning specialist. When I have a hard problem that benefits from explicit chain-of-thought, this is the one I reach for. Not as fast as the 70B but the reasoning quality on math/logic/structured-thinking work is noticeably better.
Mistral 8x22B at 4-bit (~80GB ⚠). Doesn't actually fit cleanly. I've run it with aggressive quantization and offloading and it works but it's painful. Listed for completeness; I don't actually use it.
What I reach for, by workload
The mapping I've settled into, based on three months of daily use (a code-shaped version follows the list):
Quick-iteration interactive work (drafting, reformulating, conversation): Mistral 3 22B on the laptop. Fast, smart enough, doesn't make me wait.
Code editing assistance (inline completions, small refactors, docstrings): Qwen 3 14B on the laptop. Fast first-token, good enough quality, low context-switching cost.
Code agents and longer code work (multi-file changes, agentic loops, anything that needs to maintain state): Qwen 3 72B on the Studio. Slower per token, but the per-task quality justifies the wait when the agent is working autonomously.
Reasoning-heavy work (planning, structured analysis, anything that benefits from think-step-by-step): Kimi K2 Thinking 32B distill on the Studio. The chain-of-thought training shows up in the outputs.
General-purpose heavy (anything I'd otherwise hand to a hosted frontier model): Llama 4 70B on the Studio. The closest thing I have to a local generalist that can hold its own.
Background utility (TTS, STT, OCR, embeddings, anything that fits the sequential-utility pattern): node-01 with the small specialized models that have been there for a year.
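In my tooling this mapping is literally a lookup table, and it amounts to the sketch below; the model identifiers are hypothetical shorthand standing in for the actual quant names:

```python
from typing import NamedTuple

class Route(NamedTuple):
    host: str
    model: str

# Workload -> (machine, model), straight from the mapping above.
ROUTES: dict[str, Route] = {
    "interactive":   Route("laptop-01", "mistral-3-22b-4bit"),
    "code-edit":     Route("laptop-01", "qwen3-14b-4bit"),
    "code-agent":    Route("core-01",   "qwen3-72b-4bit"),
    "reasoning":     Route("core-01",   "kimi-k2-thinking-32b-4bit"),
    "general-heavy": Route("core-01",   "llama-4-70b-4bit"),
    "utility":       Route("node-01",   "small-specialized-models"),
}

def route(workload: str) -> Route:
    # Unknown workloads fall through to the fast interactive path.
    return ROUTES.get(workload, ROUTES["interactive"])
```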
What changed since November
Three things stand out, comparing now to the November small-models piece:
The mid-tier got dramatically better. The 14B-to-22B band is where the most interesting work happened over the last three months. Models in that range now do work I would have needed a 70B for in mid-2025. The implication for laptop-only users is real: the 48GB MBP is now genuinely capable, where in mid-2025 it was capable-with-caveats.
MoE went from interesting to useful. The MLX work plus the maturation of MoE inference on Apple Silicon means the sparse-MoE models are actually viable now where six months ago they were demos. The 8x22B Mistral and the smaller Llama 4 MoE variants are real options if you have the RAM.
Reasoning-trained models became a category. Kimi's K2 Thinking, the DeepSeek R-series successors, the various Llama 4 reasoning variants, these are now a distinct workload class, not a curiosity. The mapping above reflects that; six months ago I would have collapsed reasoning into general-purpose.
The honest summary
If you have an M4 Pro 48GB laptop, run Mistral 3 22B as your default and Qwen 3 14B for code. Add Llama 4 8B for the fast-utility cases. That's your kit; it covers most of what you'd want a local model for and it leaves room to run the rest of your machine.
If you have an M4 Max 64GB Studio (or equivalent), Llama 4 70B is the new general-purpose heavy, Qwen 3 72B is the coding-specific heavy, and Kimi K2 Thinking 32B distill is the reasoning specialist. That's the trio; pick by workload.
The foundation I've been writing about for two years keeps maturing on the same trajectory it's been on. Apple Silicon plus open-weights plus MLX-and-llama.cpp tooling delivers more per dollar of hardware every six months. The model menu in March 2026 is meaningfully better than the one in November, which was meaningfully better than the one in May. The trend continues; the next checkpoint is probably September, when the post-Llama-4 cycle will have produced its own next wave.
I'll do this rundown again then. The shape of the answer will be different; the shape of the question (what runs where, for what, on the hardware you actually own) stays the same. That's the part that matters.