Self-hosting open weights at home: what 2025 actually requires

The open-weights frontier is genuinely usable from a home rig now. The bottleneck has moved: it isn't capability anymore, it's memory bandwidth and patience.

[Image: an open computer tower with a glowing GPU and a holographic neural-network diagram floating above it on a home desk]

The "your model on your machine" pitch was a thought experiment in 2023 when the LLaMA weights first leaked, and a workable curiosity by mid-2023 once Llama 2 made the licensing real. In early 2025 it is (for a meaningful set of workloads) a serious option. Not for everything. Not always cheaper. But the gap between "the frontier" and "what runs on my desk" is narrower than it has been at any point since GPT-3 made the question interesting.

Worth walking through what's actually buildable today, what the real bottlenecks are, and where self-hosting still doesn't make sense.

What "frontier" means at home in early 2025

The honest answer is: not the largest open-weights models. DeepSeek-V3 is a 671B-parameter mixture-of-experts model. R1 inherits the same shape. At BF16 the weights alone are well over a terabyte; at INT4 they're still ~340GB. You can build a server that holds it, but "build a server" is not "self-hosting at home" for most people.
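
To put numbers on that, the back-of-envelope version is just parameter count times bytes per parameter, weights only; a minimal sketch:

    # Weight-only memory estimate; KV cache, activations, and runtime
    # overhead are not included, so real requirements sit above these numbers.
    PARAMS = 671e9  # DeepSeek-V3 / R1 parameter count

    def weight_gb(params: float, bits_per_param: float) -> float:
        """Raw weight storage in GB for a given precision."""
        return params * bits_per_param / 8 / 1e9

    print(f"BF16: {weight_gb(PARAMS, 16):,.0f} GB")  # ~1,342 GB
    print(f"INT4: {weight_gb(PARAMS, 4):,.0f} GB")   # ~336 GB, before quantization metadata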

What you can run, comfortably, on hardware that fits under a desk:

  • Llama 3.3 70B, Meta's December 2024 release, is the practical top end. At INT4 quantization it fits in ~40GB and runs at single-digit to low-teens tokens per second on a single high-end consumer GPU or an Apple-Silicon machine with enough unified memory (a rough bandwidth-based estimate of why follows this list).
  • The DeepSeek-R1 distilled variants (released alongside R1) give you reasoning-model behavior in 7B / 14B / 32B / 70B sizes. The 32B distill on Qwen-2.5 is the sweet spot: it does enough chain-of-thought that you feel the difference, and it fits on a 24GB card.
  • Qwen-2.5 72B, Mistral-Large 2 at smaller quantizations, and the Gemma 2 lineup all sit in the same range.
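
A side note on where those tokens-per-second figures come from: single-stream decoding has to pull (roughly) every active weight across the memory bus for each new token, so memory bandwidth divided by model size gives a hard ceiling. A rough sketch, with bandwidth numbers that are ballpark assumptions rather than spec-sheet values:

    # Decode-speed ceiling: each generated token streams (roughly) every
    # active weight through the memory bus, so tok/s is bounded above by
    # bandwidth / model size. Real throughput lands below this once
    # KV-cache reads and kernel overhead are counted.
    def tps_ceiling(bandwidth_gb_s: float, model_gb: float) -> float:
        return bandwidth_gb_s / model_gb

    # 70B-class model at INT4 (~40 GB) on ~800 GB/s of unified memory:
    print(tps_ceiling(800, 40))   # ~20 tok/s ceiling, teens in practice

    # 32B distill at INT4 (~18 GB) on a ~1,000 GB/s consumer card:
    print(tps_ceiling(1000, 18))  # ~55 tok/s ceiling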

That's a real catalog. None of those existed two years ago. The question of what to actually do with the catalog hasn't gotten easier, but the catalog itself has.

The hardware question, plainly

Two paths cover most of the home self-hosters I know in early 2025:

The Apple-Silicon path. A Mac Studio (M2 Ultra) or a Mac mini (M4 Pro) with enough unified memory (64GB at the low end, 192GB at the high end of what Apple ships in early 2025) runs the 70B-class models acceptably via MLX or llama.cpp's Metal backend. Tokens per second isn't as good as on a top-end NVIDIA card, but the unified-memory architecture means you don't have to fight the model into a smaller card. You also get the energy and noise profile of a Mac, which is its own thing if the machine sits in your office.

The NVIDIA path. A single RTX 4090 (24GB) handles 32B-class reasoning distills well and 70B-class with aggressive quantization. The newer RTX 5090 (released late January 2025, 32GB) widens the runway. Two-card builds get you into 70B at higher precision, but at that point you're pricing out a serious workstation and the diminishing returns kick in fast for personal use. Power and cooling become real considerations; a 5090 under load is a small space heater.
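
The space-heater line is close to literal, since essentially all of a GPU's board power ends up as heat in the room. A rough sketch, with the wattage and electricity price as assumptions:

    # Heat and electricity back-of-envelope; the wattage is an assumption,
    # not a measured figure for any particular card.
    WATTS = 575            # assumed sustained board power under load
    HOURS_PER_DAY = 4      # hypothetical usage
    PRICE_PER_KWH = 0.15   # assumed electricity price, USD

    print(WATTS * 3.412)                                 # ~1,960 BTU/h dumped into the room
    print(WATTS * HOURS_PER_DAY / 1000 * PRICE_PER_KWH)  # ~$0.35/day at those numbers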

A simplified read on which path makes sense for which workload:

[Figure: workload-to-model-to-hardware decision tree for self-hosted open-weights models in 2025: chat workloads use 8-14B distills on any 24GB GPU; reasoning workloads use 32B distills; serious local work uses 70B-class models on M-series 64GB+ or two-GPU rigs]
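
The same logic, restated as code in case you want it in a script; the categories and cutoffs are exactly the figure's, not hard rules:

    # The figure's decision tree, restated; categories and cutoffs are the
    # article's, not hard rules.
    def pick_setup(workload: str) -> str:
        if workload == "chat":
            return "8-14B distill on any 24GB GPU"
        if workload == "reasoning":
            return "32B R1 distill, which fits a 24GB card at INT4"
        if workload == "serious local work":
            return "70B-class model on 64GB+ M-series unified memory, or a two-GPU rig"
        raise ValueError(f"unmapped workload: {workload}")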

The software stack is mostly settled

Two years ago this section would have been a survey of half a dozen runtimes, none of which were stable. In early 2025 the picks are clearer:

  • Ollama for the "I want it to work in five minutes" path. Ships with a model registry, has a clean CLI, exposes an OpenAI-compatible API surface (see the sketch after this list). Good enough that it's become most people's default first install.
  • LM Studio for the "I want a GUI and to play with quantization settings" path. Useful for the "what does Q4_K_M actually feel like vs Q5" experimentation that you don't want to do in a script.
  • llama.cpp directly when you need to embed inference in something else, or when you need quantization formats Ollama doesn't expose.
  • vLLM if you've got a real GPU and want batch throughput. Not a casual install, but the right answer if you're actually serving multiple users or doing structured-output work at scale.
  • MLX on the Apple-Silicon side, both as a Python framework and as the engine behind a growing set of pre-converted model weights on Hugging Face.
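
One concrete payoff of the OpenAI-compatible surface is that anything already speaking the OpenAI client can be pointed at the local box. A minimal sketch, assuming Ollama is running on its default port and the model has been pulled:

    # Talking to Ollama through its OpenAI-compatible endpoint. Assumes the
    # daemon is on its default port (11434) and `ollama pull llama3.3:70b`
    # has already run.
    from openai import OpenAI

    client = OpenAI(
        base_url="http://localhost:11434/v1",
        api_key="ollama",  # the client requires a key; Ollama ignores it
    )

    resp = client.chat.completions.create(
        model="llama3.3:70b",
        messages=[{"role": "user", "content": "Three bullets on what you can run locally."}],
        temperature=0.2,
    )
    print(resp.choices[0].message.content)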

The thing nobody warns you about is that switching between these for the same model gives you noticeably different output: different default sampling parameters, different tokenizer behavior at edge cases, different quantization implementations of "the same" Q4. If you're going to evaluate self-hosted models seriously, pin a single runtime and a single quantization for the comparison, or you'll spend a week chasing differences that aren't about the model.
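
In practice that means a small harness where the runtime, the quantization (baked into the model tag), and the sampling parameters are all pinned and only the model varies. A sketch against the same Ollama endpoint as above, with placeholder model tags and prompts:

    # Pin the runtime, the quantization (baked into the model tag), and the
    # sampling parameters; vary only the model. Tags and prompts here are
    # illustrative placeholders.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

    MODELS = ["llama3.3:70b-instruct-q4_K_M", "qwen2.5:72b-instruct-q4_K_M"]
    PROMPTS = [
        "Explain KV-cache reuse in two sentences.",
        "Write a one-line docstring for a retry decorator.",
    ]
    SAMPLING = {"temperature": 0.0, "top_p": 1.0, "max_tokens": 512}

    for model in MODELS:
        for prompt in PROMPTS:
            resp = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
                **SAMPLING,
            )
            print(f"--- {model} ---\n{resp.choices[0].message.content}\n")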

Where it doesn't make sense

Two cases I'd specifically call out:

  • If your workload needs the absolute frontier of capability, the closed-frontier shops are still ahead on the very hardest reasoning, the largest contexts, and the most polished tool-use behavior. The gap is narrower than it was in 2023 (narrower than it was in 2024) but at the very top end it's still real. If you're using AI for something where being wrong matters and you need every percentage point of capability, hosted-frontier is still the right call.
  • If your workload is bursty and small, the math on a $0.55-per-million-tokens DeepSeek R1 inference call is hard to beat with home hardware you have to amortize (rough break-even numbers in the sketch after this list). Hosting locally is for people who use the model a lot, who care about data residency, or who care about the sovereignty of the workflow itself.
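
The amortization math, roughly, with the hardware cost as an assumed placeholder and electricity left out entirely:

    # Break-even against the article's $0.55-per-million-token figure.
    # Hardware cost is an assumed placeholder; electricity is ignored,
    # which flatters the home box.
    HARDWARE_COST = 2000    # assumed: used 24GB GPU plus the machine around it, USD
    API_PRICE_PER_M = 0.55  # USD per million tokens

    breakeven_m_tokens = HARDWARE_COST / API_PRICE_PER_M        # ~3,636M tokens
    print(f"{breakeven_m_tokens / 1000:.1f}B tokens to break even")
    print(f"{breakeven_m_tokens / 50:.0f} months at 50M tokens/month")  # ~73 months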

The thing that's changed is that "I want to run this at home" is now a defensible answer to "where should I host my AI workflow?" two years after that was an open question. The pieces are real. The ergonomics are tolerable. The bottleneck has moved from capability to bandwidth and patience, which is roughly what self-hosting has always been about.