The LLaMA leak and what "your model on your machine" could actually mean

Meta released LLaMA to researchers on February 24th. A week later the weights were on 4chan. The rest of 2023 will be shaped by what happens next.

Meta released LLaMA to researchers on February 24th, a request-form, accept-the-license, weights-via-link kind of release. A week later the entire weights bundle was sitting on 4chan as a torrent. Within days, people were running the smaller checkpoints on M1 MacBooks, on gaming PCs, on a Raspberry Pi as a stunt. None of it was supposed to happen this quickly.

The leak itself is a footnote in the broader story. The story is that as of this week, "running a competent large language model on your own hardware" went from "research lab project" to "weekend tinker job." That gap mattered for a long time and now it doesn't. The question worth asking is what that actually changes, and where the friction still is.

What's in the box

LLaMA comes in four sizes: 7B, 13B, 33B, and 65B parameters. The 7B and 13B variants run on consumer hardware with some patience and a CPU-friendly inference layer (llama.cpp showed up almost immediately and is the practical way most people are running it). The 33B needs a serious GPU or aggressive quantization. The 65B needs hardware that most individuals don't have.
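For a rough sense of why the hardware lines fall where they do, the back-of-envelope math is just parameter count times bytes per weight, plus some headroom for the KV cache and activations. A minimal sketch; the bytes-per-weight figures for the quantized formats are approximations, and the 20% overhead factor is an assumption, not a measurement:

```python
# Back-of-envelope memory footprint for each LLaMA size.
# Quantized bytes-per-weight values are approximate, and the 20%
# overhead factor for KV cache / activations is an assumption.

SIZES_B = {"7B": 7, "13B": 13, "33B": 33, "65B": 65}

BYTES_PER_WEIGHT = {
    "f16": 2.0,    # full half-precision checkpoint
    "q8_0": 1.0,   # ~8-bit quantization (approximate)
    "q4_0": 0.5,   # ~4-bit quantization (approximate)
}

OVERHEAD = 1.2  # assumed headroom for KV cache and activations

for name, params_b in SIZES_B.items():
    cells = []
    for fmt, bpw in BYTES_PER_WEIGHT.items():
        gib = params_b * 1e9 * bpw * OVERHEAD / 2**30
        cells.append(f"{fmt}: {gib:5.1f} GiB")
    print(f"LLaMA-{name:>3}  " + "  ".join(cells))
```

The 4-bit 7B landing around 4 GiB is why it fits in a laptop's RAM; the 65B at any precision is why it doesn't fit in most people's machines.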

The capability story is more nuanced than the headlines. The 65B variant benchmarks competitively with the original GPT-3 on a lot of academic evaluations. That's GPT-3, not GPT-3.5, and not the model behind ChatGPT. The 7B and 13B variants are usable for some tasks (summarization, simple Q&A, short generation) but noticeably weaker than ChatGPT for anything requiring multi-step reasoning or sustained coherence. None of the variants are instruction-tuned out of the box. You can have a conversation with the base model, but it's the kind of conversation where you're doing a lot of prompt-engineering work to get coherent outputs.
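Concretely, "prompt-engineering work" with a base model mostly means few-shot scaffolding: you show the model the shape of the exchange and let it continue the pattern. A minimal sketch, here shelling out to the llama.cpp main binary; the paths and flags match the project at time of writing and are illustrative, not stable:

```python
import subprocess

# A base model has no chat training, so we scaffold the exchange:
# two completed Q/A pairs, then the real question, and let the
# model continue the pattern.
FEW_SHOT = """\
Q: What is the capital of France?
A: Paris.

Q: What does "LLM" stand for?
A: Large language model.

Q: {question}
A:"""

def ask(question: str) -> str:
    prompt = FEW_SHOT.format(question=question)
    # Binary path, model path, and flags below are illustrative;
    # they match llama.cpp this week and will drift.
    result = subprocess.run(
        ["./main", "-m", "models/7B/ggml-model-q4_0.bin",
         "-p", prompt, "-n", "64"],
        capture_output=True, text=True, check=True,
    )
    # llama.cpp echoes the prompt; keep only the continuation,
    # cut off where the model starts inventing the next "Q:".
    return result.stdout[len(prompt):].split("\nQ:")[0].strip()

print(ask("Why is the sky blue?"))
```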

So: not a ChatGPT replacement. Not yet. But the foundation is now there for a lot of people who didn't have it.

Why this matters more than the benchmarks suggest

The benchmark gap will close, partially, with fine-tuning. Academic groups are moving on this already. The early word from a few different research labs is that you can take LLaMA-7B, fine-tune it on a small set of instruction-following examples (generated cheaply by querying ChatGPT, in some recipes that are starting to circulate), and produce something that behaves much more like ChatGPT for general-purpose chat. The cost of those fine-tunes is reportedly in the hundreds of dollars. Generating the training data reportedly costs about the same.
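The recipes circulating all have roughly the same shape: tens of thousands of (instruction, response) records, rendered through a fixed prose template before fine-tuning. A sketch of what one record and the template look like, modeled on the instruction/input/output convention those recipes use; the exact field names and preamble wording are the circulating convention, not a standard:

```python
# One training record in the circulating instruction-tuning format:
# an instruction, optional input context, and the desired response
# (in these recipes, generated by querying a stronger model).
record = {
    "instruction": "Summarize the following paragraph in one sentence.",
    "input": "Meta released LLaMA to researchers in late February...",
    "output": "LLaMA was released to researchers and leaked shortly after.",
}

# The text the model is actually trained on: records are rendered
# into a fixed frame so the model learns the instruction format.
TEMPLATE = """\
Below is an instruction that describes a task, paired with an input \
that provides further context. Write a response that completes the request.

### Instruction:
{instruction}

### Input:
{input}

### Response:
{output}"""

print(TEMPLATE.format(**record))
```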

That's the part that should make people sit up. Not "the leaked model is good," because by ChatGPT standards it isn't. The part that matters is "instruction-tuning a leaked open base model now costs less than a decent laptop." The cost of producing a competent assistant has fallen by about three orders of magnitude in the days since the leak.

Whether that's a good or bad thing depends on what you're optimizing for. From a research perspective, it democratizes work that was bottlenecked on access. From a safety perspective, it removes whatever guardrails the model providers were enforcing. From the perspective of a small shop that wants its own assistant for its own data, it's the difference between "interesting in theory" and "you should probably have one by year-end."

What "your model on your machine" actually buys you

I've been chewing on what changes if I run a usable assistant locally instead of through an API, a question that didn't have a real answer a few weeks ago and now does. Three things, mostly:

Privacy by default, not as a contract clause. When the model runs on your hardware, the conversation never leaves the room. For a lot of professional use cases, anything involving customer data, anything involving PII, anything covered by an NDA, that's not a nice-to-have. It's the difference between "we can use this" and "we cannot use this."

No rate limits, no API keys, no per-token math. The cost model becomes electricity and depreciation. Your usage stops being something you have to budget. People underestimate how much that changes the kinds of things you'll try when there's no marginal cost per query; there's rough math on this just below.

You can fine-tune on your own data without sending it anywhere. Fine-tuning a closed model means handing your training corpus to the provider. Fine-tuning a local model is just a pile of GPU hours on your own hardware. The legal posture is completely different.
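Rough math on the marginal-cost point above. Hosted API pricing at the moment is around $0.002 per thousand tokens for gpt-3.5-turbo; the local equivalent is whatever your machine draws while generating. The throughput, wattage, and electricity figures below are loose assumptions for illustration:

```python
# Marginal cost per query: hosted API vs. local inference.
# All numbers are assumptions for illustration, not measurements.

TOKENS_PER_QUERY = 1_500    # prompt + completion, assumed
API_PRICE_PER_1K = 0.002    # gpt-3.5-turbo pricing, March 2023

WATTS_LOCAL = 60            # assumed laptop draw under load
TOKENS_PER_SEC = 10         # assumed 7B q4_0 throughput on CPU
PRICE_PER_KWH = 0.15        # assumed electricity price, USD

api_cost = TOKENS_PER_QUERY / 1000 * API_PRICE_PER_1K
seconds = TOKENS_PER_QUERY / TOKENS_PER_SEC
local_cost = WATTS_LOCAL / 1000 * (seconds / 3600) * PRICE_PER_KWH

print(f"API:   ${api_cost:.4f} per query")
print(f"Local: ${local_cost:.6f} per query")
```

The point isn't the exact ratio, which depends entirely on the assumptions; it's that the local number is small enough that you stop thinking about it.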

What it doesn't buy you, today: capability parity with the frontier models. If you need GPT-3.5-quality reasoning on hard problems, the open-weights ecosystem isn't there yet.

Where the friction still is

If you tried to set up a local LLaMA install this weekend, the friction would surprise you. The pieces work; the integration doesn't.

You need to acquire the weights (the leaked torrent or, if you're being legitimate, a research request that may or may not get approved). You need to get them into a format your inference layer can read (ggml for llama.cpp, the original PyTorch checkpoints for native GPU inference). You need to pick an inference layer (llama.cpp is the leader for CPU; for GPU you're picking between text-generation-webui, raw transformers, and a half-dozen experimental projects). You need to figure out quantization (q4_0 is fast and dumb, q4_1 is slightly slower and slightly less dumb, full f16 is slow and accurate). You need to decide whether to use a base model or one of the early instruction-tuned forks. None of this is documented in one place. All of it is moving daily.
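For a concrete picture, here's the convert-and-quantize leg of that pipeline as a Python wrapper around the llama.cpp tooling. The script names and arguments match the repo at time of writing and are exactly the kind of thing that changes daily, so treat this as a snapshot, not a recipe:

```python
import subprocess
from pathlib import Path

MODEL_DIR = Path("models/7B")  # original PyTorch checkpoint lives here

def run(cmd: list[str]) -> None:
    print("$", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Step 1: convert the PyTorch checkpoint to ggml f16.
# Script name and arguments match llama.cpp at time of writing.
run(["python", "convert-pth-to-ggml.py", str(MODEL_DIR), "1"])

# Step 2: quantize f16 -> q4_0 (the "fast and dumb" option).
run(["./quantize",
     str(MODEL_DIR / "ggml-model-f16.bin"),
     str(MODEL_DIR / "ggml-model-q4_0.bin"),
     "2"])  # "2" selects q4_0 in the current quantize binary

# Step 3: smoke-test the quantized model.
run(["./main", "-m", str(MODEL_DIR / "ggml-model-q4_0.bin"),
     "-p", "The capital of France is", "-n", "16"])
```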

In two months, half of this friction will be gone; there's already a cottage industry forming around making the install one command. In six months, none of it will be left. The trajectory is clear; the pace is the surprising part.

Where this goes

Three things to watch over the rest of 2023:

Whether instruction-tuning becomes a one-button operation. The Alpaca recipe is a template. The next step is tooling that lets a non-researcher take a base model, point it at their own corpus, and produce a useful assistant; a sketch of today's research-grade pieces follows this list. That tool doesn't exist yet. It will.

Whether bigger open models land. LLaMA-65B is the current ceiling. There are credible rumors of 100B-class open releases later this year. If those land with permissive enough licensing, the capability gap with closed frontier models tightens significantly.

Whether the legal layer keeps up. The LLaMA license is research-only. The leaked weights are being used in ways that license doesn't allow. The first commercial product built on top of leaked weights is going to test what enforceability means when the source is fundamentally a torrent. Nobody knows how that goes.
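On the first of those points: the pieces for point-and-shoot instruction tuning already exist in research form. A sketch of what they look like wired together, using the Hugging Face LLaMA port plus the peft library's LoRA adapters. The class names match those libraries as the port is landing (and are moving targets), the local weights path is an assumption, and the one-button version of this is exactly the tooling that doesn't exist yet:

```python
from transformers import LlamaForCausalLM, LlamaTokenizer
from peft import LoraConfig, get_peft_model

# Load the base model (assumes locally converted HF-format weights;
# the path is illustrative).
model = LlamaForCausalLM.from_pretrained("./llama-7b-hf")
tokenizer = LlamaTokenizer.from_pretrained("./llama-7b-hf")

# LoRA: freeze the base weights and train small low-rank adapter
# matrices on the attention projections instead. This is what makes
# fine-tuning a 7B model feasible on a single consumer GPU.
config = LoraConfig(
    r=8,                                  # adapter rank
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # attention projections
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()
# Typically well under 1% of the weights end up trainable, which is
# a big part of why the reported fine-tune costs land in the
# hundreds of dollars rather than the millions.
```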

The thing I keep coming back to is the second-order effect. A year ago, the question "should I run a model locally for this use case" wasn't a real question; there was no model worth running. As of this week, it's a real question with a non-trivial answer. The category exists now. Most of what gets built in that category over the next year will be ugly and rough. Some of it will matter. Worth paying attention to which.