Why distillation is the most underrated AI pattern of 2025
The headline AI pattern of 2025 is agentic-everything: the keynote energy, the platform investment, the marketing layer. The pattern that’s quietly doing more useful work for actual production AI systems is distillation: taking a big expensive model’s behavior on a specific task and getting most of it out of a much smaller, cheaper, faster model.
Distillation isn’t new. It’s been a known technique since the 2015 Hinton paper. What changed in 2025 is that the open-weights model line caught up enough that “distill from a frontier-tier model into a 7B-class workhorse” became a routine production pattern rather than a research curiosity. Practitioners doing this report outsized cost reductions on workloads where the agentic-everything story isn’t getting them anything. Worth being explicit about where it works and where it doesn’t.
The pattern, briefly
Distillation in the form that matters for 2025 production:
- Identify a specific task your AI system does. Customer-support classification. Document summarization. Code-review-comment generation. Anything where the workload is well-defined.
- Run the task through a frontier-tier model (Opus 4, GPT-5, the o-series) on a representative dataset. Capture the inputs, the outputs, and the reasoning where applicable.
- Use the captured input-output pairs as training data to fine-tune a smaller model on the specific task.
- Replace the frontier-tier model with the distilled smaller model in production for that task. Keep the frontier-tier model as fallback for the cases where the smaller one isn’t confident.
That’s it. The result: most of the quality at a fraction of the cost and latency.
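A minimal sketch of the capture step, assuming an OpenAI-style chat-completions client; the teacher model name, the system prompt, and the JSONL schema are illustrative placeholders, not a prescription:

```python
# Steps 2-3 sketched: run the task through the teacher, capture the
# input/output pairs, and write them as chat-style JSONL training data.
import json
from openai import OpenAI

client = OpenAI()
SYSTEM_PROMPT = "Classify this support ticket into exactly one category: ..."
TEACHER_MODEL = "gpt-5"  # placeholder: any frontier-tier model

def capture(inputs: list[str], out_path: str = "teacher_pairs.jsonl") -> None:
    with open(out_path, "w") as f:
        for text in inputs:
            resp = client.chat.completions.create(
                model=TEACHER_MODEL,
                messages=[
                    {"role": "system", "content": SYSTEM_PROMPT},
                    {"role": "user", "content": text},
                ],
            )
            answer = resp.choices[0].message.content
            # Chat-format JSONL: the shape most fine-tuning stacks
            # (TRL, Axolotl, hosted fine-tuning APIs) accept directly.
            f.write(json.dumps({"messages": [
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": text},
                {"role": "assistant", "content": answer},
            ]}) + "\n")
```

Fine-tune the student on the resulting JSONL with whatever stack you already run; the chat format above is accepted by most of them without conversion.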
What changed in 2025
Three specific things made this practical in a way it wasn’t a year ago:
The frontier-tier models are reliable enough teachers. A year ago, capturing the frontier model’s output for a task gave you noisy training data: the model would hallucinate, contradict itself, and fail in ways that would propagate to the distilled student. The 2025 frontier-tier models (Opus 4, GPT-5, the better o-series variants) are reliable enough on bounded tasks that the captured outputs are usable training data without heavy filtering.
The open-weights workhorse models are good enough students. Llama 4 Scout, Qwen 2.5 32B, the smaller Llama 3 derivatives: these models can be fine-tuned to near-frontier quality on a specific task. The capability gap that used to require keeping the frontier model in production has narrowed enough that the distilled student is genuinely sufficient for many workloads.
The infrastructure for fine-tuning got cheap. Distributed training without a hyperscaler bill covers this: fine-tuning a workhorse model on a few thousand task examples is now a $25-200 operation, not a $5K commitment. The economics of distillation pay back immediately.
Where distillation wins concretely
The workloads where distillation produces meaningful production wins:
High-volume task-bounded inference. Customer-support classification at 100K+ queries/day. Document tagging at scale. Email-routing decisions. The workload is well-defined; the volume justifies the effort; the distilled model handles it at a fraction of the cost.
Latency-sensitive tasks. When the workload needs to respond in under 200ms and the frontier-tier model takes 1-2 seconds, distillation buys you the latency. The smaller model is faster on its own; running it locally instead of round-tripping to the vendor cuts latency further.
Batch processing. Embedding generation, summarization at scale, structured extraction. Workloads that run continuously in the background. The cost-per-run on the distilled model is a tiny fraction of the cost on the frontier-tier; at batch scale, the difference compounds into real money.
Privacy-bound deployments. When the workload can’t go to a hosted frontier model (data sensitivity, residency, compliance), the distilled model running locally on workhorse-tier hardware is what makes the workload possible at all.
Cost-sensitive production paths. Anywhere the conversation-level cost tracking shows a high-volume task spending real money, that task is a distillation candidate.
These are real workloads. The distillation pattern produces 50-90% cost reductions on them with quality preservation in the 90-98% range against the frontier teacher.
Where distillation doesn’t help
A few categories where the pattern doesn’t pay back:
Open-ended tasks. When the workload is “answer arbitrary user queries about anything,” the frontier-tier model’s general capability is what you’re paying for. A distilled student can’t replace that breadth without itself becoming a frontier-tier-sized model.
Tasks where reasoning quality varies query-by-query. When the quality bound is “needs to handle the hard cases right,” the distilled model’s performance on those hard cases is the binding constraint. If it’s noticeably worse there (which is common), the frontier-tier model is the right choice for the task.
Low-volume tasks. When a workload runs a few hundred times a month, the cost of fine-tuning the distilled model exceeds the savings from running it. Under most reasonable assumptions, the break-even is around tens of thousands of queries per month (worked arithmetic after this list).
Tasks that change frequently. When the task definition shifts every few weeks (new categories, new constraints, new failure modes), keeping the distilled model current is its own engineering task. For stable tasks distillation is great; for fluid tasks the frontier-tier model’s flexibility is worth the cost.
These aren’t reasons not to use distillation. They’re reasons to be selective about which workloads it fits.
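The break-even claim is simple arithmetic. A back-of-envelope sketch, with every price an illustrative assumption rather than a vendor quote:

```python
# Break-even sketch for "low-volume tasks." All prices are assumptions.
fine_tune_cost = 150.00           # one-off, mid-range of the $25-200 figure
teacher_cost_per_query = 0.010    # assumed frontier-tier cost per query
student_cost_per_query = 0.001    # assumed distilled-student cost per query

saving_per_query = teacher_cost_per_query - student_cost_per_query
break_even = fine_tune_cost / saving_per_query
print(f"break-even: ~{break_even:,.0f} queries")  # ~16,667

# A task that runs a few hundred times a month takes years to pay back;
# one that runs 100K+ queries/day pays back on day one.
```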
The teacher-student combination
The pattern that’s emerged as production-mature: don’t replace the frontier-tier model with the distilled student. Use them together.
- The distilled student handles 95% of the queries. Fast, cheap, locally-runnable, quality-sufficient.
- The frontier-tier teacher handles the 5% the student isn’t confident on. The student’s confidence score drives the routing.
- The teacher’s output on the hard cases gets captured and added to the next round of training data for the student.
- The student gets retrained periodically against the accumulated hard-case data, gradually narrowing the cases that need to escalate.
The net effect: average cost is mostly the student’s (cheap); hard-case quality is mostly the teacher’s (high); and the system improves over time without manual intervention.
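A sketch of the routing core, under stated assumptions: call_student and call_teacher are hypothetical wrappers around whatever serving you run, the student exposes a confidence score (e.g. max softmax probability over categories), and the 0.9 threshold is a placeholder pending the calibration discussed below:

```python
# Teacher-student routing: student answers when confident, teacher handles
# the rest, and every escalated case is banked for the next retrain.
import json

CONFIDENCE_THRESHOLD = 0.9  # placeholder; calibrate on a held-out set

def call_student(query: str) -> tuple[str, float]:
    """Stub: (answer, confidence) from the distilled model's endpoint."""
    raise NotImplementedError

def call_teacher(query: str) -> str:
    """Stub: the frontier-tier model's answer."""
    raise NotImplementedError

def handle(query: str) -> str:
    answer, confidence = call_student(query)
    if confidence >= CONFIDENCE_THRESHOLD:
        return answer  # the ~95% path: fast and cheap
    teacher_answer = call_teacher(query)  # the ~5% path: slow and right
    # Bank the hard case as training data for the next student retrain.
    with open("hard_cases.jsonl", "a") as f:
        f.write(json.dumps({"input": query, "output": teacher_answer}) + "\n")
    return teacher_answer
```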
What’s missing in the tooling
The distillation pattern works. The tooling for it lags the pattern. A few specific gaps:
Capturing teacher outputs at scale. Most teams build their own pipeline for “run X queries through the frontier model, capture the structured outputs, format as training data.” The pattern is consistent enough that off-the-shelf tooling should exist; it mostly doesn’t.
Quality-gating the captured data. Teacher outputs aren’t all training-quality. Filtering for the high-quality cases (self-consistency, no internal contradictions, no obvious failure modes) is its own engineering project.
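One gate that’s cheap to build, sketched here for classification-style outputs where exact-match agreement makes sense (fuzzier tasks need a similarity measure instead); ask_teacher is a hypothetical wrapper for one sampled teacher completion:

```python
# Self-consistency gate: sample the teacher several times and keep an
# example only when the answers agree. Catches the "contradicts itself"
# failure mode; it won't catch a teacher that is consistently wrong.
from collections import Counter

def ask_teacher(text: str) -> str:
    """Stub: one sampled teacher completion for `text`."""
    raise NotImplementedError

def quality_gate(text: str, samples: int = 3,
                 min_agreement: float = 1.0) -> str | None:
    answers = [ask_teacher(text) for _ in range(samples)]
    top, count = Counter(answers).most_common(1)[0]
    return top if count / samples >= min_agreement else None  # None = drop
```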
The student-confidence calibration. When does the student escalate to the teacher? Naive thresholds work poorly; calibrated thresholds work well. The calibration is currently DIY.
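One workable DIY approach, sketched: replay a held-out set through both models, then pick the lowest threshold at which the student’s accuracy on the queries it would keep meets a target. Exact-match scoring against the teacher is an assumption that fits classification tasks:

```python
# Threshold calibration: find the lowest confidence threshold where the
# student's accuracy on retained queries hits the target, which maximizes
# the share of traffic kept on the cheap path.
def calibrate(records: list[tuple[str, float, str]],
              target_accuracy: float = 0.97) -> float:
    # records: (student_answer, student_confidence, teacher_answer)
    for threshold in sorted({conf for _, conf, _ in records}):
        kept = [(s, t) for s, conf, t in records if conf >= threshold]
        if not kept:
            break
        accuracy = sum(s == t for s, t in kept) / len(kept)
        if accuracy >= target_accuracy:
            return threshold  # lowest qualifying threshold = max coverage
    return 1.0  # no threshold qualifies: escalate everything
```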
The continuous improvement loop. “Capture, retrain, deploy, repeat” should be automatable. It mostly isn’t.
These gaps are addressable. The teams that build the missing tooling internally have a real advantage; the off-the-shelf tooling is two or three quarters away from being competitive with internal builds.
What I’d recommend
For teams running production AI workloads in late 2025:
- Audit your high-volume tasks for distillation candidates. The conversation-level cost tracking surfaces them. Pick the top 3-5; these are the ones where distillation produces meaningful savings.
- Don’t try to distill everything. The pattern is for bounded tasks; trying to distill open-ended workloads doesn’t work and burns time.
- Build the teacher-student pattern, not just the student. Replacement is brittle; teacher-as-fallback is reliable.
- Plan for periodic retraining. A distilled student that ships once and never updates is a context-drift problem waiting to happen.
- Use the cheap-fine-tuning infrastructure. A few hundred dollars per task per quarter is the realistic budget; that’s gettable on neoclouds without hyperscaler procurement.
The agentic-everything pattern is the loud story of 2025. The distillation pattern is the quiet story doing more for actual production economics. Worth being deliberate about it.
The teams that pair both (agentic systems for the workflows that need them, distilled task-specific models for the bounded high-volume work) outperform the teams that pick one pattern. Most production stacks should have both. Most production stacks have one. Worth fixing.