DSPy in real life: lessons from production

DSPy is the framework that’s most likely to look obvious in retrospect: “of course you’d treat prompts as optimizable code.” The practitioners running it in production through 2025 have learned a set of things the documentation doesn’t fully cover. Worth pulling them together, because the marketing layer around DSPy is louder than the practitioner conversation, and the practitioner conversation carries the more useful information.

I’ve been running DSPy on my own stack: local-first workloads, homelab pipelines, personal-AI projects. The patterns below come from that work plus the public reporting and practitioner conversations I follow.

What DSPy actually does well

The strongest cases for DSPy in production:

Multi-step reasoning workloads with clear evaluation criteria. When you can define what “good” looks like and you have a workflow with multiple model calls in sequence, DSPy’s optimization across the whole pipeline produces meaningful quality lifts. These are the cases where I’ve gotten the most out of it.
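
A minimal sketch of what that shape looks like, assuming the current class-based DSPy API; the names (SummarizeEvidence, DraftAnswer, ResearchAnswer) are illustrative, not from a real deployment:

```python
import dspy

# Assumes an LM is configured, e.g. dspy.configure(lm=dspy.LM("openai/gpt-4o-mini")).

class SummarizeEvidence(dspy.Signature):
    """Condense retrieved documents into the evidence relevant to the question."""
    question: str = dspy.InputField()
    documents: str = dspy.InputField()
    evidence: str = dspy.OutputField()

class DraftAnswer(dspy.Signature):
    """Answer the question using only the supplied evidence."""
    question: str = dspy.InputField()
    evidence: str = dspy.InputField()
    answer: str = dspy.OutputField()

class ResearchAnswer(dspy.Module):
    """Two chained calls; an optimizer can tune both stages against one metric."""
    def __init__(self):
        super().__init__()
        self.summarize = dspy.ChainOfThought(SummarizeEvidence)
        self.respond = dspy.Predict(DraftAnswer)

    def forward(self, question, documents):
        evidence = self.summarize(question=question, documents=documents).evidence
        return self.respond(question=question, evidence=evidence)
```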

Prompts that need to work across multiple models. Optimizing a prompt against an evaluation suite produces an artifact that ports better across models than a hand-crafted prompt does. The structured approach captures more of what makes the prompt work; the resulting prompt is less brittle.

Complex extraction tasks. Pulling structured data from unstructured input is the standard DSPy use case. The framework handles this well; the optimization loop produces extractors that meaningfully outperform first-draft prompts.
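
The extraction case is mostly a signature. A sketch with made-up invoice fields:

```python
import dspy

class ExtractInvoice(dspy.Signature):
    """Pull structured fields out of a raw invoice email."""
    email_text: str = dspy.InputField()
    vendor: str = dspy.OutputField()
    total_usd: float = dspy.OutputField()
    line_items: list[str] = dspy.OutputField()

extractor = dspy.Predict(ExtractInvoice)
result = extractor(email_text="From: billing@acme.test\nSubject: Invoice #1042 ...")
print(result.vendor, result.total_usd)
```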

Tasks where the prompt drifts. When the prompt has to evolve as the underlying data evolves, DSPy’s “re-optimize against the current dataset” loop is a real workflow rather than a one-time tuning exercise.

These are the cases where the framework’s structure pays back consistently.

What DSPy doesn’t do well

The cases where DSPy adds friction without adding value:

Simple single-call prompts. A one-shot “summarize this” prompt doesn’t benefit from the framework’s structure. The optimization overhead is large; the quality lift is small.

Workloads without clear evaluation criteria. When you can’t define “good,” DSPy can’t optimize toward it. Trying to use the framework on subjective workloads produces a lot of optimization runs that don’t converge to anything useful.

One-off prototype work. The activation energy of setting up a DSPy module, an evaluation suite, and an optimization loop is high relative to “just write the prompt and iterate.” For prototypes, write the prompt directly.

Workloads where the model is going to change frequently. Each model swap requires re-optimization; the cost of the re-optimization is real. For workloads on a stable model, DSPy compounds value; for workloads that move models often, the compounding is reset each time.

These aren’t reasons not to use DSPy. They’re reasons to be deliberate about when it’s the right tool versus when it isn’t.

Things the documentation undersells

Three patterns the practitioner conversation surfaces that the docs don’t quite address:

The eval suite is 80% of the work. The optimization loop is straightforward; the evaluation suite that drives it requires real engineering. The DSPy work that actually pays off is mostly eval-suite work plus a thin layer of framework wiring. Deployments that underestimate the eval-suite work end up with optimizations that converge on the wrong target.
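
Concretely, “eval-suite work” means labeled examples plus a metric function. A sketch, continuing the invoice extractor above; the partial-credit weighting is an assumption, not a recommendation:

```python
import dspy

# Labeled examples; with_inputs() marks which fields the module receives at run time.
devset = [
    dspy.Example(email_text="From: billing@acme.test ...",
                 vendor="Acme", total_usd=1200.00).with_inputs("email_text"),
    # ...more held-out examples
]

def extraction_metric(example, pred, trace=None):
    """Partial credit: exact match on vendor, cent-level tolerance on the total."""
    vendor_ok = pred.vendor.strip().lower() == example.vendor.strip().lower()
    total_ok = abs(pred.total_usd - example.total_usd) < 0.01
    return (vendor_ok + total_ok) / 2

evaluate = dspy.Evaluate(devset=devset, metric=extraction_metric,
                         num_threads=8, display_progress=True)
evaluate(extractor)
```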

The optimizer choice matters more than the docs suggest. BootstrapFewShot, MIPROv2, and the various other optimizers behave differently on different workloads. The docs cover when to use each at a high level; the practitioner reality is “you’ll try a few and see what works for your specific case.” Budget for the experimentation.
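
What “try a few” looks like, reusing the metric and extractor from the sketches above; the settings are illustrative:

```python
import dspy

# trainset: labeled dspy.Example objects, disjoint from devset (see below).
bootstrap = dspy.BootstrapFewShot(metric=extraction_metric, max_bootstrapped_demos=4)
candidate_a = bootstrap.compile(extractor, trainset=trainset)

mipro = dspy.MIPROv2(metric=extraction_metric, auto="light")
candidate_b = mipro.compile(extractor, trainset=trainset)

# Compare on held-out data and keep the winner.
score_a = evaluate(candidate_a)
score_b = evaluate(candidate_b)
```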

The metric isn’t the loss function. The metric you optimize against (the eval signal) is not the same as the loss function the model is trained against. DSPy lets you optimize a wrapper system against your domain metric; the lift comes from the domain-metric-aware optimization, not from changing the model. The framing matters because teams sometimes try to use DSPy to “fix” model issues; that’s not what it does.

Things the practitioners do that aren’t in the framework

A few patterns that have emerged in production deployments:

Version every optimized prompt. The output of a DSPy optimization run is a prompt artifact. Version it like code. Test it like code. Roll it back like code. The teams that treat DSPy outputs as ephemeral end up with prompt drift they can’t reproduce.
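
DSPy makes this easy: a module’s optimized state (instructions plus demos) serializes to JSON. The path and version scheme here are whatever your repo conventions are:

```python
import dspy

# Save the optimized artifact alongside the code that produced it.
candidate_b.save("artifacts/invoice_extractor.v3.json")

# Any later process (or a rollback) reloads it into a fresh module.
restored = dspy.Predict(ExtractInvoice)
restored.load("artifacts/invoice_extractor.v3.json")
```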

Run optimizations against held-out data. The temptation is to optimize against the same data you’ll evaluate on. Resist. Hold out a real validation set and report against it.
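
A fixed-seed split is the minimal version of this discipline; the sizes are arbitrary:

```python
import random

# labeled_examples: list of dspy.Example, shuffled once with a fixed seed.
random.Random(42).shuffle(labeled_examples)
trainset = labeled_examples[:80]  # what the optimizer sees
devset = labeled_examples[80:]    # what you report against
```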

Cache the expensive evaluation calls. During an optimization run, the same input often gets evaluated many times. Caching at the evaluation layer produces meaningful speedups (10-50× on long runs) at minimal complexity cost.
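
DSPy’s LM client caches identical requests by default in recent versions, which covers the in-module calls; for judge-style metrics that live outside the module, a plain memo cache does the same job. A sketch with a hypothetical LLM-as-judge helper:

```python
import functools
import dspy

judge_lm = dspy.LM("openai/gpt-4o-mini")  # whichever judge model you use

@functools.lru_cache(maxsize=None)
def judge_answer(question: str, answer: str) -> float:
    """Repeated (question, answer) pairs hit the cache instead of the API."""
    out = judge_lm(f"Rate 0 to 1 how well this answers {question!r}:\n{answer}\nNumber only:")
    return float(out[0])  # fragile parse; fine for a sketch

def metric(example, pred, trace=None):
    return judge_answer(example.question, pred.answer)
```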

Set explicit budgets for optimization runs. DSPy will happily run for hours-to-days if you let it. Budget the time and the API spend explicitly; stop the run when you hit either.
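
The cheapest budget is one the optimizer enforces up front. The knobs below exist on MIPROv2; the values are illustrative, and a wall-clock or spend guard around the compile call is still worth having:

```python
import dspy

# Bound the run up front: auto="light" caps MIPROv2's trial budget, and the
# demo limits cap how much bootstrapping it attempts per candidate.
optimizer = dspy.MIPROv2(metric=extraction_metric, auto="light",
                         max_bootstrapped_demos=2, max_labeled_demos=4)
compiled = optimizer.compile(extractor, trainset=trainset)
```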

Pair DSPy with the 70/30 prompt-vs-context discipline. The optimization works best when the context layer is well-structured. Optimization on top of bad retrieval is wasted effort.

Where local DSPy fits

The local-DSPy pattern covers the privacy story. A few additional reasons local matters in production:

The optimization loop runs continuously without watching the bill. Hosted optimization is metered; local is essentially free at the margin. The teams that can run optimizations continuously ship more iterations of optimized prompts than the teams that can’t.

The reproducibility is real. Local model versions are pinned. Hosted models change. The optimization results from a local run are reproducible months later in a way hosted runs aren’t.

The integration with the broader local-AI stack is pretty clean. When the inference happens locally, the optimization happens locally, the storage is local, the privacy story is intact end-to-end.
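
The wiring itself is small. A sketch assuming an Ollama server; swap in whatever model tag and port your stack runs:

```python
import dspy

# Every module call now goes to the local endpoint; nothing leaves the machine.
lm = dspy.LM("ollama_chat/llama3.1:8b",
             api_base="http://localhost:11434", api_key="")
dspy.configure(lm=lm)
```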

The teams running serious DSPy in production are increasingly running it locally. The hosted version remains the right default for casual use; the local version is the production-grade choice.

What I’d recommend

For teams considering or deepening DSPy use:

  • Start with one workload that fits the framework. Pick a multi-step extraction or structured-reasoning task with clear evaluation criteria.
  • Invest in the eval suite. It’s the part that compounds. The framework adds value on top of a good eval suite; without one, the framework adds friction.
  • Try multiple optimizers. Don’t commit to BootstrapFewShot just because it’s the first one in the docs. Evaluate against your specific workload.
  • Version everything. Optimization runs are reproducible if you version the inputs (model, eval suite, prompt template, training data). Make this routine.
  • Build the local capability. The local-DSPy setup pays back when you start running optimizations regularly.
  • Measure the lift. Optimized prompts should outperform baseline prompts on the eval signal. If they don’t, the framework isn’t the issue; the workload doesn’t fit the pattern.

DSPy is a real production tool with a meaningful practitioner population in late 2025. The framework’s value compounds for the workloads that fit and adds friction for the workloads that don’t. Worth being honest about which side of the line each of your workloads is on. The teams that pick the right workloads get genuine value; the teams that try to apply it everywhere get frustrated.

The framing that’s emerging: DSPy is to prompt engineering what testing-and-CI is to software engineering. Optional for one-offs; foundational for production. Worth having the muscle when the workload calls for it.