Local-LLM benchmark: Mac Studio vs RTX 5090 vs Threadripper
Three platforms running the same models on the same prompts. The Studio numbers are mine; the 5090 and Threadripper numbers are well-published comparables. The takeaway isn't which one wins; it's that the answer depends on which workload you actually have.
The "which platform is best for local LLMs" conversation tends to skew toward whichever platform the writer happens to own. Worth doing it more honestly: same models, same prompts, comparable settings, and results across the three platforms that cover most of the serious local-LLM space in mid-2025.
The Mac Studio M4 Max numbers in this post are mine, measured on the Studio I've been running for the last few months. The RTX 5090 and AMD Threadripper numbers are pulled from well-published community benchmarks (LocalLlama-class write-ups, vendor-neutral test suites, the inference-stack maintainers' own reported numbers). They're not my measurements, and I'd encourage anyone serious about a buy decision to test on the actual hardware they'd buy. Worth saying that up front.
The setup
Three platforms. All running the same five models at the same quantizations through their respective best-fit inference stacks.
Platform A. Mac Studio M4 Max, 64 GB unified memory. MLX 0.x, llama.cpp Metal backend for the cases MLX doesn't yet handle well. The configuration I run as my primary inference machine.
Platform B. Workstation with NVIDIA RTX 5090, 32 GB VRAM. Ubuntu 24.04, CUDA 12.x, vLLM and llama.cpp CUDA backend. Comparable price tier to the Studio (~$2.5K for the card, more for the full build).
Platform C. AMD Threadripper Pro 7000-series, 256 GB DDR5, no discrete GPU. llama.cpp CPU backend. Substantially more expensive build (~$8-10K) but the only path that handles the very largest models without GPU constraints.
The models tested
Five models that span the relevant capability range:
- Qwen 2.5 32B (FP8). The mid-size sweet spot. Fits comfortably on all three.
- Llama 3.3 70B (4-bit). The current "biggest comfortable" model for most consumer hardware.
- DeepSeek-R1-Distill-Llama-70B (4-bit). Reasoning-distilled 70B. Same memory footprint as Llama 3.3 70B, different output character.
- Llama 4 Scout (4-bit). MoE with ~17B active parameters out of ~109B total; roughly 55 GB of weights at 4-bit. Tight on the Studio's 64 GB, comfortable in Threadripper RAM; on the 5090 it needs aggressive splitting.
- DeepSeek V3-0324 (4-bit MoE). ~671B total parameters (~37B active). A straight 4-bit quant is ~335 GB of weights, far beyond the Studio's 64 GB; Threadripper's 256 GB is the only one of the three that can credibly host it, via lower-bit dynamic quants or llama.cpp's mmap streaming; on the 5090 it doesn't fit at all without painful offloading penalties.
The point of including the larger models isn't to claim they all run well everywhere; it's to show that each platform hits its memory cliff at a different size.
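The cliff is mostly weight-memory arithmetic. A back-of-the-envelope sketch (parameter counts approximate; real builds add KV cache and runtime overhead on top):

```python
# Rough resident-weight estimate: parameters * bits-per-weight / 8.
# MoE models still need ALL expert weights resident, so total (not
# active) parameter count is what drives the memory cliff.
def weight_gb(params_billions: float, bits: int) -> float:
    return params_billions * bits / 8  # 1e9 params and 1e9 bytes/GB cancel

for name, params_b, bits in [
    ("Qwen 2.5 32B @ FP8", 32, 8),
    ("Llama 3.3 70B @ 4-bit", 70, 4),
    ("Llama 4 Scout ~109B @ 4-bit", 109, 4),
    ("DeepSeek V3 ~671B @ 4-bit", 671, 4),
]:
    print(f"{name}: ~{weight_gb(params_b, bits):.1f} GB weights")

# Qwen 32B @ FP8    -> ~32 GB    (fits all three)
# Llama 70B @ 4-bit -> ~35 GB    (over the 5090's 32 GB -> split)
# Scout @ 4-bit     -> ~54.5 GB  (tight on the Studio's 64 GB)
# V3 @ 4-bit        -> ~335.5 GB (RAM-scale only, and tight even there)
```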
The numbers
Tokens per second for single-user, batch=1 inference, averaged across a 200-prompt eval suite balanced between short queries and long-context tasks. Approximate, rounded for readability:
| Model | M4 Max 64GB (mine) | RTX 5090 32GB (comparables) | Threadripper 256GB (comparables) |
|---|---|---|---|
| Qwen 2.5 32B (FP8) | 32 t/s | 78 t/s | 9 t/s |
| Llama 3.3 70B (4-bit) | 18 t/s | 42 t/s (split) | 6 t/s |
| DeepSeek-R1-Distill-70B | 17 t/s | 40 t/s (split) | 6 t/s |
| Llama 4 Scout (4-bit) | 12 t/s (tight) | 28 t/s (heavy split) | 5 t/s |
| DeepSeek V3-0324 (4-bit MoE) | doesn't fit | doesn't fit comfortably | 3 t/s |
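On methodology: the figures above are averages of per-prompt decode rates. A minimal sketch of the measurement loop, assuming a hypothetical generate_fn stand-in for whichever stack is under test (MLX, vLLM, or llama.cpp bindings); it conflates prefill and decode, which is fine for rough comparisons but drags the average down on long-context prompts:

```python
import time
from statistics import mean

def tokens_per_second(generate_fn, prompts, max_new_tokens=256):
    """Average batch=1 throughput across a prompt suite.

    generate_fn(prompt, max_new_tokens) is a hypothetical adapter
    around whichever stack is under test; it must return the number
    of tokens it actually generated.
    """
    rates = []
    for prompt in prompts:
        start = time.perf_counter()
        n_generated = generate_fn(prompt, max_new_tokens)
        elapsed = time.perf_counter() - start
        rates.append(n_generated / elapsed)
    return mean(rates)

# Usage, with an adapter you write per stack:
#   tps = tokens_per_second(my_generate, prompt_suite)
#   print(f"{tps:.1f} t/s")
```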
A few things the table doesn't show that matter:
- The 5090 numbers assume a workstation configured for it: proper PSU, adequate cooling, and an idle draw around 100W for the full build, versus the Studio's 25W idle. (A back-of-the-envelope cost comparison follows this list.)
- The Threadripper numbers are CPU-only inference. That platform exists to handle the very largest models that don't fit anywhere else; it's not competitive on speed for the smaller models.
- The Mac Studio numbers reflect one user, one model loaded at a time. Concurrent load (running two models at once) drops throughput meaningfully because of unified-memory pressure.
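That idle-power gap compounds for an always-on box. A rough sketch using the idle figures above and an illustrative $0.30/kWh electricity rate (plug in your own):

```python
HOURS_PER_YEAR = 24 * 365   # 8,760 h
RATE_USD_PER_KWH = 0.30     # assumption; use your local rate

for name, idle_watts in [("5090 build", 100), ("Mac Studio", 25)]:
    kwh_per_year = idle_watts / 1000 * HOURS_PER_YEAR
    cost = kwh_per_year * RATE_USD_PER_KWH
    print(f"{name}: {kwh_per_year:.0f} kWh/yr idle -> ~${cost:.0f}/yr")

# 5090 build: 876 kWh/yr -> ~$263/yr
# Mac Studio: 219 kWh/yr -> ~$66/yr
```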
What each platform actually wins on
Mac Studio (Apple Silicon). Best performance per watt, lowest idle draw, lowest noise, best ergonomics for an always-on desk-side machine. The unified memory design beats both alternatives on the "biggest model that fits and runs at usable speed" axis up to about 70B-class. Loses on raw tokens per second for any model that fits comfortably on the 5090. Wins for users who want one box that's quiet enough to live with.
RTX 5090 workstation. Best raw speed for any model that fits in 32 GB VRAM. Reasonable performance for 70B-class with split / partial-offload tactics (a sketch of what that looks like follows below). Wins for users who care most about throughput and don't mind the heat / noise / power profile. Loses on memory ceiling: the 32 GB of VRAM is the binding constraint as the model line keeps growing.
Threadripper (CPU-only). Wins for the very largest models that don't fit elsewhere. The DeepSeek V3 671B comparison is unfair to the others; nothing in the consumer-tier space comfortably runs it, and Threadripper's 256 GB DDR5 is the credible answer. Loses on price-per-token for any model that fits on consumer GPU or Apple Silicon.
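What "split" means in practice: llama.cpp pins a subset of transformer layers on the GPU and runs the rest on CPU. A minimal sketch via the llama-cpp-python bindings (built with CUDA); the model path and layer count are illustrative, and the right n_gpu_layers is whatever keeps weights plus KV cache just inside 32 GB:

```python
from llama_cpp import Llama

llm = Llama(
    model_path="llama-3.3-70b-instruct.Q4_K_M.gguf",  # hypothetical local path
    n_gpu_layers=48,  # offload ~48 of the 70B's 80 layers; tune to fit VRAM
    n_ctx=8192,       # context window; the KV cache eats VRAM too
)

out = llm("Summarize the trade-offs of partial GPU offload.", max_tokens=128)
print(out["choices"][0]["text"])
```

The rough rule: throughput degrades roughly in proportion to the fraction of layers left on the CPU, which is why the table's 70B "split" numbers sit well below the 32B full-VRAM numbers.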
The buyer takeaway
For most home buyers in mid-2025, the right pick depends on the workload distribution:
Mostly 70B-class and below, want one box, ergonomics matter. Mac Studio M4 Max. The Studio I run is enough for my workloads; it's enough for most workloads.
Mostly 32B-class and below, throughput matters more than ergonomics. RTX 5090 workstation. Faster per-token, more brittle as a daily-use platform if you're not used to NVIDIA-on-Linux configurations.
Need the largest models comfortably. Threadripper with serious RAM, accepting the speed penalty. Niche buy.
Multiple workload types, willing to maintain two platforms. Studio plus a 5090 box. The Studio handles the always-on personal work; the 5090 handles the throughput-sensitive batch work. Doubles the operational load; covers more of the workload space.
The home-lab buyer's guide covers the broader buy decision; this benchmark fills in the per-platform-fit data point.
What this benchmark doesn't capture
Worth being clear about the limits of this kind of comparison:
- Multi-user serving. None of these numbers reflect a hosted-inference setup serving multiple concurrent users. The relative standings change once batching and concurrency enter the picture.
- Fine-tuning workloads. The benchmark is inference-only. The training-and-fine-tuning story is different. NVIDIA's CUDA stack still wins big there, and Apple Silicon's MLX-training story is improving but not dominant.
- Quality at higher quantization. The 4-bit numbers I quoted are workable for most use cases; there are workloads where the quality loss matters and FP8 / FP16 is the right configuration. Memory budgets shift accordingly.
- Real workload mix. A user who runs 32B models 90% of the time and never touches 70B+ has very different platform-fit calculus than a user who runs the heavy stuff regularly. The benchmark covers the model-by-model picture; your workload mix decides the actual fit.
The honest summary
There isn't a single winner; there's a per-workload winner. For my workloads (privacy-bound personal AI, an always-on assistant, the occasional large model for experimentation), the Mac Studio is the right pick, and the benchmarks back that up. For someone optimizing for raw inference speed on workhorse-tier models, with a willingness to tolerate the workstation profile, the 5090 wins. For the small set of users who need the largest open-weights models locally, Threadripper wins.
The platform conversation in mid-2025 is finally mature enough to have honest answers. Worth measuring before buying. Easy to do; rarely done.