MLX is real: training on a Mac Studio for the first time

Spent a weekend doing actual training work on the Mac Studio rather than the usual inference-only experimentation. MLX is meaningfully more capable for training than it was six months ago. Worth being plain about what works, what doesn't, and what it means for the Apple Silicon training story.

The Mac Studio on the office desk, status light glowing, notes scattered beside it.

I've been doing inference work on the Mac Studio since it landed in March, and the inference story is settled. Apple Silicon plus MLX runs the workhorse-tier open-weights models comfortably. The training story is the part I'd been skeptical of. The conventional read on MLX has been "fine for inference, immature for training, use NVIDIA when training matters." That read was right six months ago. I spent last weekend doing actual training work on the Studio for the first time, and the conclusion is that the conventional read needs updating.

MLX is meaningfully more capable for training than it was at the start of the year. Not parity-with-CUDA capable; meaningfully-useful capable. Let me walk through what worked, what broke, and what it means for the principled personal-AI crowd.

Diagram: why MLX matters, unified memory. One pool of 64–192 GB shared by the CPU, GPU, and Neural Engine, with no copies between processors. On a desktop with separate GPU memory, every model load is a copy; on Apple Silicon, the model loads once and every chip sees the same memory.

Chart: MLX training feasibility on a Mac Studio (M4 Max, 64 GB) versus CUDA. LoRA fine-tunes work well, full fine-tunes are viable at smaller scales, and pretraining from scratch is cloud-only.

What I actually did

The workload: fine-tune a Llama 3.1 8B base on a curated subset of my own writing, about 50K tokens of finished prose, structured as prompt-completion pairs derived from the shape of the source material. Goal: a model that helps me draft in my own voice for early-draft work, before the editorial pass moves it toward the published shape.
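
For concreteness, here's the shape of the data prep. This is a minimal sketch, assuming a folder of finished drafts and a crude opening-lines-as-prompt heuristic; the paths, the heuristic, and the JSONL field names are placeholders, and the real prep depends on how each piece is structured and on what your fine-tuning tool expects.

```python
# Sketch: turn finished prose into prompt-completion pairs for fine-tuning.
# SOURCE_DIR, OUT_PATH, the split heuristic, and the field names are all
# illustrative assumptions, not the exact pipeline from the run.
import json
from pathlib import Path

SOURCE_DIR = Path("writing/finished")     # hypothetical corpus location
OUT_PATH = Path("data/train.jsonl")

def to_pairs(text: str, opening_sentences: int = 2):
    """Use the opening of each paragraph as the prompt, the rest as the completion."""
    for para in (p.strip() for p in text.split("\n\n")):
        sentences = [s for s in para.split(". ") if s]
        if len(sentences) <= opening_sentences:
            continue
        prompt = ". ".join(sentences[:opening_sentences]) + "."
        completion = ". ".join(sentences[opening_sentences:])
        yield {"prompt": prompt, "completion": completion}

OUT_PATH.parent.mkdir(parents=True, exist_ok=True)
with OUT_PATH.open("w") as out:
    for doc in sorted(SOURCE_DIR.glob("*.md")):
        for pair in to_pairs(doc.read_text()):
            out.write(json.dumps(pair) + "\n")
```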

Not a serious research training run. The kind of weekend project that the principled-user population I keep writing about might actually do. Small, well-scoped, the kind of thing that's been "theoretically possible on Apple Silicon" for two years and "actually doable in practice" for somewhere between zero and six months.

The setup: MLX-Examples repository's LoRA fine-tuning script as the starting point. Custom data preparation. Custom evaluation harness against held-out samples. Six hours of total training time spread across two evenings, with the Studio doing other things (inference, indexing) in between.
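
The evaluation harness was nothing elaborate: average next-token loss (and perplexity) over the held-out samples, before and after the fine-tune. A sketch of the idea, assuming mlx-lm's load() and an adapter directory produced by the LoRA run; the paths and file names are placeholders.

```python
# Sketch: held-out evaluation as average token-level cross-entropy.
# "models/llama-3.1-8b" and "adapters/voice" are placeholder paths; valid.jsonl
# is assumed to use the same prompt/completion shape as the training data.
import json
import math
import mlx.core as mx
import mlx.nn as nn
from mlx_lm import load

model, tokenizer = load("models/llama-3.1-8b", adapter_path="adapters/voice")

total_loss, total_tokens = 0.0, 0
with open("data/valid.jsonl") as f:
    for line in f:
        pair = json.loads(line)
        ids = tokenizer.encode(pair["prompt"] + "\n" + pair["completion"])
        tokens = mx.array(ids)[None]            # shape (1, seq_len)
        logits = model(tokens[:, :-1])          # predict each next token
        loss = nn.losses.cross_entropy(logits, tokens[:, 1:], reduction="sum")
        total_loss += loss.item()
        total_tokens += tokens.shape[1] - 1

avg = total_loss / total_tokens
print(f"held-out loss {avg:.3f}, perplexity {math.exp(avg):.1f}")
```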

What worked

A few things that worked better than I expected.

LoRA fine-tuning at the 8B scale ran cleanly. The MLX LoRA path is mature enough to handle this without me writing custom kernels or fighting the framework. Throughput was reasonable: not as fast as the equivalent on a 5090, but faster than I'd estimated for Apple Silicon. The full run finished in roughly the time the same workload would take on a moderately spec'd CUDA workstation.

Memory management held up. 64 GB of unified memory is enough for an 8B base plus the LoRA adapter weights plus the optimizer state plus reasonable batch sizes. The unified-memory design means I didn't have to play the "what fits in VRAM" game that constrains comparable NVIDIA setups; the budget is the box's total RAM minus the OS overhead. Comfortable.
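
For a sense of why it's comfortable, here's the back-of-envelope version. Every number below is a rough assumption (bf16 base weights, rank-16 adapters on the attention projections, AdamW state kept only for the adapter weights), not a measurement from the run.

```python
# Back-of-envelope memory budget for an 8B LoRA fine-tune on a 64 GB box.
# All figures are rough assumptions for illustration.
GB = 1024**3

base_params = 8e9
base_weights = base_params * 2 / GB           # bf16: ~2 bytes per weight

lora_params = 20e6                            # order of magnitude for rank-16 adapters
adapter_weights = lora_params * 2 / GB
adapter_grads = lora_params * 4 / GB          # fp32 gradients, adapters only
optimizer_state = lora_params * 2 * 4 / GB    # AdamW: two fp32 moments per trainable weight

activations = 4.0                             # very rough; depends on batch and sequence length
os_and_everything_else = 12.0                 # macOS plus whatever else the box is running

total = (base_weights + adapter_weights + adapter_grads
         + optimizer_state + activations + os_and_everything_else)
print(f"base weights ~{base_weights:.1f} GB, "
      f"adapter-side ~{adapter_weights + adapter_grads + optimizer_state:.2f} GB")
print(f"rough total ~{total:.0f} GB of the 64 GB pool")
```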

The framework's debugging surface is usable. Loss curves, gradient norms, the basic instrumentation you need for training. Not as polished as the WandB integration I'd use on CUDA; functional. Good enough to actually understand what was happening across the run.
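
Concretely, the instrumentation is a few lines in the training loop. The model and data below are a toy stand-in so the snippet runs on its own; the logging pattern (value_and_grad plus a global gradient norm via tree_flatten) is the point, not the model.

```python
# Sketch: per-step loss and global gradient norm with MLX's standard utilities.
import mlx.core as mx
import mlx.nn as nn
import mlx.optimizers as optim
from mlx.utils import tree_flatten

model = nn.Linear(16, 1)                       # toy stand-in for the fine-tuned model
optimizer = optim.AdamW(learning_rate=1e-4)

def loss_fn(model, x, y):
    return nn.losses.mse_loss(model(x), y, reduction="mean")

def grad_norm(grads) -> float:
    """Global L2 norm over the nested tree of gradient arrays."""
    return mx.sqrt(sum(mx.sum(g * g) for _, g in tree_flatten(grads))).item()

loss_and_grad = nn.value_and_grad(model, loss_fn)
x, y = mx.random.normal((32, 16)), mx.random.normal((32, 1))
for step in range(100):
    loss, grads = loss_and_grad(model, x, y)
    optimizer.update(model, grads)
    mx.eval(model.parameters(), optimizer.state)
    if step % 10 == 0:
        print(f"step {step:3d}  loss {loss.item():.4f}  grad-norm {grad_norm(grads):.3f}")
```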

Quantization-aware fine-tuning (QLoRA-style) works. I ran one experiment fine-tuning against a 4-bit quantized base. The MLX quantization path is mature enough that this works with reasonable quality preservation. The memory savings let me push to slightly larger batch sizes.
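
The QLoRA-style path amounts to quantizing the base once and pointing the fine-tune at the quantized copy. A sketch assuming mlx-lm's convert(); the checkpoint name, output path, and keyword arguments are placeholders and may differ across mlx-lm versions.

```python
# Sketch: make a 4-bit copy of the base model to fine-tune against.
# Paths and keyword names are assumptions; check your mlx-lm version's docs.
from mlx_lm import convert

convert(
    hf_path="meta-llama/Llama-3.1-8B",        # whichever base checkpoint you're using
    mlx_path="models/llama-3.1-8b-4bit",      # where the quantized copy lands
    quantize=True,
    q_bits=4,                                 # 4-bit weights
    q_group_size=64,                          # quantization group size
)
```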

Inference of the resulting model is seamless. The fine-tuned LoRA loads cleanly into the same MLX inference stack I use for the base model. No format conversion, no platform switching. The output quality is what I'd expect for a small fine-tune on personal data, recognizably my voice on prose generation, not magic, useful for early-draft work.
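
Loading the result looks like this. A minimal sketch assuming mlx-lm's load() and generate(); the model path, adapter directory, and prompt are placeholders.

```python
# Sketch: run the fine-tuned adapter in the same stack used for the base model.
from mlx_lm import load, generate

model, tokenizer = load(
    "models/llama-3.1-8b",            # same base the adapter was trained against
    adapter_path="adapters/voice",    # LoRA weights from the fine-tuning run
)
draft = generate(
    model,
    tokenizer,
    prompt="Draft an opening paragraph about local-first AI hardware:",
    max_tokens=300,
)
print(draft)
```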

What broke or is still rough

A few places where MLX still feels less mature than the CUDA equivalent.

Multi-GPU / multi-machine training is not really there. The Studio is one box. Distributed training across multiple Apple Silicon machines exists in research preview but isn't production-ready. For workloads that need this, NVIDIA still wins decisively.

Some optimizer choices aren't well-supported. The standard ones (AdamW, SGD with momentum) work fine. Some of the more exotic optimizers I'd reach for in research-grade fine-tuning don't have well-tested MLX versions. Worked around it by sticking to the well-supported ones; in a research context that constraint would matter more.

Training data loading is slower than it should be. The data-loading pipeline I cobbled together had to do more on-the-fly tokenization than the equivalent CUDA setup would. Throughput was bottlenecked by data prep, not by the model compute, on the smaller-batch experiments. The MLX data-loading utilities are improving; they're not yet at the polish level of PyTorch's DataLoader.
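
One way around this is to tokenize the corpus once, up front, so the training loop only shuffles and batches arrays. A sketch with placeholder paths, using the Hugging Face tokenizer that the MLX tooling already depends on; this is the obvious pre-processing move, not an official MLX data path.

```python
# Sketch: pre-tokenize the training pairs so the loop never tokenizes on the fly.
# Paths are placeholders; the tokenizer is loaded from the local model directory.
import json
import numpy as np
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("models/llama-3.1-8b")

examples = []
with open("data/train.jsonl") as f:
    for line in f:
        pair = json.loads(line)
        text = pair["prompt"] + "\n" + pair["completion"]
        examples.append(np.array(tokenizer.encode(text), dtype=np.uint32))

np.savez("data/train_tokens.npz", *examples)   # saved as arr_0, arr_1, ...
print(f"pre-tokenized {len(examples)} examples")
```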

Mixed-precision training has rough edges. FP16 / BF16 training works for the basic cases. Some of the precision-sensitive training tricks I'd use on CUDA (explicit loss scaling, mixed-precision gradient accumulation) needed more babysitting than they should.
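
For reference, the explicit-loss-scaling pattern I mean looks roughly like this. It's a toy (a linear model standing in for the real one) so it runs on its own; the pattern is to scale the loss up before the backward pass, unscale the gradients, and skip the update if anything overflowed.

```python
# Sketch: manual loss scaling around a training step, with an overflow check.
import mlx.core as mx
import mlx.nn as nn
import mlx.optimizers as optim
from mlx.utils import tree_flatten, tree_map

LOSS_SCALE = 1024.0                            # static scale; a real run might back off on overflow

model = nn.Linear(16, 1)                       # toy stand-in for the fine-tuned model
optimizer = optim.AdamW(learning_rate=1e-4)

def scaled_loss(model, x, y):
    # scale the loss so small fp16 gradients don't underflow to zero
    return nn.losses.mse_loss(model(x), y, reduction="mean") * LOSS_SCALE

loss_and_grad = nn.value_and_grad(model, scaled_loss)

def train_step(x, y):
    loss, grads = loss_and_grad(model, x, y)
    grads = tree_map(lambda g: g / LOSS_SCALE, grads)        # unscale before the update
    overflowed = any(
        bool(mx.isnan(g).any().item()) or bool(mx.isinf(g).any().item())
        for _, g in tree_flatten(grads)
    )
    if not overflowed:                                       # skip the step if gradients blew up
        optimizer.update(model, grads)
        mx.eval(model.parameters(), optimizer.state)
    return (loss / LOSS_SCALE).item(), overflowed

x, y = mx.random.normal((8, 16)), mx.random.normal((8, 1))
print(train_step(x, y))
```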

Documentation is patchy. Most of my real questions got answered by reading the MLX-Examples source rather than by finding the right doc. The framework is improving; the documentation is lagging.

What this means for the Apple Silicon training story

A few things that landed for me from doing this.

The inflection point I wrote about for inference is starting to apply to training too. The training story isn't where the inference story is; it's about a year behind, maybe less. The trajectory is the same: Apple Silicon plus mature MLX equals "good enough for the workloads that matter to individuals and small teams" without needing the CUDA stack.

Personal-AI training is genuinely doable on the hardware I already own. I didn't need to rent GPUs. I didn't need to send my data to a cloud-training provider. The fine-tune happened entirely on a box that lives in my office, with data that never left the local network. For the privacy-bound use cases that motivate the home-AI-setup in the first place, this is a meaningful capability extension.

The cost story is wild. The marginal cost of the training run was a few dollars of electricity. The opportunity cost was the Studio doing inference at slightly slower throughput for a couple of evenings. Compared to the equivalent on rented cloud GPUs (call it $20–50 of compute time for the small workload, more for serious experiments), the local-training math wins for the small-experiment case the same way local inference wins for the always-on case.

The CUDA-or-bust dogma needs updating. The reflexive "if you're doing AI training you need NVIDIA" is increasingly wrong for the personal and small-team scale. It's still right at the production-research scale. Most personal-AI training falls into the former category, and the answer there is "Apple Silicon plus MLX is a real option now."

What I'd recommend trying

For someone in the principled-user crowd thinking about whether to do training experiments at home:

  • Start with LoRA fine-tuning at 7B–8B. This is the most-mature MLX training path. Lowest activation energy, highest probability of success.
  • Use the MLX-Examples repo as the starting template. The fine-tuning script there is a reasonable scaffold to extend.
  • Curate your data carefully. Most of the value of personal-data fine-tuning is in the data quality, not the model quality. A small, clean dataset beats a larger, messy one.
  • Don't expect parity with CUDA throughput. It's reasonable; it's not as fast as a comparable NVIDIA setup. The trade-off is the privacy and operational profile of running on hardware you own.
  • Plan for some framework friction. You'll hit edges. Read the source. The community is helpful but small; don't expect Stack Overflow answers for every problem.

The home-lab buyer's guide covers the broader buy decision; the training-on-Apple-Silicon story is the case where the Tier-2 setup I described earns more of its keep than just inference would justify.

The bigger frame

The MLX training story is the part of the personal-AI setup that was most clearly missing six months ago. With this weekend's experiment as the data point, it's missing less than I thought. The trajectory says "comfortably workable for individual training use cases by mid-2026, probably parity with mid-tier CUDA setups by 2027." That's faster than I'd have predicted at the start of the year.

The implication for individuals investing in the personal-AI stack: your hardware budget can include "and I might fine-tune things on this" rather than "and I'll rent GPUs when I want to fine-tune." That's a meaningful expansion of what owning the hardware enables. Worth knowing as the next round of buy decisions gets made.

MLX is real. Not for everything; not for everyone. Real for the use cases that matter most to individuals doing principled personal-AI work. The framework caught up faster than I expected. The next year tells whether it stays caught up or pulls further ahead.