The Mac Studio side of the stack
mflux, ollama/mlx-lm, fine-tuning, whisper, a batch runner, what runs on the desk, and the math on why it pays for itself in months.
A friend who runs a one-person creative studio asked me last fall whether she should "get an AWS account" to do AI work. She wanted to generate marketing images for her clients, transcribe interview recordings, summarize a stack of customer notes into a positioning brief. I asked her what kind of computer she had. She said a four-year-old MacBook Pro. I told her to buy a Mac Studio instead.
She did. The thing has paid itself back, in saved cloud bills and saved time, somewhere around four months in. She runs image generation, transcription, and summarization on it constantly. She has not opened an AWS console. She is shipping more work than she was when she was thinking about cloud architecture.
That is, in miniature, the case for the local side of the stack.
This piece covers the actual contents of the Mac Studio in an AI MVP: what runs on it, what each tool does, what the up-front cost is, and what AWS bill it replaces. The previous piece in the series was the AWS-native cloud side. The piece before that was why the split makes sense in the first place. This one is the desk side.
What goes on it
Five things, each with a specific job.
- An image-generation engine for back-office assets. Marketing illustrations, blog hero images, site graphics, social posts.
- Local text inference for batch work, eval runs, and internal tooling.
- A fine-tuning pipeline that turns captured judgment into a small custom model.
- A speech-to-text engine for transcribing voice notes and customer calls into training data.
- A scheduled batch runner that pulls work from cloud SQS, processes it locally, pushes results back.
Each of these is a separate tool. They share the Mac Studio's GPU and memory. Most of the time only one is running. When they need to overlap, they queue up.
Let me go through them.
Image gen: mflux
mflux is a Mac-native image-generation tool. (Image-generation models like this are called diffusion models, if you want to look them up later; the technique generates an image by gradually denoising random noise until it resolves into a coherent picture.) mflux runs the Flux family of models efficiently on Apple Silicon. The output quality is comparable to what you'd get from a hosted service, the latency is a few seconds per image, and the marginal cost is electricity.
Use cases I lean on it for: marketing page hero images for a new feature, illustrations for a blog post (this site, actually), social-post graphics for the consultant's brand, internal diagrams.
The contrast with cloud: cloud image generation services charge somewhere in the range of a few cents to a few tens of cents per image, plus per-model-call overhead. Generating two hundred marketing assets in an afternoon is maybe $20-40 on a hosted service. Free on the Mac Studio. The first month of marketing-asset generation pays for the price difference between a smaller Mac and the one you actually want.
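For concreteness, here's the shape of that kind of batch run. The mflux-generate flags shown (--model, --prompt, --steps, --seed, --output) are how the version on my machine behaves; treat them as assumptions and check the project's README before copying.

```python
# A minimal sketch of driving mflux from a batch script over a list of prompts.
import subprocess
from pathlib import Path

prompts = [
    "flat illustration of a calendar with a checkmark, brand blue, white background",
    "isometric illustration of a stack of documents being sorted, brand blue",
]

out_dir = Path("marketing-assets")
out_dir.mkdir(exist_ok=True)

for i, prompt in enumerate(prompts):
    subprocess.run(
        [
            "mflux-generate",
            "--model", "schnell",   # the faster Flux variant; "dev" trades speed for quality
            "--prompt", prompt,
            "--steps", "4",
            "--seed", str(i),       # fixed seeds make reruns reproducible
            "--output", str(out_dir / f"asset-{i:03d}.png"),
        ],
        check=True,
    )
```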
A career coach who's building out a productized resume-review service used this exact tool to generate the hundred-or-so before/after illustrations for her marketing site in a weekend. Hosted image services would have charged her a few hundred dollars for the same job. On the Mac Studio, it was an afternoon of electricity.
Text inference: ollama and mlx-lm
Two tools, slightly different strengths, both running open-weight text models on the Mac Studio.
Ollama is the friendly one. You run ollama pull llama3 (or whatever model you want) and ollama run llama3, and you have a working local model in two commands. It exposes an HTTP API on localhost that looks roughly like the OpenAI API, which means any tool that already knows how to talk to OpenAI can be pointed at Ollama with one config change. For internal tooling, evaluation scripts, and batch jobs that need a quick model call, this is what I reach for.
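That one-config-change claim is literal. A minimal sketch, assuming the openai Python client and a model you've already pulled:

```python
# Any OpenAI-style client can be pointed at Ollama's local endpoint. Ollama
# listens on localhost:11434 and exposes an OpenAI-compatible API under /v1;
# the api_key is ignored but the client library insists on one.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

resp = client.chat.completions.create(
    model="llama3",  # whatever you pulled with `ollama pull`
    messages=[
        {"role": "system", "content": "Summarize customer notes in three bullets."},
        {"role": "user", "content": open("notes/today.txt").read()},  # placeholder path
    ],
)
print(resp.choices[0].message.content)
```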
mlx-lm is the more serious one. It's an Apple-native inference library that gets the most out of the Mac Studio's hardware, unified memory, the neural engine, the GPU. For larger models or higher-throughput batch work, mlx-lm is the difference between "this takes overnight" and "this takes an hour." It's also the foundation for the fine-tuning workflow, which is the next item.
What I run them for: nightly summarization of the day's customer interactions, eval runs against golden examples whenever a prompt changes, batch generation of synthetic training examples to augment captured judgment, the consultant's internal review queue (which uses a local model to pre-classify queue items before the consultant looks at them).
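The nightly summarization job, in mlx-lm terms, is roughly this shape. The model name is an example from the mlx-community collection on Hugging Face, and the folder layout is made up; the point is the load-once, generate-many pattern:

```python
# Load a quantized model once, then run the day's documents through it.
from pathlib import Path
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Meta-Llama-3-8B-Instruct-4bit")  # example model

out_dir = Path("summaries")
out_dir.mkdir(exist_ok=True)

for path in sorted(Path("interactions/today").glob("*.txt")):  # hypothetical folder layout
    prompt = f"Summarize this customer interaction in two sentences:\n\n{path.read_text()}"
    summary = generate(model, tokenizer, prompt=prompt, max_tokens=200)
    (out_dir / path.name).write_text(summary)
```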
The cost-replace math: every one of these jobs, run against Bedrock or a hosted API at iteration speed, would cost meaningful money per run. Hosted inference runs maybe $0.50-$3.00 per million tokens depending on the model, and a nightly summarization batch alone can run to a few million tokens. Across a year, that's hundreds to thousands of dollars in cloud inference costs for batch work that doesn't need cloud latency. The Mac Studio does it for the electricity.
The capture path: mlx-lm fine-tuning
This is the secret-sauce-est use of the Mac Studio.
The piece on capturing the secret sauce described what the consultant's annotated examples, decision rules, and failure modes look like as captured material. The fine-tune is what turns that material into a small custom model, a model that's been trained to respond the way this consultant would, on inputs like the ones this consultant sees.
The mlx-lm fine-tuning pipeline does this. You feed it captured examples in a structured format. It produces a fine-tuned model (small enough to run on a single Mac Studio, big enough to encode the consultant's style and decision patterns). You upload the result to S3 as a weights artifact. The cloud side picks it up on its next cold-start. Done.
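A sketch of that path end to end, with the caveat that the mlx_lm.lora and mlx_lm.fuse flags, the data layout, and the bucket name are assumptions you'd swap for your own recipe:

```python
# Fine-tune a LoRA adapter on the captured examples, fuse it into a standalone
# model, and push the result to S3 for the cloud side to pick up on cold start.
import subprocess
from pathlib import Path

import boto3

BASE_MODEL = "mlx-community/Meta-Llama-3-8B-Instruct-4bit"  # example base model
DATA_DIR = "captured/consultant"      # expects train.jsonl / valid.jsonl
BUCKET = "mvp-model-artifacts"        # hypothetical bucket name

# 1. Train a LoRA adapter on the captured examples.
subprocess.run(
    ["python", "-m", "mlx_lm.lora",
     "--model", BASE_MODEL, "--train", "--data", DATA_DIR, "--iters", "1000"],
    check=True,
)

# 2. Fuse the adapter into a standalone model directory.
subprocess.run(
    ["python", "-m", "mlx_lm.fuse",
     "--model", BASE_MODEL, "--adapter-path", "adapters", "--save-path", "fused_model"],
    check=True,
)

# 3. Upload the fused weights as the artifact the cloud side pulls on cold start.
s3 = boto3.client("s3")
for path in Path("fused_model").rglob("*"):
    if path.is_file():
        key = f"models/consultant/{path.relative_to('fused_model')}"
        s3.upload_file(str(path), BUCKET, key)
```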
Capability note: fine-tuning is what closes the loop between captured judgment and live customer queries. Without it, you're stuck doing retrieval against captured material every time, which works but is slower and more expensive than having a small model that's already absorbed the consultant's patterns. The retrieval layer still exists (for the long-tail specifics) but the model handles the bulk of the style and judgment work directly.
Doing this kind of training in the cloud is technically possible and operationally expensive. Renting a GPU machine on AWS for a few hours of fine-tune work costs more than the electricity of running it locally; you have to provision and tear down the instance; you have to ship the training data back and forth; and you can't easily iterate on the recipe. On the Mac Studio it's one command, and you can iterate freely.
An HR consultant who built an interview-rubric scoring service used this exact pattern. She captured a few hundred annotated interview transcripts (her notes on what she'd ask next, what she'd flag, what she'd score high or low). The fine-tune ran on a Mac Studio overnight. The resulting small model encoded her interview style well enough that the cloud side could use it for live screening conversations, with retrieval against the corpus for the situations that fell outside the trained patterns.
Transcription: whisper
Whisper turns audio into text. It runs locally on a Mac Studio at faster-than-real-time speeds for most use cases. The output is good enough that for English-language voice notes and customer calls, it's basically a solved problem.
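A minimal transcription pass, using the open-source openai-whisper package (an MLX-native port exists too, with a similar interface); the file paths are placeholders:

```python
# Transcribe a recording to plain text. openai-whisper shells out to ffmpeg
# for audio decoding, so ffmpeg needs to be on the PATH.
from pathlib import Path
import whisper

model = whisper.load_model("medium")  # size is a speed/accuracy trade-off
result = model.transcribe("recordings/discovery-call.m4a")  # placeholder path

out = Path("transcripts")
out.mkdir(exist_ok=True)
(out / "discovery-call.txt").write_text(result["text"])
```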
Why this matters: a lot of the captured judgment from consultants comes from recorded conversations. Discovery calls, client interviews, training sessions, debriefs after a project. The consultant's spoken explanations of what they did and why are often more useful training material than their written notes, because people are looser and more honest in speech.
Whisper turns those recordings into structured transcripts. Those transcripts go into the captured-judgment corpus. The captured corpus feeds the fine-tune. The fine-tune produces the small custom model. The model goes to S3. The cloud picks it up. Loop closed.
Running whisper in the cloud is technically fine and costs real money per hour of audio. Running it locally is free after the cost of the machine. For a consulting practice with hours of audio per week, this alone justifies the local setup.
The batch runner
The last piece is the glue. A scheduled job (cron, or launchd on macOS, or one of the modern equivalents) that wakes up on a schedule, pulls work from the cloud SQS queue, runs it against whichever of the above tools is appropriate, and pushes results back.
The pattern is simple: cloud puts work on SQS (a fine-tune request, a batch eval, a transcript job). The batch runner polls SQS. When there's work, it processes it. When there's no work, it sleeps. Results go back to S3 or to a small status API. Failed jobs go to a dead-letter queue for inspection.
The whole batch runner is maybe a hundred lines of Python. The piece on the hybrid sync pattern walks through the actual code shape, but the headline is: this glue is not the hard part. The cloud and local pieces are well-formed enough that wiring them together is straightforward.
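A condensed sketch of that shape, with stubbed handlers where the real runner shells out to whisper, mlx-lm, or the fine-tune script; the queue URL and bucket name are placeholders:

```python
# Poll SQS for work, dispatch by job type, write results back to S3, and only
# delete the message once the result is safely written.
import json

import boto3

sqs = boto3.client("sqs")
s3 = boto3.client("s3")

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/local-work"  # placeholder
RESULTS_BUCKET = "mvp-results"                                             # placeholder


def handle(job: dict) -> dict:
    # Stubbed dispatch; in the real runner each branch calls one of the tools above.
    if job["type"] == "transcribe":
        return {"status": "done", "note": "would call whisper here"}
    if job["type"] == "fine_tune":
        return {"status": "done", "note": "would call mlx_lm.lora here"}
    return {"status": "skipped", "note": f"unknown job type {job['type']}"}


def poll_once() -> None:
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=1, WaitTimeSeconds=20)
    for msg in resp.get("Messages", []):
        job = json.loads(msg["Body"])
        result = handle(job)
        s3.put_object(
            Bucket=RESULTS_BUCKET,
            Key=f"results/{job['id']}.json",
            Body=json.dumps(result).encode(),
        )
        # Failures raise before this line, so the message stays on the queue and
        # eventually lands in the dead-letter queue for inspection.
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])


if __name__ == "__main__":
    poll_once()  # cron/launchd invokes this on a schedule
```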
Want the full sync wiring detail? That's its own piece, how cloud and local actually talk, and it'll save you a few false starts if you read it before building the connector.
The cost story
Now the math, because this is where the local-vs-cloud argument lives or dies.
Up-front cost: a Mac Studio with enough memory and GPU for serious AI work runs somewhere between four and seven thousand dollars depending on configuration. Call it $5,000 for a reasonable starting machine.
Ongoing cost: electricity. A Mac Studio under load draws maybe 150-250 watts. Running it 24/7 at 200 watts averages about $200-400 a year in electricity at typical US residential rates. Most jobs only run for a few hours a day, so the real number is lower.
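If you want the arithmetic behind that range, assuming 200 watts around the clock and US residential rates between roughly $0.12 and $0.23 per kWh:

```python
# Annual electricity cost at a constant 200 W draw.
kwh_per_year = 0.2 * 24 * 365                  # ~1,752 kWh
low, high = kwh_per_year * 0.12, kwh_per_year * 0.23
print(f"{kwh_per_year:.0f} kWh/year -> ${low:.0f}-${high:.0f}/year")  # ~$210-$400
```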
What it replaces, if you ran the same work in the cloud:
- Image generation for marketing: easily $30-100/month at moderate use, more if you're iterating on a campaign.
- Local text inference for batch and eval: $50-300/month depending on how aggressively you iterate. Eval runs alone can hit a few hundred a month if you're disciplined about regression testing.
- Fine-tuning: cloud fine-tune jobs on managed services run $50-500 per run depending on size, and you'll do dozens of runs over a year.
- Transcription: $10-50/month for moderate audio volume on cloud transcription services.
- Internal tooling running on a hosted model: $20-80/month.
Conservatively, the cloud equivalent of the work this machine does is somewhere between $150 and $1000 a month, depending on how active your team is. Even at the low end, the Mac Studio pays itself back in a bit under three years. At the more typical end (for a team that's actively iterating on an AI product) it pays back inside a year, often inside six months.
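The payback arithmetic, under those assumptions:

```python
# Months to pay back a $5,000 machine at different levels of replaced cloud spend.
machine_cost = 5000
for monthly_savings in (150, 400, 1000):
    months = machine_cost / monthly_savings
    print(f"${monthly_savings}/month replaced -> ~{months:.0f} months to pay back")
# -> ~33, ~12, and ~5 months
```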
I dig into the runway math more carefully in what you pay before customers arrive; if you want the full pre-revenue cost picture, that's the piece.
How I actually run it
I should probably name what I have, since the brief invites it. I run three machines for the local side, engine-01, core-01, and store-01. They're roles, not redundancy. Engine-01 is the GPU-heavy one and does the fine-tunes and image gen. Core-01 is the always-on workhorse for batch and eval, it has been running essentially continuously for months without a reboot, which is the kind of stability you get from Apple Silicon that you do not get from a self-built Linux box. Store-01 holds the artifact mirror and the dataset library, it doesn't need much GPU, it needs lots of disk.
This is overkill for one MVP. One Mac Studio does the job for a single-consultant product. I have three machines because I'm doing too many things at once, which is its own problem. Start with one. You can always add more, and they cluster trivially because each one runs the same software and pulls from the same SQS queue.
What I want you to take from this
The local side of the stack is not a hobbyist setup or a nice-to-have. It is a serious piece of the architecture that does the work cloud is bad at (step-by-step, batch, training-heavy) at a fraction of the cost. The customer-facing side stays in the cloud, where it belongs. The kitchen runs on the desk, where the math makes sense.
Buy the machine. Install the five tools. Wire up the batch runner. The rest of the series is about the cloud side, the product loop, and the parts of the pattern that turn a working hybrid stack into a real AI product.
Tomorrow's piece is on auth and multi-tenancy, the cheapest piece of foundation to get right on day one and the most expensive piece to retrofit. Stay close.