Failure modes: graceful degradation when something's down

Bedrock rate-limited, the Mac Studio offline, the customer asking something the AI can't handle: the graceful degradation patterns that fall back to human-only without the customer noticing.

The first AI product I shipped that handled a real outage gracefully did so almost by accident. Bedrock had a regional throttling event one Tuesday afternoon: calls started returning 429s, then timing out, then coming back slowly. The system limped, partially recovered, limped again. Customers kept using it. Nobody noticed for about two hours. I hadn't been clever. I'd spent the previous month being burned by smaller, weirder failures, and out of self-preservation I'd built a few defensive patterns into the request path. When the big one hit, those patterns held.

This piece is about those patterns. The failure modes an AI product hits in its first year are largely predictable. The thing that distinguishes products that survive their first bad outage from ones that lose customers is whether they've decided, in advance, what the system does when something it depends on isn't there. That's the difference between graceful degradation and cascading collapse. It's an architecture question, decided on day one and revisited every time you add a dependency.

[Figure: Graceful degradation chain. Tiers, top to bottom: Bedrock Sonnet (normal); fall back to Bedrock Haiku (rate-limited / slow); fall back to Llama on Bedrock (continued degradation); queue + email customer (all models down); pure human handoff (AI fully offline, circuit breaker). Each tier kicks in when the one above it fails. Customer never sees a 500.]

Layman version. A marketing strategist has productized her brand-positioning method into an AI service for small business owners. A customer asks a question. The model service is rate-limited today. Without a plan, the customer waits, gets a generic timeout, and decides this product is broken. With a plan, the customer gets one of three things: a slightly slower answer (the system fell back to a smaller model), a clear "we're routing this to a reviewer who'll respond within an hour" message (with the request already in the strategist's queue), or, for some questions, an immediate reply because the AI recognized the question was outside its scope. None of those is the perfect product working perfectly. All of them are products the customer keeps using.

The four failure modes

I keep a short list at the top of my head, and every new feature gets walked through it before I ship.

External model service unavailable or throttled. Bedrock has a bad day, your model is rate-limited, latency spikes past your timeout. This is the one people plan for, and still the one that bites hardest: the failure is rarely binary, so a naive timeout-and-retry pattern piles load onto an already-struggling service.

Back-office infrastructure unavailable. In a hybrid stack, the Mac Studio side runs whisper transcriptions, fine-tune jobs, eval batches, sometimes the secret-sauce model. Power cut, network blip, RAM-stuck job: whatever the cause, the cloud half has to know how to handle work that was supposed to run locally. Most stacks don't. They let the SQS queue grow until it dies of old age.

Customer query falls outside what the AI can handle. Nothing is technically broken. The model returned an answer. But the answer was wrong: the question was outside the playbook, the retrieval corpus had nothing relevant, the AI confabulated, and the reviewer couldn't catch it because the answer sounded plausible. This is a failure of scope detection.

Internal AWS infra failure. Postgres failover takes longer than your Lambda timeout. The vector index is rebuilding. Cold-start cascades. These are normal AWS-day failures with well-understood patterns, but patterns only help if you've actually applied them.

Each gets a different response. Lumping them under "we'll retry" is the failure mode behind the failure modes.

Circuit breakers

A circuit breaker is a piece of code that stops calling a failing service for a while so it can recover (that's the term to search for if you want to look it up later). It is the most leveraged code you'll write for external-model failures. The wrapper that calls Bedrock keeps a counter of recent failures. When the count crosses a threshold (say, five errors in thirty seconds) the breaker opens. While open, calls don't go to Bedrock; they immediately return a "service unavailable" signal. After a cooldown the breaker enters a half-open state and allows one probe call through. Probe succeeds, breaker closes; probe fails, breaker stays open.

The worst outcome during a partial outage is a thundering herd of your own retries hammering a struggling service. The breaker turns your product into a polite consumer rather than a contributor to the problem. Customers experience fast failures (which sounds bad but is good: fast failure leaves time to fall back) instead of long timeouts.

Key detail: breaker state is per model, not global. Sonnet throttled doesn't mean Haiku is. That's what enables the next pattern.
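To make the state machine concrete, here is a minimal sketch of a per-model breaker. It is not the code from my stack: the class names, the 60-second cooldown, and the injectable clock are all assumptions made for illustration, and a real version would need locking if several threads share one breaker.

```python
import time
from collections import deque


class CircuitBreaker:
    """Closed -> open after too many recent failures -> half-open after a cooldown."""

    def __init__(self, max_failures=5, window_s=30.0, cooldown_s=60.0,
                 clock=time.monotonic):
        self.max_failures = max_failures
        self.window_s = window_s
        self.cooldown_s = cooldown_s      # assumed value, tune per service
        self.clock = clock                # injectable so tests can fake time
        self.failures = deque()           # timestamps of recent failures
        self.opened_at = None             # None means the breaker is closed
        self.probe_in_flight = False

    def allow(self):
        """Return True if a call may go to the model right now."""
        if self.opened_at is None:
            return True                   # closed: normal traffic
        if self.clock() - self.opened_at < self.cooldown_s:
            return False                  # open: fail fast
        if self.probe_in_flight:
            return False                  # half-open: only one probe at a time
        self.probe_in_flight = True
        return True

    def record_success(self):
        # A successful call (including a probe) closes the breaker.
        self.failures.clear()
        self.opened_at = None
        self.probe_in_flight = False

    def record_failure(self):
        now = self.clock()
        if self.probe_in_flight:
            # Failed probe: stay open and restart the cooldown.
            self.probe_in_flight = False
            self.opened_at = now
            return
        self.failures.append(now)
        # Drop failures that have aged out of the window.
        while self.failures and now - self.failures[0] > self.window_s:
            self.failures.popleft()
        if len(self.failures) >= self.max_failures:
            self.opened_at = now


class BreakerRegistry:
    """One breaker per model ID, so a throttled Sonnet does not block Haiku."""

    def __init__(self, **breaker_kwargs):
        self._kwargs = breaker_kwargs
        self._breakers = {}

    def for_model(self, model_id):
        if model_id not in self._breakers:
            self._breakers[model_id] = CircuitBreaker(**self._kwargs)
        return self._breakers[model_id]
```

The calling wrapper checks `allow()` before each request and reports the outcome with `record_success()` or `record_failure()`. When `allow()` returns False the wrapper returns its "service unavailable" signal without touching the network.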

Fallback models, the cheaper sibling

Every model call is wrapped in a chain. Try Sonnet first; if its breaker is open or the call fails, try Haiku; if Haiku fails, decide whether to fail loudly, queue for later, or fall back to human-only.

The chain is per call site, not global. A high-stakes diagnose call: Sonnet only, no fallback, fail loudly to the queue. A low-stakes routing call: Sonnet primary, Haiku fallback, keyword classifier as a third tier. The chain lives in version-controlled prompt config, not buried in the model-calling code.

The honest tradeoff: the fallback is usually less capable. So the audit row gets a "which model actually answered" field, the eval suite knows about each tier, and CloudWatch alarms fire on the duration of degraded operation. The consultant knows this morning's batch ran degraded and might warrant extra review. Degraded mode is a state the product knows it's in, not a quiet quality drop.
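A per-call-site chain with an audit field might look roughly like the sketch below. The `CHAINS` dict stands in for the version-controlled prompt config, and the tier names, the `Result` shape, and the `breaker_open` hook are hypothetical; the model calls are plain callables so the example runs without AWS.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional

# Hypothetical per-call-site chains; in practice this would be loaded from
# version-controlled config rather than defined in code.
CHAINS = {
    "diagnose": ["sonnet"],                    # high stakes: no fallback
    "route": ["sonnet", "haiku", "keyword"],   # low stakes: three tiers
}


@dataclass
class Result:
    text: Optional[str]        # None means every tier failed
    served_by: Optional[str]   # audit field: which tier actually answered
    degraded: bool             # True unless the first tier answered


def call_with_fallback(
    call_site: str,
    prompt: str,
    tiers: Dict[str, Callable[[str], str]],
    breaker_open: Callable[[str], bool] = lambda tier: False,
) -> Result:
    """Walk the chain for this call site and record which tier served it."""
    for position, tier in enumerate(CHAINS[call_site]):
        if breaker_open(tier):
            continue                      # skip tiers whose breaker is open
        try:
            text = tiers[tier](prompt)
        except Exception:
            continue                      # this tier failed, try the next one
        return Result(text, tier, degraded=position > 0)
    # Chain exhausted: the caller decides whether to queue or hand off to a human.
    return Result(None, None, degraded=True)
```

The caller writes `served_by` and `degraded` into the audit row, which is what lets the dashboard and the eval suite tell a normal answer from a degraded one.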

Queueing for retry

Some failures are best handled by punting. The work isn't time-critical: a transcription job, a batch summary, a fine-tune trigger. Put it on a delay queue, return "we've got this, results in a few minutes," and process it when capacity is back.

For a career coach who's productized her resume-and-positioning review service, this maps cleanly. A customer uploads a resume. The back-office summarization model is down today. They don't need feedback in twenty seconds; they're going to read it over coffee. UX: "Thanks, your review is being prepared. We'll email you within ten minutes." The work goes into SQS with a delay; the worker retries with backoff until the underlying service comes back.

The patterns are boring. Idempotent job handlers. Bounded retry counts with a dead-letter queue (after N failures the job goes to the DLQ and a human gets paged, because it has stopped being transient). Visible queue depth in dashboards. Clear customer messaging that doesn't lie about timing.

What you don't want is silent retry forever. I've inherited systems where SQS queues had hundreds of thousands of messages backed up, retrying every thirty seconds against a service deprecated months earlier. Nobody noticed because the retries didn't error visibly. Bound the retries. Page on the DLQ.
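The three boring patterns fit in a few lines. This is an in-memory sketch with assumed names and numbers (`MAX_ATTEMPTS`, the 30-second base delay, the `Worker` class); on real SQS the dead-lettering is normally done by the queue's redrive policy rather than by worker code.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set, Tuple

MAX_ATTEMPTS = 5       # assumption: the "N" after which a failure is not transient
BASE_DELAY_S = 30      # assumption: first retry delay
MAX_DELAY_S = 900      # cap on the backoff (15 minutes)


@dataclass
class Job:
    job_id: str
    payload: str
    attempts: int = 0


@dataclass
class Worker:
    handler: Callable[[str], None]
    done: Set[str] = field(default_factory=set)                  # idempotency record
    retry: List[Tuple[int, Job]] = field(default_factory=list)   # (delay_s, job)
    dlq: List[Job] = field(default_factory=list)                 # page a human on these

    def process(self, job: Job) -> str:
        if job.job_id in self.done:
            return "duplicate"            # redelivery of finished work is a no-op
        try:
            self.handler(job.payload)
        except Exception:
            job.attempts += 1
            if job.attempts >= MAX_ATTEMPTS:
                self.dlq.append(job)      # bounded: stop retrying, surface it
                return "dead-lettered"
            # Exponential backoff: 30s, 60s, 120s, ... capped at MAX_DELAY_S.
            delay = min(BASE_DELAY_S * 2 ** (job.attempts - 1), MAX_DELAY_S)
            self.retry.append((delay, job))
            return "retry"
        self.done.add(job.job_id)
        return "done"
```

The point of the sketch is the shape: every job ends in exactly one of `done`, `retry`, or `dlq`, so nothing can loop forever without showing up somewhere a human looks.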

When the Mac Studio is just gone

The local side disappears for a dozen reasons: power cut, network outage, launchd job stopped, OS update reboot.

Cloud-side detection is a recurring heartbeat: every minute, the worker writes to a Postgres table or S3 key. A scheduled Lambda checks that the heartbeat is recent. If it's stale, the Lambda fires an alarm and starts queueing flagged work in a holding pattern.
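The staleness check itself is tiny. A sketch, assuming the heartbeat is stored as a UTC timestamp and that three minutes (two missed beats plus slack) counts as stale; the function names and the list-based queues are stand-ins for whatever the scheduled Lambda actually reads and writes.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the worker beats every minute, so three minutes of silence is stale.
STALE_AFTER = timedelta(minutes=3)


def local_side_status(last_beat, now=None):
    """Classify the local side from the timestamp of its last heartbeat."""
    now = now or datetime.now(timezone.utc)
    return "stale" if now - last_beat > STALE_AFTER else "healthy"


def route_local_work(job, last_beat, holding_queue, local_queue, now=None):
    """Send work to the local side, or park it while the local side is gone."""
    if local_side_status(last_beat, now) == "stale":
        holding_queue.append(job)   # held until the heartbeat comes back
        return "held"
    local_queue.append(job)
    return "dispatched"
```

Passing `now` explicitly keeps the check testable; in the Lambda it would default to the current time, and a "stale" result is also where the alarm fires.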

Customer experience depends on the work. For pure batch work (fine-tune jobs, weekly evals, asynchronous voice transcription), the customer never notices; bursty work absorbs hours of delay. The synchronous path (a query needing the locally-hosted fine-tuned model) falls back to a cloud model, the audit row records the fallback, and the eval expectations adjust. The product keeps working. It works less well, and the system knows it does, and the consultant knows the system knows.

Want to go deeper on the cloud/local split? The wiring is in the Mac Studio side of the stack and the hybrid sync pattern. The point here: the boundary needs explicit failure semantics, not a hopeful shrug.

The query that's outside scope

People don't classify this as a failure, because nothing technically broke. It's the one that does the most damage over time, because the AI confidently answered a question it shouldn't have touched.

A product PM consultant's decision-coaching service covers prioritization frameworks, opportunity sizing, stakeholder mapping. A customer types: "What's the right legal structure for my new LLC?" The AI has no business answering. It might still produce a plausible paragraph, because language models pattern-match anything into sentences. The right response: "That's outside what I'm trained to help with. Here's what I can help with instead."

Mechanism: scope detection at triage. Every query gets a quick in-scope/out-of-scope classification before reaching diagnose: a small LLM call with a tight prompt against the playbook scope. Out-of-scope queries get an honest "this is outside the product" response, with a referral path.
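As a sketch of the triage gate: the classifier and the diagnose step are passed in as callables, so the example runs without a model. In production `classify` would be the small LLM call. The label string `in_scope`, the function names, and the fail-closed default are my assumptions for illustration.

```python
# Scope taken from the decision-coaching example above.
IN_SCOPE_TOPICS = [
    "prioritization frameworks",
    "opportunity sizing",
    "stakeholder mapping",
]

REFERRAL = (
    "That's outside what I'm trained to help with. "
    "Here's what I can help with instead: "
)


def triage(query, classify, diagnose):
    """Run the scope check first; only in-scope queries reach diagnose.

    classify(query, topics) is expected to return the string "in_scope" or
    "out_of_scope". Anything else is treated as out of scope (fail closed).
    """
    label = classify(query, IN_SCOPE_TOPICS).strip().lower()
    if label != "in_scope":
        return {
            "path": "out_of_scope",
            "reply": REFERRAL + ", ".join(IN_SCOPE_TOPICS) + ".",
        }
    return {"path": "diagnose", "reply": diagnose(query)}
```

Failing closed means a garbled classifier response sends the customer a referral instead of a confabulated answer, which is the cheaper mistake of the two.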

The eval harness needs out-of-scope test cases as much as in-scope ones. The failure mode you're testing for is the AI confabulating an answer to a question it shouldn't have touched. Catch it before a customer does.

The honest UX of "we couldn't"

The hardest pattern to get right is the customer-visible message when degradation kicks in.

Wrong messages I see most: a generic "something went wrong" page; a deceptive "we're processing your request" spinner that loops forever; a lengthy technical explanation the customer can't act on. None of them preserves trust.

The right shape is short, specific, actionable. Three things: what happened (plain language), what the system is doing (concrete next step), what the customer should do (or shouldn't have to). Example: "We couldn't answer this one automatically. One of our reviewers is taking a look and will respond within an hour. You don't need to do anything; we'll email you." Honest about the limit. Specific about the recovery. Doesn't waste attention.

The mechanism behind the message has to deliver on what it says. The reviewer queue has to exist. Someone has to be watching it. The email has to land. If the message is a lie (if the queue is where requests go to die) you've made the failure worse than just showing an error, because you've also broken trust. The honest message requires honest infrastructure.

Test the failures, deliberately

The pattern I most underuse is deliberate failure injection. On a quiet weekday, flip a flag that simulates Bedrock returning 429s for thirty minutes. Watch what happens. Does the breaker open? Does the fallback chain engage? Does the customer-facing message show up?
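The flag can be as simple as a wrapper around the model call. A sketch, assuming an environment variable named `SIMULATE_BEDROCK_429` and a stand-in `ThrottledError`; both names are made up here, and a real setup might read the flag from a feature-flag service instead.

```python
import os


class ThrottledError(Exception):
    """Stand-in for a 429 from the model service."""


def with_fault_injection(call, flag="SIMULATE_BEDROCK_429", env=os.environ):
    """Wrap a model call so that a flag makes it fail like a throttled service."""

    def wrapped(*args, **kwargs):
        # The flag is read on every call, so it can be flipped while running.
        if env.get(flag) == "1":
            raise ThrottledError("injected 429")
        return call(*args, **kwargs)

    return wrapped
```

Because the injected error is raised inside the same wrapper the real call goes through, the breaker, the fallback chain, and the customer message all get exercised exactly as they would during a real throttling event.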

The first time I ran one I discovered my fallback chain was correctly configured for nine of ten call sites and silently misconfigured for the tenth: the one nobody had touched in six months, which had no fallback at all. You don't find that from reading code.

I do this monthly. It takes an hour. It catches the drift you can't see otherwise: the new Lambda that didn't get the wrapper, the prompt without a Haiku variant, the dashboard that stopped firing because a metric name changed.

The list, plainly

When I add a new dependency, I write down the answers to four questions before merging:

What does the system do if it's slow? (Timeouts, breaker thresholds.)

What does it do if it's unavailable for an hour? (Fallback chain, queueing, customer message.)

What does it do if it returns a wrong answer? (Eval coverage, scope detection, audit field for "which path served this".)

How will I know any of this happened? (Dashboards, alarms, audit fields, queue notifications.)

If any answer is "we'll figure it out when it happens," I don't merge. The figuring-out gets done in the calm hour before the dependency fails, not the panic hour after. Customers don't churn over a perfect product having an outage. They churn over an imperfect product whose outage broke trust. The architecture that keeps trust intact is mostly written down before the bad day arrives.