A Postmortem of Three Recent Issues
Source: https://www.anthropic.com/engineering/a-postmortem-of-three-recent-issues Author: Sam McAllister Published: Sep 17, 2025 Company: Anthropic
Summary
This is a technical report on three bugs that intermittently degraded responses from Claude. Between August and early September 2025, three infrastructure bugs caused degraded responses from Claude.
Key Points
How Anthropic Serves Claude at Scale
Claude is served via first-party API, Amazon Bedrock, and Google Cloud's Vertex AI
Deployed across multiple hardware platforms: AWS Trainium, NVIDIA GPUs, and Google TPUs
Each platform requires specific optimizations while maintaining strict equivalence standards
Timeline of Events
August 5: First bug introduced (context window routing error)
August 25-26: Two more bugs deployed
August 29: Load balancing change increased affected traffic
September 2-18: Fixes deployed across platforms
The Three Bugs
1. Context Window Routing Error
Some Sonnet 4 requests were misrouted to servers configured for 1M token context window
Initially affected 0.8% of requests, peaked at 16% on August 31
"Sticky" routing meant affected users continued getting degraded responses
2. Output Corruption
Misconfiguration on TPU servers caused token generation errors
Occasionally assigned high probability to wrong tokens
Produced Thai/Chinese characters in English responses, syntax errors in code
3. Approximate Top-k XLA:TPU Miscompilation
Code change triggered latent bug in XLA:TPU compiler
Related to mixed precision arithmetic (bf16 vs fp32)
Bug behavior was frustratingly inconsistent
XLA Compiler Bug Deep Dive
Models calculate probabilities for each possible next word
Use "top-p sampling" with threshold of 0.99-0.999
Precision mismatch between bf16 and fp32 caused issues
Approximate top-k operation sometimes returned completely wrong results
Fixed by switching from approximate to exact top-k
Why Detection Was Difficult
Evaluations didn't capture the degradation users reported
Privacy practices limited engineer access to user interactions
Each bug produced different symptoms on different platforms
Overlapping bugs created confusing, contradictory reports
What They're Changing
More sensitive evaluations
Quality evaluations running continuously on production systems
Faster debugging tooling
Better tools to debug community-sourced feedback while preserving privacy
Key Quote
"To state it plainly: We never reduce model quality due to demand, time of day, or server load. The problems our users reported were due to infrastructure bugs alone."
Last updated