API - 503 Service Unavailable
Resolved
Mar 26 at 09:57pm UTC
Problem Description, Impact, and Resolution
At 14:22 UTC on March 26, 2025, the API in the primary region of both our US and EU services became unavailable. After a few seconds of unavailability, our services correctly redirected traffic to another region, which served traffic successfully until 14:24 UTC, when the services in that region also became unavailable. As a result, customers received a 503 Service Unavailable status code and could not interact with our APIs.
The issue was caused by a policy change that removed access to an asset before that asset had been fully de-provisioned from the health check of the underlying service. As a result, the health check failed and our system removed the service from rotation, which caused a cascading failure in our API. We rolled back the policy change at 14:25 UTC, and the issue was resolved at 14:32 UTC.
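The failure mode described above can be illustrated with a minimal sketch. This is not the actual implementation: the `Region` class, the asset names, and the helper functions are all hypothetical, and the model assumes only that a region stays in rotation while every asset probed by its health check is still accessible.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """Hypothetical model of one region of the underlying service."""
    name: str
    health_check_assets: set   # assets the health check still probes
    accessible_assets: set     # assets the current policy grants access to

    def healthy(self) -> bool:
        # The check passes only if every probed asset is still accessible.
        return self.health_check_assets <= self.accessible_assets


def in_rotation(regions):
    """Return the names of regions that would still receive traffic."""
    return [r.name for r in regions if r.healthy()]


def apply_policy_change(regions, revoked_asset):
    """Revoke access to an asset everywhere (the change that went wrong)."""
    for r in regions:
        r.accessible_assets.discard(revoked_asset)


def deprovision_from_health_check(regions, asset):
    """Stop probing an asset; the step that should have come first."""
    for r in regions:
        r.health_check_assets.discard(asset)


def make_regions():
    # "old-asset" and "db" are illustrative names only.
    return [
        Region("primary", {"old-asset", "db"}, {"old-asset", "db"}),
        Region("secondary", {"old-asset", "db"}, {"old-asset", "db"}),
    ]


# Wrong order: access is revoked while the health check still probes it,
# so every region fails its check and drops out of rotation.
broken = make_regions()
apply_policy_change(broken, "old-asset")

# Safe order: de-provision from the health check first, then revoke access.
safe = make_regions()
deprovision_from_health_check(safe, "old-asset")
apply_policy_change(safe, "old-asset")
```

In this sketch the ordering of the two steps is the only difference between a full outage and an uneventful change, which is why the policy change and the health check configuration need to be coordinated.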
Mitigation Steps and Future Preventative Measures
We have fixed the issue with the policy change and successfully released the platform update. To ensure a similar issue does not occur again, we are actively working on the following immediate fixes within our platform:
- In addition to our existing automated smoke testing release process, we will ensure these policy changes have specific smoke tests in our secondary region before being promoted to our primary region. We will also review our entire portfolio of services and ensure any missing smoke tests are added to remove the chance of complete regional cascading failures in the future.
- We are reducing our primary dependency on this service to limit the impact and blast radius that a failure in it can have on our API.
- In addition to the automated steps above, we will review and update our code review processes and procedures to ensure manual checks are in place for policy changes.
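The first measure above, smoke testing a policy change in the secondary region before promoting it to the primary region, can be sketched as a simple promotion gate. This is an illustration under stated assumptions, not the actual release tooling: `promote_policy_change`, its callback parameters, and the region names are hypothetical.

```python
from typing import Callable, List, Sequence

# A smoke test takes a region name and reports whether the region looks healthy.
SmokeTest = Callable[[str], bool]


def promote_policy_change(
    apply_change: Callable[[str], None],
    rollback: Callable[[str], None],
    smoke_tests: Sequence[SmokeTest],
    regions: Sequence[str] = ("secondary", "primary"),
) -> List[str]:
    """Apply a change region by region, stopping at the first failed gate.

    Returns the regions where the change was applied and passed its
    smoke tests. Regions later in the order are never touched if an
    earlier region fails.
    """
    promoted: List[str] = []
    for region in regions:
        apply_change(region)
        if not all(test(region) for test in smoke_tests):
            # Undo the change in the failing region and halt promotion.
            rollback(region)
            return promoted
        promoted.append(region)
    return promoted


# Example: a change that breaks the health check is caught in the
# secondary region and never reaches the primary region.
applied: List[str] = []
rolled_back: List[str] = []
result = promote_policy_change(
    apply_change=applied.append,
    rollback=rolled_back.append,
    smoke_tests=[lambda region: False],  # simulated failing smoke test
)
```

With the failing smoke test in this example, the change is applied to and rolled back from the secondary region only, so the primary region keeps serving traffic.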
We will also be investigating and prioritizing the following changes over the next few months:
- Introducing a new environment to further stage changes within our deployment process and create another smoke test gate within our release lifecycle.
- Simplifying coordination of service and policy deployment into a single deployable.
Affected services
🇺🇸 API
🇪🇺 API
Updated
Mar 26 at 02:33pm UTC
This has been resolved. We will follow up with an RCA on this issue within the next 24 hours.
Affected services
🇺🇸 API
🇪🇺 API
Created
Mar 26 at 02:27pm UTC
API is currently having issues serving requests.
Affected services
🇺🇸 API
🇪🇺 API