Downtime

API - 503 Service Unavailable

Mar 26 at 02:27pm UTC
Affected services
🇺🇸 API
🇪🇺 API

Resolved
Mar 26 at 09:57pm UTC

Problem Description, Impact, and Resolution

At 14:22 UTC on March 26th, 2025, the API in the primary region of both our US and EU services became unavailable. After a few seconds of unavailability, our services correctly redirected traffic to another region, which served requests successfully until 14:24 UTC, when it also became unavailable. As a result, customers received a 503 Service Unavailable status code and could not interact with our APIs.

The issue was caused by a policy change that removed access to an asset before that asset had been fully de-provisioned from the health check of the underlying service. The health check then failed, our system removed the service from rotation, and the failure cascaded to our API. We rolled back the policy change at 14:25 UTC, and the issue was resolved at 14:32 UTC.
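
The ordering problem described above can be illustrated with a small pre-flight check. This is a minimal sketch in Python, not our actual tooling: the `HealthCheck` model, the `blocked_revocations` function, and the asset and check names are all hypothetical, chosen only to show how a policy change that revokes access to an asset still referenced by a health check could be flagged before rollout.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class HealthCheck:
    """Hypothetical model: a health check and the assets it must be able to read."""
    name: str
    required_assets: set[str] = field(default_factory=set)


def blocked_revocations(revoked_assets, health_checks):
    """Map each revoked asset to the health checks that still depend on it.

    An empty result means the policy change does not remove access to
    anything a health check still needs.
    """
    revoked = set(revoked_assets)
    blocked = {}
    for check in health_checks:
        for asset in sorted(check.required_assets & revoked):
            blocked.setdefault(asset, []).append(check.name)
    return blocked


# Illustrative data only; these names are assumptions, not real resources.
checks = [
    HealthCheck("api-primary", {"config-bucket", "legacy-asset"}),
    HealthCheck("api-secondary", {"config-bucket"}),
]
conflicts = blocked_revocations({"legacy-asset"}, checks)
print(conflicts)  # {'legacy-asset': ['api-primary']}
```

In this sketch a non-empty result would block the policy change until the asset has been removed from the listed health checks.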

Mitigation Steps and Future Preventative Measures

We have fixed the issue with the policy change and successfully released the platform update. To ensure a similar issue does not occur again, we are actively working on the following immediate fixes within our platform:

  1. In addition to our existing automated smoke testing release process, we will ensure these policy changes have specific smoke tests in our secondary region before being promoted to our primary region. We will also review our entire portfolio of services and ensure any missing smoke tests are added to remove the chance of complete regional cascading failures in the future.
  2. We are reducing our primary dependency on this service to limit the impact and blast radius it can have on our API.
  3. In addition to the automated steps above, we will review and update our code review processes and procedures to ensure manual checks are in place for policy changes.
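
The staged promotion in the first item above could look roughly like the following. This is a sketch under stated assumptions rather than a description of our release pipeline: `promote_policy` and the `apply`, `smoke_test`, and `rollback` callables are hypothetical stand-ins for whatever deployment and testing hooks exist, and the region names are placeholders.

```python
def promote_policy(policy_id, regions, apply, smoke_test, rollback):
    """Apply a policy change one region at a time, in the order given.

    After each region, run its smoke test. On the first failure, roll back
    every region touched so far (most recent first) and stop, so later
    regions are never changed. Returns True only if every region passed.
    """
    applied = []
    for region in regions:
        apply(policy_id, region)
        applied.append(region)
        if not smoke_test(region):
            for done in reversed(applied):
                rollback(policy_id, done)
            return False
    return True


# Usage with stand-in callables: the secondary region is promoted first,
# and its smoke test is made to fail to show that primary is never touched.
log = []
ok = promote_policy(
    "policy-change-1",                       # hypothetical identifier
    ["secondary", "primary"],                # secondary gates primary
    apply=lambda p, r: log.append(("apply", r)),
    smoke_test=lambda r: r != "secondary",   # simulated failure
    rollback=lambda p, r: log.append(("rollback", r)),
)
print(ok)   # False
print(log)  # [('apply', 'secondary'), ('rollback', 'secondary')]
```

Ordering the regions so that the secondary comes first means a failing change is caught and reverted while the primary region is still serving traffic on the previous policy.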

We will also be investigating and prioritizing the following changes over the next few months:

  1. Introducing a new environment to further stage changes within our deployment process and create another smoke test gate within our release lifecycle.
  2. Simplifying coordination of service and policy deployment into a single deployable.

Updated
Mar 26 at 02:33pm UTC

This has been resolved. We will follow up with an RCA on this issue within the next 24 hours.

Created
Mar 26 at 02:27pm UTC

API is currently having issues serving requests.