Downtime

High Request Latencies in US

Nov 29 at 06:08pm UTC
Affected services
🇺🇸 API

Resolved
Dec 01 at 09:39pm UTC

Problem Description, Impact, and Resolution

At 18:08 UTC on November 29th, customers in our US region began experiencing higher-than-normal latencies on all requests, at points reaching as high as 5 seconds, and some requests failed with 504 or 499 status codes. During this incident, 7.1% of all requests failed. The immediate cause was a drastic increase in CPU usage on our API’s primary database in the US region: an issue with data indexing meant that some queries required significantly more CPU and memory. At 18:33 UTC we deployed a maintenance fix that optimized the indexes and saw an immediate decrease in CPU usage and latencies, though it took until 19:29 UTC for the fix to be fully rolled out across the US region and for the incident to be fully resolved, with latencies back to normal ranges.
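
For illustration only: the report does not specify our database engine or maintenance tooling, but a scheduled index-optimization job of the kind described above might look like the following sketch, assuming a PostgreSQL-compatible database, the psycopg2 driver, and hypothetical table names.

```python
# Illustrative sketch only: the incident report does not name the database or
# tooling, so this assumes a PostgreSQL-compatible database, the psycopg2
# driver, and hypothetical table names.
import psycopg2

HIGH_VOLUME_TABLES = ["requests", "events"]  # hypothetical table names

def optimize_indexes(dsn: str) -> None:
    """Refresh planner statistics and rebuild indexes on high-volume tables."""
    conn = psycopg2.connect(dsn)
    conn.autocommit = True  # run each statement outside an explicit transaction
    try:
        with conn.cursor() as cur:
            for table in HIGH_VOLUME_TABLES:
                cur.execute(f"ANALYZE {table};")        # refresh statistics for the query planner
                cur.execute(f"REINDEX TABLE {table};")  # rebuild stale or bloated indexes
                # Note: plain REINDEX takes a lock; REINDEX CONCURRENTLY avoids
                # blocking writes at the cost of a slower rebuild.
    finally:
        conn.close()

if __name__ == "__main__":
    optimize_indexes("postgresql://maintenance:secret@db-primary:5432/api")
```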

The root cause of the incident was not immediately apparent. In the interim, we prioritized a process to regularly optimize indexes as system load and data volume increase, and we put in place additional monitoring and alert triggers for CPU usage thresholds on our databases. The following day, on November 30th at 18:10 UTC, a similar increase in CPU usage occurred, but the optimized indexes and new monitoring allowed us to take mitigating action and ensure there was no impact on customers.
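
As a rough illustration of the kind of CPU-threshold alert trigger mentioned above (our actual monitoring stack is not described in this report, so the metric source, paging hook, and threshold below are hypothetical placeholders):

```python
# Illustrative sketch only: get_db_cpu_percent and page_on_call are hypothetical
# placeholders for a real metrics source and paging integration.
import time

CPU_THRESHOLD_PCT = 80.0    # assumed threshold; not a value from the incident
CHECK_INTERVAL_SEC = 60

def get_db_cpu_percent() -> float:
    # Placeholder: in practice this would query the metrics system
    # for the primary database's CPU utilization.
    return 0.0

def page_on_call(message: str) -> None:
    # Placeholder: in practice this would notify the on-call engineer.
    print(f"ALERT: {message}")

def watch_db_cpu() -> None:
    """Poll database CPU usage and alert when it crosses the threshold."""
    while True:
        cpu = get_db_cpu_percent()
        if cpu >= CPU_THRESHOLD_PCT:
            page_on_call(f"Primary DB CPU at {cpu:.1f}% (threshold {CPU_THRESHOLD_PCT}%)")
        time.sleep(CHECK_INTERVAL_SEC)

if __name__ == "__main__":
    watch_db_cpu()
```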

Following the second occurrence, we determined that the root cause of this database behavior was an upgrade to our client library on November 29th, which produced queries that our database could not optimize for larger-volume customers. We immediately rolled back the upgrade and have not seen a recurrence of the issue.

Mitigation Steps and Future Preventative Measures

To ensure this issue does not occur again, we have:

  • Increased the volume of data our automated tests use to ensure we catch any query degradations
  • Scheduled regular index maintenance

To ensure that issues similar to this one do not arise in the future, we will be:

  • Updating our load and acceptance testing plans with significantly more aggressive loads and data volumes.
  • Increasing visibility and monitoring of low-level metrics that may indicate similar problems.
  • Increasing monitoring and alerting around degraded latencies experienced by our customers.

Updated
Nov 29 at 07:22pm UTC

Systems are fully operational.

Created
Nov 29 at 06:08pm UTC

We're experiencing instability and are investigating.