High Request Latencies in US
Resolved
Dec 01 at 09:39pm UTC
Problem Description, Impact, and Resolution
At 18:08 UTC on November 29th, customers in our US region started experiencing higher-than-normal latencies for all requests, at points reaching as high as 5 seconds, and some requests failed with 504 or 499 status codes. During this incident, 7.1% of all requests failed. The immediate cause was a drastic increase in CPU usage on our API's primary database in the US region: an issue with data indexing caused some queries to require increased CPU and memory. At 18:33 UTC we deployed a maintenance solution that optimized the indexes, and we saw an immediate decrease in CPU usage and latencies. It took until 19:29 UTC for the fix to be fully rolled out across the US region, at which point latencies returned to normal ranges and the incident was fully resolved.
The root cause of the incident was not immediately apparent. As an initial response, we prioritized a process to regularly optimize indexes as system load and data volume increase, and we put additional monitoring and alert triggers in place for CPU-level thresholds in our databases. The following day, on November 30th at 18:10 UTC, a similar increase in CPU usage occurred; the optimized indexes and new monitoring enabled us to take mitigating action and ensure there was no impact on customers.
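The CPU-threshold alerting described above can be sketched as a simple consecutive-sample check, which fires only on sustained breaches rather than brief spikes. This is a minimal illustration, not our production tooling; the threshold and window values are hypothetical:

```python
# Minimal sketch of a CPU-threshold alert trigger: fire only when CPU
# stays above the threshold for several consecutive samples, so a brief
# spike does not page anyone. Values below are hypothetical.

CPU_THRESHOLD_PCT = 85.0   # hypothetical alerting threshold
CONSECUTIVE_SAMPLES = 3    # hypothetical: breaches required in a row

def should_alert(cpu_samples: list[float]) -> bool:
    """Return True if the most recent samples all exceed the threshold."""
    if len(cpu_samples) < CONSECUTIVE_SAMPLES:
        return False
    recent = cpu_samples[-CONSECUTIVE_SAMPLES:]
    return all(sample > CPU_THRESHOLD_PCT for sample in recent)

print(should_alert([40.0, 95.0, 42.0, 41.0]))  # prints False (transient spike)
print(should_alert([40.0, 90.0, 92.0, 96.0]))  # prints True (sustained breach)
```

Requiring several consecutive breaches trades a little detection latency for far fewer false pages during momentary load spikes.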
Following the second occurrence, we determined that the root cause was an upgrade to one of our client libraries on November 29th, which produced queries that our database could not optimize for larger-volume customers. We immediately rolled back the upgrade and have not seen a recurrence of the issue.
Mitigation Steps and Future Preventative Measures
To ensure this issue does not occur again we have:
- Increased the volume of data our automated tests use to ensure we catch any query degradations
- Scheduled regular index maintenance
To ensure similar issues to this one do not arise in the future, we will be:
- Updating our load and acceptance testing plans with significantly more aggressive loads and data volumes.
- Increasing visibility and monitoring of low-level metrics that may indicate similar problems.
- Increasing monitoring and alerting around degraded latencies experienced by our customers.
Affected services
🇺🇸 API
Updated
Nov 29 at 07:22pm UTC
Systems are fully operational.
Affected services
🇺🇸 API
Created
Nov 29 at 06:08pm UTC
We're experiencing instability and are investigating.
Affected services
🇺🇸 API