Intermittent 500 Errors
Resolved
Nov 20 at 12:37am UTC
Problem Description, Impact, and Resolution
At approximately 11:30 UTC on November 18, 2025, customers began experiencing failures when calling any Basis Theory service due to a global outage at our edge provider, Cloudflare. The outage impacted DNS resolution, edge routing, and WAF processing across Cloudflare's network, preventing customer requests from reaching our public endpoints.
During the peak of the event, Cloudflare returned 500-series errors for 100% of inbound traffic, resulting in a complete outage across all products, including Vault, Card Management Services, Elements, and 3DS. Customers reliant on Basis Theory for payment collection or card processing were unable to complete those operations during the outage window. As Cloudflare's network partially recovered at various points, some customers experienced intermittent success, leading to inconsistent behavior across geographies and increased retry traffic.
Our internal systems, including Vault and all supporting services, remained fully healthy and available throughout the incident, with no degradation in performance or capacity. However, because Cloudflare serves as the first-hop ingress for all customer traffic, the global failure of their DNS and routing layers prevented any requests from reaching our infrastructure.
During the incident, we followed our disaster recovery playbook to bypass Cloudflare by disabling its edge processing and routing traffic directly to our AWS load balancers. These changes were unsuccessful, as Cloudflare's degraded services appeared to prevent DNS updates from propagating externally. This left no viable alternative path to reroute production traffic during the outage.
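As a rough illustration of what that playbook step involves (not the exact commands we ran), the sketch below disables Cloudflare's proxying on a DNS record via the Cloudflare v4 API so the hostname resolves directly to an origin load balancer. The zone ID, record ID, API token, and load-balancer hostname are placeholders.

```python
# Hedged sketch: disable Cloudflare proxying ("orange cloud") on a DNS record so
# traffic resolves directly to an AWS load balancer instead of Cloudflare's edge.
# ZONE_ID, RECORD_ID, the API token, and the ALB hostname are placeholders.
import os
import requests

API_TOKEN = os.environ["CLOUDFLARE_API_TOKEN"]   # placeholder credential
ZONE_ID = "REPLACE_WITH_ZONE_ID"
RECORD_ID = "REPLACE_WITH_DNS_RECORD_ID"

resp = requests.patch(
    f"https://api.cloudflare.com/client/v4/zones/{ZONE_ID}/dns_records/{RECORD_ID}",
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    json={
        "type": "CNAME",
        "name": "api.basistheory.com",
        "content": "example-alb.us-east-1.elb.amazonaws.com",  # hypothetical AWS target
        "proxied": False,  # serve DNS-only, bypassing Cloudflare's edge
        "ttl": 60,         # short TTL so the change can take effect quickly
    },
    timeout=10,
)
resp.raise_for_status()
print(resp.json()["result"]["proxied"])  # expect False once the change is accepted
```

During this incident, even a correctly issued change like this could not take effect externally, because Cloudflare's own DNS propagation was degraded.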
Cloudflare's global services began to gradually recover starting at 14:30 UTC, leading to a steady decline in error rates. By 17:30 UTC, Cloudflare's network had fully stabilized, and all traffic to Basis Theory was routing normally, with request success rates returning to 100%. No additional corrective actions were required on our side once Cloudflare restored its global network, and no further customer impact was observed.
To prevent this type of outage in the future, we are redesigning our edge architecture to eliminate Cloudflare as a single point of failure, strengthening our operational playbooks for faster edge bypass during vendor outages, and enhancing our monitoring to detect regional or intermittent failures more promptly.
Detailed Timeline of Events
- 2025-11-18
- 11:30 (UTC) - Cloudflare begins having issues routing and serving requests across all Basis Theory domains (js.basistheory.com, api.basistheory.com, 3ds.basistheory.com).
- 11:57 (UTC) - First internal page fired due to synthetic test failures.
- 12:07 (UTC) - Some calls began sporadically succeeding across all products (intermittent recovery).
- 12:23 (UTC) - Vault traffic degraded; ~50% success rate observed.
- 12:47 (UTC) - Some calls again began sporadically succeeding across all products (intermittent recovery).
- 13:00 (UTC) - Full global outage; 100% of customer traffic failing.
- Team initiates disaster recovery efforts under the assumption that Cloudflare's load balancer and routing layers are failing. Plan established to bypass Cloudflare entirely.
- 13:40 (UTC) - Attempted mitigation: Updated Cloudflare DNS for api.basistheory.com to route traffic directly to AWS.
- The assumption was that this would restore ~80% of US traffic, but no improvement was observed. We determined the change had no effect, likely because Cloudflare's DNS updates were not fully propagating due to their degraded edge.
- 14:08 (UTC) - Cloudflare routing re-enabled; error rates spike due to system retries and customer retry logic.
- 14:30 (UTC) - Cloudflare systems began to recover. Error rates drop below 50%.
- 14:40 (UTC) - Error rates drop below 15% and continue declining over the next 40 minutes.
- 17:30 (UTC) - All customer requests succeeding; service fully restored.
Root Cause Explanation
Cloudflare experienced a global network failure affecting DNS resolution, edge routing, WAF processing, and global load balancing. Their incident summary is available at https://blog.cloudflare.com/18-november-2025-outage/, although the root cause of the Cloudflare outage does not change the fact that we have a single point of failure in our edge routing. Below is a description of why this outage caused a Basis Theory outage.
Basis Theory's architecture relies on Cloudflare for:
- Public DNS
- Global CDN
- WAF
- Traffic steering & routing
- Load balancing
Because Cloudflare is the exclusive ingress path for all customer traffic, their global outage made all Basis Theory products unreachable, even though our AWS infrastructure remained fully healthy.
Efforts to bypass Cloudflare through DNS changes were unsuccessful because Cloudflare's internal DNS services were also degraded and unable to propagate changes. This prevented traffic from being routed directly to AWS.
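As an aside on how this kind of propagation failure shows up, the short sketch below (using the dnspython library) queries public resolvers directly to see what answer the outside world actually receives for a hostname; the resolver choices are illustrative.

```python
# Hedged sketch: check what external resolvers return for a hostname, to see
# whether a DNS change has actually propagated outside the provider.
# Resolver IPs are public examples; requires `pip install dnspython`.
import dns.resolver

PUBLIC_RESOLVERS = {"google": "8.8.8.8", "quad9": "9.9.9.9"}

def external_answers(hostname: str) -> dict:
    results = {}
    for name, ip in PUBLIC_RESOLVERS.items():
        resolver = dns.resolver.Resolver(configure=False)
        resolver.nameservers = [ip]
        try:
            answers = resolver.resolve(hostname, "A", lifetime=5)
            results[name] = sorted(rr.to_text() for rr in answers)
        except Exception as exc:  # timeouts / SERVFAIL during provider degradation
            results[name] = [f"error: {exc}"]
    return results

# If the answers are still Cloudflare anycast IPs rather than the intended
# direct-to-AWS target, the bypass change has not propagated.
print(external_answers("api.basistheory.com"))
```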
As a result, Cloudflare became a critical single point of failure, and Basis Theory had no alternative routing path available during the incident.
What Worked and What Didnβt
What Worked
- Internal alerting correctly detected failures across all products
- Core infrastructure remained healthy and fully operational
- Internal traffic and health checks confirmed backend health throughout
What Didnβt Work
- Synthetic monitoring provided false confidence early in the event due to intermittent Cloudflare recoveries
- DNS proxy-disable/fallback routing changes could not propagate during Cloudflare's global degradation
- No independent DNS provider or edge-bypass path existed to re-route traffic during Cloudflare failure
- Cloudflare's full ownership of DNS + routing + proxying created complete ingress lock-in
- Basis Theory's status page was also impacted by these issues, delaying communication to our customers about the impact on our systems.
Future Prevention & Next Steps
To ensure an outage of this scale cannot recur, we are implementing several improvements across our operational processes, monitoring strategy, and edge architecture.
Edge Architecture Redesign
We are re-architecting our ingress and edge routing strategy to eliminate Cloudflare as a single point of failure. This includes introducing multi-provider redundancy, improving automated failover capabilities, and enforcing service-level objectives that trigger autonomous routing changes when error or latency thresholds are exceeded.
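As a rough sketch of the direction (a simplified illustration, not the final design), an SLO-driven failover check might look like the following; the thresholds, window count, provider name, and shift_traffic hook are all hypothetical.

```python
# Hedged sketch: an SLO-style failover trigger. All thresholds, window sizes,
# provider names, and the shift_traffic() hook are hypothetical illustrations.
from dataclasses import dataclass

@dataclass
class EdgeSample:
    error_rate: float     # fraction of failed requests in the sample window
    p99_latency_ms: float

ERROR_RATE_SLO = 0.02     # fail over if >2% of requests error...
LATENCY_SLO_MS = 1500     # ...or p99 latency exceeds 1.5s
BREACH_WINDOWS = 3        # ...for 3 consecutive evaluation windows

def should_fail_over(samples: list) -> bool:
    """True when the last BREACH_WINDOWS samples all violate the SLO."""
    recent = samples[-BREACH_WINDOWS:]
    if len(recent) < BREACH_WINDOWS:
        return False
    return all(
        s.error_rate > ERROR_RATE_SLO or s.p99_latency_ms > LATENCY_SLO_MS
        for s in recent
    )

def shift_traffic(to_provider: str) -> None:
    # Hypothetical hook: e.g., update weighted DNS or a secondary CDN config.
    print(f"shifting ingress traffic to {to_provider}")

samples = [EdgeSample(0.65, 3200), EdgeSample(0.80, 4100), EdgeSample(1.00, 5000)]
if should_fail_over(samples):
    shift_traffic("secondary-edge-provider")
```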
Operational Readiness & Response Improvements
- We have updated our Support and Operations action plans to include clearer escalation paths and explicit procedures for cases where our status page or other external communication systems are degraded.
- We have validated and refined our disaster-recovery playbooks to ensure we can rapidly execute Cloudflare Edge bypass procedures.
Monitoring and Alerting Enhancements
We are reconfiguring our monitors to shorten detection windows and improve sensitivity to intermittent, region-specific failures. This will enable us to identify partial outages more quickly and make more informed decisions earlier during edge instability.
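The sketch below illustrates the kind of per-region evaluation this implies: each region's recent synthetic-check results are assessed independently, so an intermittent or regional failure is flagged even when the global average looks healthy. The window size, threshold, and region names are illustrative assumptions.

```python
# Hedged sketch: per-region rolling error-rate detection so intermittent or
# regional failures are flagged even when the global average looks healthy.
# Regions, window size, and threshold are illustrative assumptions.
from collections import defaultdict, deque

WINDOW = 20          # last 20 synthetic checks per region
THRESHOLD = 0.25     # flag a region if >25% of its recent checks failed

class RegionalErrorDetector:
    def __init__(self):
        self._results = defaultdict(lambda: deque(maxlen=WINDOW))

    def record(self, region: str, success: bool) -> None:
        self._results[region].append(success)

    def degraded_regions(self) -> list:
        degraded = []
        for region, checks in self._results.items():
            failures = sum(1 for ok in checks if not ok)
            if checks and failures / len(checks) > THRESHOLD:
                degraded.append(region)
        return degraded

detector = RegionalErrorDetector()
for ok in [True] * 15 + [False] * 5:   # region with a short failure burst
    detector.record("us-east-1", ok)
for ok in [True, False] * 10:          # region with intermittent failures
    detector.record("eu-west-1", ok)
print(detector.degraded_regions())     # ['eu-west-1'] under these assumptions
```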
Customer-Controlled Bypass Path
We are developing a fully supported, customer-accessible bypass option that can be activated during severe edge degradation. This will provide customers with a direct path to our infrastructure when needed, even if an edge provider is impaired, and will serve as a last-resort mechanism to ensure traffic can still reach us if other mitigation steps fail.
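To illustrate how this might look from the customer side (the bypass hostname, activation model, and helper below are purely hypothetical, since the mechanism is still being designed), a client could fall back to a dedicated bypass endpoint only after repeated edge-level failures:

```python
# Hedged sketch of a client-side fallback to a hypothetical bypass endpoint.
# "api-bypass.example.com" is a placeholder; the real mechanism is still in design.
import requests

PRIMARY = "https://api.basistheory.com"
BYPASS = "https://api-bypass.example.com"   # hypothetical last-resort endpoint
EDGE_ERRORS = {500, 502, 503, 504}

def get_with_bypass(path: str, headers: dict, attempts: int = 3) -> requests.Response:
    """Try the primary edge path first; fall back only on repeated 5xx errors."""
    for host in (PRIMARY, BYPASS):
        for _ in range(attempts):
            try:
                resp = requests.get(f"{host}{path}", headers=headers, timeout=5)
                if resp.status_code not in EDGE_ERRORS:
                    return resp
            except requests.RequestException:
                continue  # network-level failure; retry, then fall back
    raise RuntimeError("both primary and bypass paths failed")
```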
These changes aim to significantly reduce dependency concentration, accelerate detection and mitigation, and ensure reliable access to Basis Theory services even in the event of large-scale external vendor outages.
Updated
Nov 18 at 05:13pm UTC
Services have been fully restored. A full RCA will be provided within 24 hours.
Updated
Nov 18 at 04:25pm UTC
Nearly all traffic has been restored.
Less than 2% of traffic is still seeing intermittent 500 errors; our edge service provider is continuing to restore full traffic.
Updated
Nov 18 at 03:08pm UTC
We are seeing a large portion of our requests resolve and return successful responses.
We will continue to monitor the situation closely.
Updated
Nov 18 at 02:34pm UTC
We are seeing a large portion of our requests resolve and return successful responses.
We will continue to monitor the situation closely.
Updated
Nov 18 at 01:10pm UTC
Customers are still seeing elevated 500 errors.
We are continuing to investigate alternatives to bypass the edge routing outage.
Updated
Nov 18 at 12:52pm UTC
Our systems are still experiencing active degradation, and some customers are encountering intermittent issues.
We are monitoring the situation and actively working on a bypass.
Created
Nov 18 at 11:37am UTC
We are experiencing intermittent failures at our edge service provider, resulting in an increased number of 500 errors. We are monitoring the situation and will report back as soon as we have an update.
Currently, our core platform itself remains stable.