Downtime

Intermittent 500 Errors

Nov 18 at 11:37am UTC
Affected services
🌎 Elements
(🌎 Global) Portal
🇺🇸 API
🇪🇺 API

Resolved
Nov 20 at 12:37am UTC

Problem Description, Impact, and Resolution

At approximately 11:30 UTC on November 18, 2025, customers began experiencing failures when calling any Basis Theory service due to a global outage at our regional edge provider, Cloudflare. The outage impacted DNS resolution, edge routing, and WAF processing across Cloudflare’s network, preventing customer requests from reaching our public endpoints.

During the peak of the event, Cloudflare returned 500-series errors for 100% of inbound traffic, resulting in a complete outage across all products, including Vault, Card Management Services, Elements, and 3DS. Customers reliant on Basis Theory for payment collection or card processing were unable to complete those operations during the outage window. As Cloudflare’s network partially recovered at various points, some customers experienced intermittent success, leading to inconsistent behavior across geographies and increased retry traffic.

Our internal systems, including Vault and all supporting services, remained fully healthy and available throughout the incident, with no degradation in performance or capacity. However, because Cloudflare serves as the first-hop ingress for all customer traffic, the global failure of their DNS and routing layers prevented any requests from reaching our infrastructure.

During the incident, we followed our disaster recovery playbook and attempted to bypass Cloudflare by disabling Cloudflare edge processing and routing traffic directly to our AWS load balancers. These changes were unsuccessful, as Cloudflare’s degraded services appeared to prevent DNS updates from propagating externally. This left no viable alternative path to reroute production traffic during the outage.
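
To illustrate the general shape of this bypass step (not our exact playbook or tooling), the sketch below uses Cloudflare’s public v4 DNS API to overwrite the api.basistheory.com record with an unproxied (“DNS-only”) CNAME pointing at an AWS load balancer. The API token, zone ID, record ID, and load balancer hostname are placeholders.

    # Hedged sketch: disable Cloudflare proxying for a hostname so traffic resolves
    # directly to an origin load balancer. All IDs, tokens, and hostnames are placeholders.
    import os
    import requests

    CF_API = "https://api.cloudflare.com/client/v4"
    TOKEN = os.environ["CLOUDFLARE_API_TOKEN"]      # scoped DNS-edit token (placeholder)
    ZONE_ID = os.environ["CLOUDFLARE_ZONE_ID"]      # zone ID for basistheory.com (placeholder)
    RECORD_ID = os.environ["CLOUDFLARE_RECORD_ID"]  # record ID for api.basistheory.com (placeholder)

    def bypass_edge(origin_hostname: str) -> None:
        """Point the record at the origin and turn off Cloudflare proxying."""
        resp = requests.put(
            f"{CF_API}/zones/{ZONE_ID}/dns_records/{RECORD_ID}",
            headers={"Authorization": f"Bearer {TOKEN}"},
            json={
                "type": "CNAME",
                "name": "api.basistheory.com",
                "content": origin_hostname,  # e.g. an AWS ALB DNS name
                "proxied": False,            # DNS-only: skip Cloudflare's edge entirely
                "ttl": 60,                   # short TTL so the change takes effect quickly, if it propagates
            },
            timeout=10,
        )
        resp.raise_for_status()

    if __name__ == "__main__":
        bypass_edge("example-alb-1234567890.us-east-1.elb.amazonaws.com")  # placeholder ALB hostname

As the incident demonstrated, this path still depends on Cloudflare’s control plane and authoritative DNS being healthy enough to accept and propagate the change, which is part of why the redesign described under Future Prevention & Next Steps focuses on provider-independent alternatives.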

Cloudflare’s global services began to gradually recover starting at 14:30 UTC, leading to a steady decline in error rates. By 17:30 UTC, Cloudflare’s network had fully stabilized, and all traffic to Basis Theory was routing normally, with request success rates returning to 100%. No additional corrective actions were required on our side once Cloudflare restored its global network, and no further customer impact was observed.

To prevent this type of outage in the future, we are redesigning our edge architecture to eliminate Cloudflare as a single point of failure, strengthening our operational playbooks for faster edge bypass during vendor outages, and enhancing our monitoring to detect regional or intermittent failures more promptly.

Detailed Timeline of Events

  • 2025-11-18
    • 11:30 (UTC) - Cloudflare begins having issues routing and serving requests across all Basis Theory domains (js.basistheory.com, api.basistheory.com, 3ds.basistheory.com).
    • 11:57 (UTC) - First internal page fired due to synthetic test failures.
    • 12:07 (UTC) - Some calls began sporadically succeeding across all products (intermittent recovery).
    • 12:23 (UTC) - Vault traffic degraded; ~50% success rate observed.
    • 12:47 (UTC) - Calls again began sporadically succeeding across all products (another period of intermittent recovery).
    • 13:00 (UTC) - Full global outage; 100% of customer traffic failing.
      • Team initiates disaster recovery efforts under the assumption that Cloudflare’s load balancer and routing layers are failing. Plan established to bypass Cloudflare entirely.
    • 13:40 (UTC) - Attempted mitigation: updated the Cloudflare DNS record for api.basistheory.com to route traffic directly to AWS.
      • The assumption was that this would restore ~80% of US traffic, but no improvement was observed. The change was deemed to have had no effect, likely because Cloudflare’s DNS updates were not fully propagating due to their degraded edge.
    • 14:08 (UTC) - Cloudflare routing re-enabled; error rates spike due to system retries and customer retry logic.
    • 14:30 (UTC) - Cloudflare systems began to recover. Error rates drop below 50%.
    • 14:40 (UTC) - Error rates drop below 15% and continue declining over the next 40 minutes.
    • 17:30 (UTC) - All customer requests succeeding; service fully restored.

Root Cause Explanation

Cloudflare experienced a global network failure affecting DNS resolution, edge routing, WAF processing, and global load balancing. Cloudflare’s incident summary is available at https://blog.cloudflare.com/18-november-2025-outage/; however, the root cause of the Cloudflare outage does not change the fact that we have a single point of failure in our edge routing. Below is a description of why this outage caused a Basis Theory outage.

Basis Theory’s architecture relies on Cloudflare for:

  • Public DNS
  • Global CDN
  • WAF
  • Traffic steering & routing
  • Load balancing

Because Cloudflare is the exclusive ingress path for all customer traffic, their global outage made all Basis Theory products unreachable, even though our AWS infrastructure remained fully healthy.

Efforts to bypass Cloudflare through DNS changes were unsuccessful because Cloudflare’s internal DNS services were also degraded and unable to propagate changes. This prevented traffic from being routed directly to AWS.
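
As an illustration of how that propagation failure can be confirmed from outside the provider’s network, the sketch below queries independent public resolvers and checks whether a record points at the intended origin. It assumes the dnspython package, and the expected target hostname is a placeholder.

    # Hedged sketch: verify whether a DNS change is visible from resolvers outside
    # Cloudflare's network. The expected origin hostname below is a placeholder.
    import dns.resolver  # pip install dnspython

    PUBLIC_RESOLVERS = {"Google": "8.8.8.8", "Quad9": "9.9.9.9", "OpenDNS": "208.67.222.222"}

    def check_propagation(hostname: str, expected_target: str) -> None:
        for name, ip in PUBLIC_RESOLVERS.items():
            resolver = dns.resolver.Resolver(configure=False)
            resolver.nameservers = [ip]
            resolver.lifetime = 5.0
            try:
                answers = resolver.resolve(hostname, "CNAME")
                targets = {str(rr.target).rstrip(".") for rr in answers}
            except Exception as exc:  # NXDOMAIN, timeout, no CNAME answer, etc.
                print(f"{name}: lookup failed ({exc})")
                continue
            status = "propagated" if expected_target in targets else "NOT propagated"
            print(f"{name}: {sorted(targets)} -> {status}")

    if __name__ == "__main__":
        check_propagation(
            "api.basistheory.com",
            "example-alb-1234567890.us-east-1.elb.amazonaws.com",  # placeholder origin
        )

A check like this distinguishes “the change was rejected” from “the change is not yet visible outside the provider,” which is the distinction that matters when the provider’s own control plane is degraded.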

As a result, Cloudflare became a critical single point of failure, and Basis Theory had no alternative routing path available during the incident.

What Worked and What Didn’t

What Worked

  • Internal alerting correctly detected failures across all products
  • Core infrastructure remained healthy and fully operational
  • Internal traffic and health checks confirmed backend health throughout

What Didn’t Work

  • Synthetic monitoring provided false confidence early in the event due to intermittent Cloudflare recoveries
  • DNS proxy-disable and fallback routing changes could not propagate during Cloudflare’s global degradation
  • No independent DNS provider or edge-bypass path existed to re-route traffic during Cloudflare failure
  • Cloudflare’s full ownership of DNS + routing + proxying created complete ingress lock-in
  • Basis Theory’s status page was also impacted by these issues, delaying our communication to customers about the impact of the outage.

Future Prevention & Next Steps

To ensure an outage of this scale cannot recur, we are implementing several improvements across our operational processes, monitoring strategy, and edge architecture.

  1. Edge Architecture Redesign

    We are re-architecting our ingress and edge routing strategy to eliminate Cloudflare as a single point of failure. This includes introducing multi-provider redundancy, improving automated failover capabilities, and enforcing service-level objectives that trigger autonomous routing changes when error or latency thresholds are exceeded.

  2. Operational Readiness & Response Improvements

    • We have updated our Support and Operations action plans to include clearer escalation paths and explicit procedures for cases where our status page or other external communication systems are degraded.
    • We have validated and refined our disaster-recovery playbooks to ensure we can rapidly execute Cloudflare Edge bypass procedures.
  3. Monitoring and Alerting Enhancements

    We are reconfiguring our monitors to shorten detection windows and improve sensitivity to intermittent, region-specific failures. This will enable us to identify partial outages more quickly and make more informed decisions earlier during edge instability (a rough sketch of this kind of per-region check follows this list).

  4. Customer-Controlled Bypass Path

    We are developing a fully supported, customer-accessible bypass option that can be activated during severe edge degradation. This will provide customers with a direct path to our infrastructure when needed, even if an edge provider is impaired, and will serve as a last-resort mechanism to ensure traffic can still reach us if other mitigation steps fail.
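
As a rough illustration of the monitoring and failover direction described in items 1 and 3 above, the sketch below runs per-region synthetic probes against a health endpoint and flags any region whose rolling error rate crosses a threshold; such a signal could page on-call or drive automated rerouting. The health URLs, the EU hostname, the window size, and the threshold are illustrative assumptions rather than our production configuration.

    # Hedged sketch: per-region synthetic probes with a rolling error-rate threshold.
    # Endpoints, regions, window size, and threshold are illustrative placeholders.
    from collections import defaultdict, deque
    import requests

    REGION_ENDPOINTS = {  # hypothetical per-region health-check URLs
        "us": "https://api.basistheory.com/health",
        "eu": "https://api.eu.basistheory.com/health",
    }
    WINDOW = 20             # rolling window of recent probes per region
    ERROR_THRESHOLD = 0.25  # flag a region when >25% of recent probes fail

    results = defaultdict(lambda: deque(maxlen=WINDOW))

    def probe(region: str, url: str) -> None:
        """Record one synthetic probe result (True = failure) for a region."""
        try:
            failed = requests.get(url, timeout=3).status_code >= 500
        except requests.RequestException:
            failed = True
        results[region].append(failed)

    def degraded_regions() -> list[str]:
        """Return regions whose rolling error rate exceeds the threshold."""
        return [
            region
            for region, window in results.items()
            if len(window) >= WINDOW // 2 and sum(window) / len(window) > ERROR_THRESHOLD
        ]

    if __name__ == "__main__":
        for region, url in REGION_ENDPOINTS.items():
            for _ in range(WINDOW):
                probe(region, url)
        print("Degraded regions:", degraded_regions() or "none")

In a production setup, the per-region error rates would feed alerting and, past a stricter threshold, automated failover to a secondary ingress path rather than a simple print statement.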

These changes aim to significantly reduce dependency concentration, accelerate detection and mitigation, and ensure reliable access to Basis Theory services even in the event of large-scale external vendor outages.

Updated
Nov 18 at 05:13pm UTC

Services have been fully restored. A full RCA will be provided within 24 hours.

Updated
Nov 18 at 04:25pm UTC

Nearly all traffic has been restored.

Less than 2% of traffic is still seeing intermittent 500 errors; our edge service provider is continuing to restore full traffic.

Updated
Nov 18 at 03:08pm UTC

We are seeing a large portion of our requests resolve and return successful responses.

We will continue to monitor the situation closely.

Updated
Nov 18 at 02:34pm UTC

We are seeing a large portion of our requests resolve and return successful responses.

We will continue to monitor the situation closely.

Updated
Nov 18 at 01:10pm UTC

Customers are still seeing elevated 500 errors.

We are continuing to investigate alternative routing paths around the edge provider outage.

Updated
Nov 18 at 12:52pm UTC

Our systems are still experiencing active degradation, and some customers are encountering intermittent issues.

We are monitoring the situation and actively working on a bypass.

Created
Nov 18 at 11:37am UTC

We are experiencing intermittent failures at our edge service provider, resulting in an increased number of 500 errors. We are monitoring the situation and will report back as soon as we have an update.

Currently, we are seeing a stable platform.