Surviving CDN & Cloud Outages: An Incident Response Playbook


bitbox
2026-02-04
11 min read

An executable incident runbook for simultaneous AWS, Cloudflare and platform outages — DNS failover, cache warmup, traffic steering, and comms.


When AWS, Cloudflare, and other major platforms go down at the same time, engineering teams face the worst case: fragmented toolchains, skyrocketing customer incidents, and finger-pointing between vendors. This playbook gives you an executable incident runbook — from DNS failover to cache warmup, traffic rerouting, and stakeholder communications — tuned for the realities of 2026.

Executive summary — act now, reduce blast radius

In 2026, multi-vector outages and third-party dependency failures are no longer rare exceptions; plan for them as a matter of course. Recent trends in late 2025 and early 2026 show outages increasingly affecting multiple providers simultaneously because of cascading dependencies, misconfigurations, and BGP-level events. The first 30 minutes matter. This runbook prioritizes rapid, low-risk mitigations you can script and rehearse today.

Incident classification & quick decision matrix

Before executing any change, classify the outage and pick one of three modes. Each mode defines an immediate checklist you can run in parallel.

  • Provider partial outage — Single vendor (e.g., Cloudflare proxy layer) showing degraded performance but DNS and origin reachable. Goal: bypass the failing layer and preserve availability.
  • Provider widespread outage — DNS or network-level problem affecting many regions (e.g., an AWS control-plane or major CDN outage). Goal: activate multi-provider failover and preserve core services.
  • Simultaneous multi-platform outage — Multiple major providers are degraded. Goal: limit impact using preprovisioned fallbacks (secondary DNS, geo-failover, scaled origins) and focus on communications and containment.

Roles & responsibilities (15–30 second assignment)

Keep titles short and responsibilities explicit. During incidents, assign by name and not by team.

  • Incident Commander (IC) — Single decision owner. Calls go/no-go, escalates to execs.
  • SRE/Network Lead — Executes traffic changes (DNS, LB, BGP announcements), validates health checks.
  • Platform/Dev Lead — Coordinates origin scaling, cache warming, and releases if needed.
  • Comms Lead — Publishes status pages, vendor escalations, internal updates.
  • Security Lead — Validates that mitigation does not bypass critical security controls.

Pre-incident preparation (non-negotiable)

The following are preconditions you must have automated and tested. If you don't, prioritize them after the incident; any response without them is slower and riskier.

  1. Multi-authoritative DNS — Two independent authoritative DNS providers (example: Route 53 + NS1). Use DNSSEC-safe secondary configurations and practice failover. Keep Primary/Secondary TTLs at 60–300s for critical records.
  2. Low-TTL critical records — API endpoints, login pages, and static assets should have 60–300s TTL in normal ops; during high-risk windows you can set to 30s. Document the TTL change process and automation.
  3. Pre-provisioned secondary origins — A hot/warm origin in another provider or region, with origin authentication keys in a secrets manager ready to apply.
  4. Scripted DNS playbooks — CLI scripts (aws route53 change-resource-record-sets, ns1 API calls) that can be executed with one command and audited. Consider storing these as micro-runbooks or templates from a micro-app template pack so they are auditable and repeatable. A minimal wrapper sketch follows this list.
  5. Cache-warmup tools and S3 mirror — A job that can prefetch top-10k URIs across PoPs and a readable S3/Blob mirror for static assets.
  6. Runbook tests — Quarterly tabletop exercises and at least one live failover test per year with stakeholders present.
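
As a concrete example of item 4, a one-command failover wrapper might look like the sketch below. It assumes Route 53; the zone ID, change-batch file name, and audit-log path are placeholders you would replace with your own pre-approved values.

#!/usr/bin/env bash
# One-command DNS failover with a simple audit trail (sketch).
# ZONE_ID and BATCH are placeholders for your pre-approved values.
set -euo pipefail
ZONE_ID="Z123"
BATCH="failover.json"
CHANGE_ID=$(aws route53 change-resource-record-sets \
  --hosted-zone-id "$ZONE_ID" --change-batch "file://$BATCH" \
  --query 'ChangeInfo.Id' --output text)
echo "$(date -u +%FT%TZ) $USER applied $BATCH change=$CHANGE_ID" >> failover-audit.log
echo "Submitted change $CHANGE_ID"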

Immediate 0–10 minute actions (stabilize and communicate)

Start triage immediately. Keep actions atomic, reversible, and automatable. Use the two-minute rule: if a recovery step cannot be safely executed and reversed in two minutes, defer it and pick the next step.

  1. Declare incident — IC declares severity (sev1/sev2), starts an incident channel, and posts a 1-line status to execs: scope, impact, and next update time.
  2. Enable vendor support escalation — Open emergency tickets and request senior engineer engagement. Get ticket numbers and expected response SLAs. Maintain a list of pre-approved contacts to reduce vendor onboarding friction.
  3. Collect telemetry — Pull graph snapshots: 5xx rate, latency histograms, traffic volume, CDN cache-hit ratio, and DNS query volume. Store these in incident storage immediately. Instrumentation and guardrails matter; see an example of cost-aware telemetry work in the field (how we reduced query spend).
  4. Assess provider impact — Confirm whether your authoritative DNS is responding globally. Use external checkers (e.g., RIPE Atlas, public dig from multiple regions, DownDetector signals) to map scope; a quick multi-resolver dig loop is sketched after this list.
  5. Announce public status — Comms Lead posts an initial status page and social message acknowledging the issue and that you are investigating. Keep language factual and avoid blame.
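
One quick way to run the external checks in step 4 is to resolve the affected hostname through several public resolvers; an empty or inconsistent answer points to a resolution problem on that path. The hostname below is a placeholder.

#!/usr/bin/env bash
# Quick scope check: resolve the record via several public resolvers.
# An empty answer or a timeout suggests resolution is failing on that path.
for resolver in 8.8.8.8 1.1.1.1 9.9.9.9; do
  echo "== resolver $resolver =="
  dig +time=2 +tries=1 @"$resolver" www.example.com A +short
done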

10–30 minutes: DNS failover and traffic steering

If the outage affects CDN edge or DNS, act fast to reroute traffic. The simplest, safest, fastest changes are DNS changes — provided you have prepared multi-provider DNS.

When Cloudflare proxy (CDN) is down but origin and DNS work

Bypass the CDN proxy if Cloudflare’s proxy layer is degraded but its DNS service is okay.

  1. Switch records from proxied to DNS-only (if using Cloudflare): flip the orange cloud to gray in the dashboard or via the API (see the sketch after this list) to remove Cloudflare’s reverse proxy and serve directly from origin.
  2. Ensure origin accepts direct traffic (Host header, TLS certificate, rate limits). If TLS is issued by Cloudflare, you must have a cert for the origin already installed.
  3. Shorten TTLs and monitor 5xx/latency spikes.
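
A minimal sketch of step 1 via the Cloudflare API is shown below; the zone ID, record ID, and token are placeholders, and you should confirm the endpoint against Cloudflare's current API docs before relying on it.

# Flip one record from proxied (orange cloud) to DNS-only (gray cloud).
# CF_ZONE_ID, CF_RECORD_ID, and CF_API_TOKEN are placeholders.
curl -sS -X PATCH \
  "https://api.cloudflare.com/client/v4/zones/$CF_ZONE_ID/dns_records/$CF_RECORD_ID" \
  -H "Authorization: Bearer $CF_API_TOKEN" \
  -H "Content-Type: application/json" \
  --data '{"proxied": false}'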

When DNS or Cloud provider control plane is down

If authoritative DNS or a key cloud provider is down, activate secondary DNS and traffic steering.

  1. Activate secondary DNS via your preconfigured provider. Example: using AWS Route 53 to set weighted records pointing to a secondary origin. Execute pre-approved CLI script (example below).
  2. Use geo-steering or weighted failover to move traffic away from affected regions/providers. For large customers, consider moving to a secondary cloud's load balancer (e.g., GCP or Azure) with pre-provisioned backend pools. Use a traffic manager that supports PoP-level decisions and geo-steering / micro-map orchestration when you need granular regional control.
  3. Monitor propagation — Use global dig checks and RIPE/Atlas to validate the change. Adjust TTLs only if necessary.

Example Route 53 change (weighted failover):

{
  "Comment": "Weighted failover to secondary origin",
  "Changes": [
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "www.example.com.",
        "Type": "A",
        "SetIdentifier": "primary-origin",
        "Weight": 0,
        "TTL": 60,
        "ResourceRecords": [{"Value": "203.0.113.10"}]
      }
    },
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "www.example.com.",
        "Type": "A",
        "SetIdentifier": "secondary-origin",
        "Weight": 100,
        "TTL": 60,
        "ResourceRecords": [{"Value": "198.51.100.20"}]
      }
    }
  ]
}

Execute with: aws route53 change-resource-record-sets --hosted-zone-id Z123 --change-batch file://failover.json
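
Before trusting external checks, you can wait for Route 53 itself to report the change as propagated. This assumes you captured the change ID returned by change-resource-record-sets (for example into $CHANGE_ID, as in the wrapper sketch earlier).

# Block until Route 53 reports the change as INSYNC, then print the status.
aws route53 wait resource-record-sets-changed --id "$CHANGE_ID"
aws route53 get-change --id "$CHANGE_ID" --query 'ChangeInfo.Status' --output text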

When BGP-level or IP anycast issues occur

BGP announcements and ASN-level actions are high-risk and should be pre-approved. If you operate your own IP space, activate alternate announcements from a secondary ASN or cloud partner. If not, prefer DNS-based steering or traffic managers. For organizations that need operational playbooks and pre-approved change windows, see guidance on broader operational controls and pre-approval processes (operational playbook and pre-approval patterns).

30–90 minutes: Cache warmup and origin scaling

Once traffic is being directed to your chosen path, reduce latency by warming caches and protecting origin capacity.

  1. Evaluate cache hit ratios — If your CDN is available partially, check PoP-level hit ratios. Low hit rates require immediate warming.
  2. Warm caches — Run a controlled prefetch of the top N URIs while respecting origin rate limits. Use distributed runners (Lambda@Edge, Cloud Functions in multiple regions, or small VMs) to prime PoPs and avoid origin overload.
  3. Activate origin autoscaling — Increase instance counts and bursting capacity. Apply burst protection (rate-limits, request queueing) to prevent cascading failures.
  4. Enable stale content policies — Serve stale content (Cache-Control: stale-while-revalidate, stale-if-error) until caches are fully warmed.
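
A quick way to confirm step 4 took effect is to inspect the Cache-Control header your edge returns; the URL and directive values below are illustrative, not recommendations.

# Check that stale-serving directives are present on a cached asset.
curl -sI https://www.example.com/assets/app.js | grep -i cache-control
# expected shape: cache-control: max-age=300, stale-while-revalidate=600, stale-if-error=86400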

Sample cache-warmup approach:

  • Export the top 5k URIs by traffic from logs (last 72 hours).
  • Shard the list across 50 regional workers and fetch with HEAD then GET for small payloads.
  • Throttle each worker to avoid >5 RPS to the origin per IP.

#!/usr/bin/env bash
# Warmup worker: each regional worker runs this against its own shard,
# e.g. shards produced with: split -n l/50 top-uris.txt shard-
SHARD_FILE="$1"
while read -r uri; do
  # Lightweight request per URI; swap -I for a small GET if your CDN does not cache on HEAD
  curl -sS -o /dev/null -I -H "User-Agent: cache-warmup" "https://www.example.com${uri}"
  sleep 0.2   # ~5 requests per second per worker
done < "$SHARD_FILE"

90–180 minutes: Stabilize, iterate, and communicate

With traffic rerouted and caches warming, focus on stability and transparent communication.

  • Continuous validation: Run synthetic transactions (login, checkout) from multiple regions and publish green/red status to internal dashboards; a minimal probe sketch follows this list.
  • Gradual rollback: If performance is poor, revert to previous DNS/LB state; do rollbacks during low-traffic windows if possible.
  • Public updates: Comms Lead posts updates at predictable intervals (15/30/60 mins initially, then hourly). Provide impact, mitigations, ETA, and what customers should expect.
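
A minimal probe for the continuous-validation bullet might look like the sketch below; the URLs are placeholders, and in practice you would run it from several regions (cron on small VMs, CI runners) and feed the output into your dashboards. Full synthetic transactions such as login or checkout need scripted sessions on top of this.

#!/usr/bin/env bash
# Minimal synthetic probe: green/red per endpoint based on HTTP status.
for url in https://www.example.com/healthz https://www.example.com/login; do
  code=$(curl -sS -o /dev/null -w '%{http_code}' --max-time 5 "$url")
  [ "$code" = "200" ] && status=GREEN || status=RED
  echo "$(date -u +%FT%TZ) $url http=$code $status"
done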

Post-incident: Postmortem and vendor SLA actions

After service is restored, perform a rigorous postmortem. Include vendor interactions and request SLA credits where applicable.

Postmortem checklist

  1. Timeline — Second-by-second timeline of detection, decisions, and changes (annotated with UTC timestamps).
  2. Root cause analysis — Distinguish root cause vs contributing factors; include vendor-provided RCA and any internal misconfigurations.
  3. Data collection — Preserve logs, pcap captures (if applicable), configuration states, and API call histories. Keep backups and incident artifacts in an offline-first incident store to avoid losing evidence if cloud consoles are unreachable.
  4. Action items — Specific, assigned remediation tasks with owners and deadlines (e.g., add secondary DNS by Q2, implement automated failover scripts by next sprint).
  5. SLA claims — Document outage duration, affected SLAs, and prepare formal claims with vendors. Attach timelines and impact metrics. For procurement and buying-side implications of incident response contracts, see the recent public procurement draft guidance.

Key metrics to compute

  • MTTD (mean time to detect)
  • MTTR (mean time to recover)
  • SLA breach time and expected credits
  • Peak error rate and user-impacted sessions
  • Traffic delta and cache hit ratio change

Advanced strategies for 2026 and beyond

Given current trends, teams must move beyond ad-hoc fixes to resilient architectures.

1. Multi-authority DNS with automated reconciliation

Use DNS providers that offer programmatic secondary failover and APIs. Implement a reconciliation service that ensures DNS records across providers stay in sync and provide an audit trail for changes.
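
A full reconciliation service can be elaborate, but even a scheduled consistency check catches drift early. The sketch below compares answers from both providers' authoritative servers; the record and nameserver names are placeholders.

#!/usr/bin/env bash
# Cross-provider consistency check: alert when the two authorities disagree.
RECORD="www.example.com"
NS_A="ns-123.awsdns-45.com"   # placeholder Route 53 nameserver
NS_B="dns1.p01.nsone.net"     # placeholder NS1 nameserver
ANS_A=$(dig +short @"$NS_A" "$RECORD" A | sort)
ANS_B=$(dig +short @"$NS_B" "$RECORD" A | sort)
if [ "$ANS_A" != "$ANS_B" ]; then
  echo "DRIFT: $RECORD differs between DNS providers" >&2
  printf 'provider A:\n%s\nprovider B:\n%s\n' "$ANS_A" "$ANS_B" >&2
fi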

2. Multi-edge strategy

Don't rely on a single CDN. Use a traffic manager to split traffic across Fastly, Cloudflare, and Akamai with health checks and automated failover. In 2026, multi-edge orchestration SDKs and vendors have emerged that let you programmatically switch traffic at PoP granularity; adopt them for high-value services.

3. Origin resilience & mesh

Run hot/warm origins in at least two clouds or use neutral hosting with IP transit to multiple CDNs. Maintain consistent configuration via GitOps and secrets in central stores to enable rapid promotion.

4. BGP preparedness (if you own IP space)

If you run your own IP addresses, create pre-authorized partner peering for emergency announcements. Practice BGP change windows in non-critical times and maintain a trusted list of partner engineers who can announce on your behalf.

5. Observability-first failover automation

Automate failover triggers with clear guardrails: a failover policy should require at least two independent telemetry signals (e.g., 5xx surge + region-wide DNS failures) to prevent flip-flopping during transient errors. Balance automation with human oversight and the lessons from trust-and-automation discussions (trust, automation, and human editors).
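
As a sketch of that guardrail, the script below only proceeds when two independent checks both breach their thresholds; check_5xx_rate, check_dns_health, and dns-failover.sh are hypothetical helpers, not real tools.

#!/usr/bin/env bash
# Two-signal guardrail: the hypothetical helpers exit non-zero when breached.
if ! ./check_5xx_rate --threshold 5 && ! ./check_dns_health --min-failing-regions 3; then
  echo "$(date -u +%FT%TZ) both signals breached; running pre-approved failover" | tee -a failover-audit.log
  ./dns-failover.sh
else
  echo "only one (or neither) signal breached; holding and paging a human"
fi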

Communication templates & cadence

Clarity and cadence reduce user frustration and executive pressure.

Initial public status (first 10 minutes)

We are investigating a service degradation impacting specific regions for www.example.com. Our engineering team is actively working with providers. Next update in 30 minutes. — Status page

15–30 minute internal update

IC posts: impact summary, actions taken, next steps, and required approvals. Use bullet points and attach the incident log link.

Resolution update

All services have been restored. We routed traffic to the secondary origin and warmed caches. A full postmortem will be published in 72 hours with remediation actions. — Status page

Real-world example (anonymized)

During a late-2025 multi-provider control-plane outage, a mid-market SaaS provider used this runbook to reduce customer impact from 3 hours to 18 minutes. Key steps they executed: switch to secondary DNS (automated), flip CDN to DNS-only for core endpoints, and run a 10k-URI warmup from 30 regional workers. Their postmortem showed that pre-provisioned origin certs and a documented rollback plan were the difference between a sev1 and a manageable incident.

Common pitfalls and how to avoid them

  • Long TTLs — A TTL of hours prevents fast failover. Keep TTLs on critical records low, and automate dropping them further ahead of high-risk windows.
  • Unprepared origin certs — If origin TLS relies only on the CDN's cert, direct traffic fails. Always deploy certificates to origins.
  • Manual-only processes — Manual DNS edits or BGP plays are error-prone. Automate and test.
  • No vendor escalation path — Ensure you have vendor escalation contacts and recorded support-level agreements.

Checklist: 20-minute executable runbook

  1. Declare incident and assign IC (0–1 min)
  2. Open vendor escalations and collect ticket IDs (0–3 min)
  3. Run prewritten DNS failover script to secondary DNS (1–5 min)
  4. Toggle CDN proxy to DNS-only if applicable (2–6 min)
  5. Start cache warmup job on distributed workers (5–20 min)
  6. Scale origin pool and enable rate-limits (5–15 min)
  7. Publish public status and update every 30 minutes (5–20 min)

Final takeaways

Outages that touch DNS, CDN, and cloud providers simultaneously are no longer hypothetical. The right combination of preparation (multi-authoritative DNS, pre-provisioned origins, certificate readiness), automation (scripted DNS/LB playbooks, cache-warmup jobs), and disciplined communications will reduce MTTR and customer impact dramatically.

Actionable next steps: run a chaired failover drill this quarter; ensure critical records have low TTL and are mirrored across two DNS providers; preinstall TLS certs on origins; and build a one-command DNS failover script with audit logging.

Call to action

If you manage production traffic, don't wait for the next headline outage. Download our ready-to-run 20-minute incident playbook and run a live tabletop this month. Contact bitbox.cloud for an incident readiness review and automated DNS failover implementation tailored to your stack.
