Incident Response for Domains: What to Do When an External Provider Breaks Your Site


crazydomains
2026-01-24 12:00:00
10 min read

Runbook for domain admins: triage, DNS mitigations, SSL checks, comms templates, and rollbacks during third‑party outages.

Your site is down and the provider says it’s their problem. Now what?

Nothing is more stomach‑tightening than a midnight alert that your app is unreachable and the error points to an external provider. Third‑party outages remain inevitable in 2026: major edge and DNS providers still fail, and the spikes in late 2025 and January 2026 showed how quickly dependencies cascade. This runbook is built for domain and hosting admins who need pragmatic, repeatable steps to triage, mitigate, communicate, and recover when someone else breaks your site.

Top‑level playbook overview

Follow the inverted pyramid: prioritize restoring customer-facing traffic, keep stakeholders informed, prevent repeated flips, then do root cause analysis. These are the core phases.

  1. Immediate triage (0-15 minutes): Identify the blast radius and confirm it is a third‑party outage.
  2. Mitigate (15-60 minutes): Use DNS, CDN, and certificate workarounds to restore reachability.
  3. Communicate (ongoing): Publish status updates and internal incident notes on a cadence.
  4. Stabilize and rollback (1-4 hours): Roll back faulty changes or route around the provider.
  5. Postmortem (24-72 hours): Document root cause, gaps, and long‑term fixes like multi‑provider failover.

Why this matters in 2026

Edge computing, managed DNS, and CDN services have become even more central to site architecture. That centralization reduces ops friction but increases systemic risk. Late 2025 and early 2026 outages reminded teams that relying on a single provider is a single point of failure. Trends to account for:

  • Wider adoption of multi‑CDN and multi‑DNS strategies to meet stronger SLOs.
  • Increased use of DNS‑over‑HTTPS (DoH) clients, which changes how TTLs and propagation behave for some user agents.
  • Automated certificate issuance via ACME remains dominant, but certificate distribution models now include zero‑touch edge certs, which require new checks during outages.
  • APIs for DNS and hosting are now the primary path for automated failover; manual dashboards are too slow. See our guidance on APIs for DNS and hosting for scripting emergency flows.

Immediate triage checklist (0-15 minutes)

Start with data, not assumptions. Use these quick checks to confirm whether the issue is provider side or your origin.

  1. Confirm the blast radius
    • Check your synthetic monitoring and real user monitoring dashboards for geographies and error classes.
    • Check public outage aggregators and social signals for the provider named in errors. Recent high‑visibility outages happened in January 2026 and were correlated across edge and DNS providers.
  2. Basic network and DNS checks
    dig +short yourdomain.com A @8.8.8.8
    dig +short yourdomain.com NS
    curl -I https://yourdomain.com --max-time 10
    traceroute -n yourdomain.com

    Look for DNS timeouts, unexpected name servers, or HTTP connection failures at the network layer. A quick sweep across several public resolvers (sketched after this checklist) helps confirm whether failures are global or limited to some resolution paths.

  3. Certificate check
    echo | openssl s_client -connect yourdomain.com:443 -servername yourdomain.com 2>/dev/null | openssl x509 -noout -dates

    Is the cert valid? Are OCSP stapling errors or expired certs present?

  4. Check provider status
    • Open the provider status page and API; check recent incident messages and maintenance notices.
    • If their status page is down, cross‑reference Twitter/X and community channels for confirmation.
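
To confirm whether resolution is failing everywhere or only from some vantage points, a quick sweep of public resolvers helps. This is a minimal sketch; the resolver list and record type are assumptions you can adapt.

# Sweep a few public resolvers to see whether DNS failure is global or partial
for resolver in 8.8.8.8 1.1.1.1 9.9.9.9; do
  echo "== resolver $resolver =="
  dig +short +time=3 +tries=1 @"$resolver" yourdomain.com A || echo "no reply from $resolver"
done
# An empty block means the resolver answered but returned no A record; a timeout points at the resolution path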

Decision tree: Is this really the provider?

Quick decision flow to decide next steps.

  1. If DNS lookups fail across multiple resolvers and provider status shows an outage, treat as provider DNS outage.
  2. If DNS is correct but HTTP/TLS fails with TCP connection errors and provider status shows CDN/edge outage, treat as edge outage.
  3. If only your origin is failing and DNS/CDN show healthy, treat this as origin outage and follow host recovery path.
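
The first two branches can be roughly scripted from dig and curl results. The sketch below is a coarse classifier, not a replacement for the checks above; the resolver choice, timeouts, and the treatment of HTTP errors are assumptions.

# Rough classification: DNS outage vs. edge/TLS outage vs. origin problem
if ! dig +short +time=3 +tries=1 @8.8.8.8 yourdomain.com A | grep -q .; then
  echo "DNS lookups failing -> treat as provider DNS outage (confirm against provider status)"
elif ! curl -fsS -o /dev/null --max-time 10 https://yourdomain.com/; then
  echo "HTTPS failing (connection, TLS, or 5xx) -> suspect CDN/edge, TLS, or origin reachability"
else
  echo "DNS and HTTPS healthy from this vantage point -> investigate origin and application layer"
fi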

Mitigation options (15-60 minutes)

Pick the least invasive option that restores reachability. Try these in order until the site is reachable.

1. Use a secondary DNS provider for emergency failover

If you already have secondary DNS or an API‑enabled secondary provider, promote the secondary zone or update the nameserver delegation (and glue records, if your nameservers live inside the zone), depending on what your registrar supports. For most teams the fastest path is updating the zone at the secondary DNS provider and switching the registrar's nameserver settings, if the registrar allows quick changes via API.

  • Automate this: keep scripts that swap NS records via your registrar's API and monitor propagation (a sketch follows after this list).
  • Reduce risk: keep TTLs for critical records at 60-300 seconds during normal operations if you plan rapid failover.
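
As a sketch of what that automation can look like, the call below delegates the domain to a secondary provider through a generic registrar API. The endpoint, payload shape, and REGISTRAR_TOKEN variable are hypothetical; adapt them to your registrar's actual API.

# Hypothetical registrar API call: delegate the domain to the secondary provider's nameservers.
# Endpoint, payload, and token are illustrative only; check your registrar's API docs.
curl -sS -X PUT "https://api.registrar.example/v1/domains/yourdomain.com/nameservers" \
  -H "Authorization: Bearer $REGISTRAR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"nameservers": ["ns1.secondary-dns.example", "ns2.secondary-dns.example"]}'

# Watch the delegation change appear at the parent zone (.com shown here)
dig +short yourdomain.com NS @a.gtld-servers.net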

2. Point traffic to origin directly via an emergency A/ALIAS record

If the CDN or edge is down and the origin can safely receive traffic from the public internet, add an A or ALIAS record pointing to the origin IP. Account for Host header and SNI requirements for virtual hosts.

# Confirm what public resolvers currently return before changing anything
dig +short @1.1.1.1 yourdomain.com A
# Then update DNS via your provider's API to set a low-TTL A record to the origin IP
# (a hypothetical API example follows below)

Risks: you may expose origin IP, and you might bypass rate limits and WAF protections.
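
To make the cutover scriptable, a call along these lines can create the emergency record. The endpoint, zone ID, token, and origin IP below are placeholders, not a real provider's API.

# Hypothetical DNS provider API call: create a low-TTL emergency A record pointing at the origin.
# Endpoint, ZONE_ID, DNS_TOKEN, and the origin IP (203.0.113.10) are placeholders.
curl -sS -X POST "https://api.dns-provider.example/v1/zones/ZONE_ID/records" \
  -H "Authorization: Bearer $DNS_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"type": "A", "name": "yourdomain.com", "content": "203.0.113.10", "ttl": 60}'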

3. Shorten TTLs and rely on low‑TTL records

If you can anticipate switching providers, keep critical records with moderate TTLs. If you need agility, set TTL to 60-300 seconds during incident windows, then raise after stability.
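
To see how quickly a change will take effect, watch the TTL resolvers report on cached answers; it counts down toward zero on repeated queries.

# The second column is the remaining TTL in the resolver's cache; repeat the query to watch it count down
dig +noall +answer @8.8.8.8 yourdomain.com A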

4. Bypass failing provider with an alternate CDN or multi‑CDN routing

If you have multi‑CDN configured, shift traffic using your load balancer or DNS traffic steering. If you don't have multi‑CDN but have an alternate provider ready, update DNS records to point to the alternate provider. Use weighted routing to reduce risk during cutover. For practical notes on multi‑provider setups and cost tradeoffs, see edge caching & cost control.
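
A weighted cutover might be driven by a call like the one below against a DNS traffic-steering API. The endpoint, policy name, and payload are hypothetical, and the 90/10 split is just a conservative starting point to raise in steps while error rates stay flat.

# Hypothetical traffic-steering API call: start with ~10% of traffic on the alternate CDN.
# Endpoint, ZONE_ID, and payload shape are illustrative; use your provider's real API.
curl -sS -X PUT "https://api.dns-provider.example/v1/zones/ZONE_ID/traffic-policies/www" \
  -H "Authorization: Bearer $DNS_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"records": [{"target": "primary-cdn.example.net", "weight": 90}, {"target": "alternate-cdn.example.net", "weight": 10}]}'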

5. Emergency certificate and TLS workarounds

When edge certs fail, you can terminate TLS at origin with a valid cert and use an A record to route traffic directly, or upload the same cert to the backup CDN. If your origin only has self‑signed certs, treat them as a last resort: clients will reject them unless explicitly trusted, so use them only if absolutely necessary.

# Check certificate chain and stapling
openssl s_client -connect yourdomain.com:443 -servername yourdomain.com -status

Note: OCSP and stapling errors can indicate provider side TLS issues even when certs are valid.
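
Before pointing DNS at the origin, it is worth confirming the origin presents a valid certificate and full chain for the public hostname. The checks below assume you know the origin IP; 203.0.113.10 is a placeholder.

# Verify the origin serves a valid certificate and full chain for the public hostname
echo | openssl s_client -connect 203.0.113.10:443 -servername yourdomain.com 2>/dev/null | openssl x509 -noout -subject -issuer -dates
curl -sv -o /dev/null --max-time 10 --resolve yourdomain.com:443:203.0.113.10 https://yourdomain.com/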

Communications: templates and cadence

Communication is triage too. Keep messages short, transparent, and on a regular cadence. Publish to your status page and internal channels.

Initial external message 0-30 minutes

We are aware of access issues affecting our site and are investigating. Early signals indicate an outage involving an external provider. We will post updates every 30 minutes until resolved. Severity: high. Impact: customers may see intermittent errors or timeouts.

Follow‑up updates every 30–60 minutes

Update: We identified the impact as a provider DNS/CDN outage. We are executing failover steps and expect to restore partial service within the hour. We will provide another update at HH:MM UTC.

Post‑recovery message

Service restored: Traffic is now routed through an alternate path. We are monitoring for stability and will publish a full incident report within 72 hours.

Internal incident update template

Incident ID: XXX
Scope: List of affected services
Timeline: Timestamps of detection, mitigation steps, current status
Next actions: Who is responsible and ETA

Always link to your status page and provide a way for customers to contact support for outages impacting SLAs.
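
If your status platform has an API, the initial notice can be posted programmatically so it goes out while engineers keep working. The example below uses a hypothetical status-platform endpoint and token; map the fields to whatever your provider actually exposes.

# Hypothetical status-platform API call: publish the initial incident notice programmatically.
# Endpoint, fields, and STATUS_TOKEN are illustrative.
curl -sS -X POST "https://api.statusplatform.example/v1/incidents" \
  -H "Authorization: Bearer $STATUS_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"name": "Access issues involving an external provider", "status": "investigating", "body": "We are aware of access issues affecting our site and are investigating. Next update in 30 minutes."}'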

Operational commands and checks to run

Here are practical commands for quick diagnosis. Run from different networks to isolate propagation effects.

# DNS
dig +trace yourdomain.com
# Query specific record types; many resolvers now minimize or refuse ANY queries (RFC 8482)
dig @1.1.1.1 yourdomain.com A
dig @1.1.1.1 yourdomain.com NS

# HTTP/TLS
curl -v --resolve yourdomain.com:443:203.0.113.10 https://yourdomain.com/

# Certificate
echo | openssl s_client -connect yourdomain.com:443 -servername yourdomain.com 2>/dev/null | openssl x509 -noout -text

# Network path
mtr --report --report-cycles 10 yourdomain.com

Rollback plans and when to execute them

Rolling back changes is sometimes safer than complex failovers. Keep rollback plans for these scenarios.

  • Failed deployment caused config drift: Roll back to the last known good config and keep DNS untouched until traffic stabilizes.
  • Third‑party provider misconfiguration: If the provider misapplied config that you pushed, revert your changes, ask the provider to reload without your configuration, then redeploy in a staged manner.
  • Certificate deployment broke TLS: Reinstall the previous cert and re‑enable OCSP stapling once the provider is stable.

Always test rollback in a staging environment. For DNS rollbacks, raise TTLs after a successful rollback to reduce unnecessary churn.

Automation and APIs you should have ready

In 2026, manual UIs are too slow. Keep scripts and runbooks that leverage provider APIs, and store minimal access keys for emergency use.

  • Registrar API key for mutating NS records quickly.
  • DNS provider API tokens for scripted record changes and health checks.
  • CDN or edge provider API to purge cache, change origin, or disable features like WAF if they block traffic erroneously (sketched after this list).
  • Monitoring and incident platform APIs to automatically create incidents and post status updates.
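
A sketch of what those prepared CDN calls might look like, using a hypothetical CDN API; the endpoints, zone ID, and token are illustrative and your provider's real API will differ.

# Hypothetical CDN API calls worth keeping in a single emergency script.
# Purge cached objects after switching origins
curl -sS -X POST "https://api.cdn-provider.example/v1/zones/ZONE_ID/purge" \
  -H "Authorization: Bearer $CDN_TOKEN" \
  -d '{"purge_everything": true}'
# Point the edge at a standby origin
curl -sS -X PATCH "https://api.cdn-provider.example/v1/zones/ZONE_ID/origin" \
  -H "Authorization: Bearer $CDN_TOKEN" \
  -d '{"origin": "standby-origin.example.net"}'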

Hardening to prevent future outages

After recovery, implement durable changes to reduce time to restore next time.

  1. Multi‑provider DNS and multi‑CDN with automated failover and health checks.
  2. Short but realistic TTL strategy for critical records during incident windows, and longer TTLs during calm operations.
  3. Origin exposure policy with emergency NAT IPs or VPNs so origin can accept traffic without exposing private networks.
  4. Certificate resilience by ensuring ACME/issuers and alternative certs are available and keys are deployable programmatically.
  5. Chaos and failover drills run quarterly: simulate provider failures and measure time to restore and communication quality; a minimal health‑check loop you can use in drills is sketched after this list. See our notes on observability for offline features to instrument drills.
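
A drill or automated failover can be driven by something as simple as the loop below. It assumes a /healthz endpoint and a failover_to_secondary.sh script that wraps your DNS/CDN API calls; both names and the thresholds are illustrative.

# Minimal health-check loop: after three consecutive failures through the primary path, trigger failover
FAILS=0
while true; do
  if curl -fsS -o /dev/null --max-time 10 https://yourdomain.com/healthz; then
    FAILS=0
  else
    FAILS=$((FAILS + 1))
  fi
  if [ "$FAILS" -ge 3 ]; then
    ./failover_to_secondary.sh && FAILS=0
  fi
  sleep 30
done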

Postmortem and SRE notes

Don’t skip root cause analysis. Your postmortem should include:

  • Timeline with exact timestamps and who did what.
  • Why the provider outage affected you and what in your architecture amplified impact.
  • Actions categorized into short term, medium term, and long term with owners and deadlines.
  • Updated runbook snippets and playbooks, committed to the runbook repository and covered by automated tests.

Quick checklist you can print and tape to your monitor

  • Confirm blast radius and provider status
  • Run dig, curl, openssl s_client, traceroute
  • Decide failover path: secondary DNS, alternate CDN, direct origin
  • Spin up emergency cert or reapply previous cert if TLS broken
  • Post initial status, update every 30 minutes
  • Rollback if configuration change introduced the issue
  • Publish postmortem and schedule a chaos drill

Developer notes and gotchas

  • Avoid lowering TTLs broadly during normal operations unless you have tested failovers; very low TTLs increase DNS query volume and cost and can destabilize caches.
  • When switching NS records, registrar propagation can be slower than expected. Use registrar APIs where possible and plan for up to 24 hours in worst cases.
  • DNS‑over‑HTTPS clients may not respect short TTLs the same way legacy resolvers do. Test with real browsers and mobile clients; a DoH spot check is sketched after this list.
  • Be careful exposing origin IPs; reputation and security controls like WAF and rate limiting must be reenabled quickly.
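
For the DoH point above, you can spot-check what public DoH resolvers return and compare the TTLs they advertise to clients; Cloudflare and Google both expose JSON query endpoints.

# Spot-check what public DoH resolvers return, including the advertised TTL
curl -s -H 'accept: application/dns-json' 'https://cloudflare-dns.com/dns-query?name=yourdomain.com&type=A'
curl -s 'https://dns.google/resolve?name=yourdomain.com&type=A'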

Real‑world example, simplified

During a January 2026 edge provider outage, several customers reported 502 and TLS errors. One hosting team used this runbook: they confirmed the provider was the issue via public monitors, switched to a secondary DNS provider via API within 20 minutes, and pointed A records to protected origin IPs while rotating certs to origin termination. They posted status updates every 30 minutes and had full traffic restored in 55 minutes. Postmortem led to a multi‑CDN contract and quarterly failover drills.

Actionable takeaways

  • Prepare automation ahead of incidents: registrar and DNS API scripts, alternate CDN setup, and stored emergency certs.
  • Keep communication templates ready and follow a strict cadence during incidents.
  • Practice failovers quarterly so the team knows the exact steps and time to restore.
  • After recovery, harden: multi‑provider strategies, origin exposure policies, and certificate redundancy.

Closing note and call to action

Third‑party outages will keep happening. The difference between a minor blip and a major incident is how prepared you are. Use this runbook, automate the painful parts, and practice until the whole team can restore service without panic. Start by running a simulated provider outage this quarter and measure time to restore.

Ready to test your runbook? Sign up for a free runbook health check and a simulated provider failover workshop at crazydomains.cloud or schedule a 1:1 with our uptime engineers to build your automated DNS and TLS failover scripts.
