Incident Response for Domains: What to Do When an External Provider Breaks Your Site
Runbook for domain admins: triage, DNS mitigations, SSL checks, comms templates, and rollbacks during third‑party outages.
Your site is down and the provider says it’s their problem. Now what?
Nothing tightens the stomach like a midnight alert that your app is unreachable and the error points to an external provider. Third‑party outages remain inevitable in 2026: major edge and DNS providers still fail, and the spikes in late 2025 and January 2026 showed how quickly dependencies cascade. This runbook is built for domain and hosting admins who need pragmatic, repeatable steps to triage, mitigate, communicate, and recover when someone else breaks your site.
Top‑level playbook overview
Follow the inverted pyramid: restore customer‑facing traffic first, keep stakeholders informed, avoid repeated flips between providers, then do root cause analysis. The core phases:
- Immediate triage (0-15 minutes): Identify the blast radius and confirm it is a third‑party outage.
- Mitigate (15-60 minutes): Use DNS, CDN, and certificate workarounds to restore reachability.
- Communicate (ongoing): Publish status updates and internal incident notes on a cadence.
- Stabilize and rollback (1-4 hours): Roll back faulty changes or route around the provider.
- Postmortem (24-72 hours): Document root cause, gaps, and long‑term fixes like multi‑provider failover.
Why this matters in 2026
Edge computing, managed DNS, and CDN services have become even more central to site architecture. That centralization reduces ops friction but increases systemic risk. Late 2025 and early 2026 outages reminded teams that relying on a single provider is a single point of failure. Trends to account for:
- Wider adoption of multi‑CDN and multi‑DNS strategies to meet stronger SLOs.
- Increased use of DNS-over-HTTPS clients, which changes how TTLs and propagation behave for some user agents.
- Automated certificate issuance via ACME remains dominant, but distribution models now include zero‑touch edge certs, which require new checks during outages.
- APIs for DNS and hosting are now the primary path for automated failover; manual dashboards are too slow. See our guidance on DNS and hosting APIs and on scripting emergency flows.
Immediate triage checklist (0-15 minutes)
Start with data, not assumptions. Use these quick checks to confirm whether the issue is provider side or your origin.
- Confirm the blast radius
- Check your synthetic monitoring and real user monitoring dashboards for geographies and error classes.
- Check public outage aggregators and social signals for the provider named in errors. Recent high‑visibility outages happened in January 2026 and were correlated across edge and DNS providers.
- Basic network and DNS checks
dig +short yourdomain.com A @8.8.8.8
dig +short yourdomain.com NS
curl -I https://yourdomain.com --max-time 10
traceroute -n yourdomain.com
Look for DNS timeouts, unexpected name servers, or HTTP connection failures at the network layer.
- Certificate check
echo | openssl s_client -connect yourdomain.com:443 -servername yourdomain.com 2>/dev/null | openssl x509 -noout -dates
Is the cert valid? Are OCSP stapling errors or expired certs present?
- Check provider status
- Open the provider status page and API; check recent incident messages and maintenance notices.
- If their status page is down, cross‑reference Twitter/X and community channels for confirmation.
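Many providers run Atlassian Statuspage‑style status pages that expose a small JSON API; if yours does (worth confirming before an incident; the hostname below is a placeholder), you can poll it from the command line instead of refreshing the page:
# Overall indicator (none, minor, major, critical) from a Statuspage-style status page
curl -s https://status.provider.example/api/v2/status.json | jq '.status'
# Any incidents the provider has acknowledged but not yet resolved
curl -s https://status.provider.example/api/v2/incidents/unresolved.json | jq '.incidents[].name'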
Decision tree: Is this really the provider?
Quick decision flow to decide next steps.
- If DNS lookups fail across multiple resolvers and provider status shows an outage, treat as provider DNS outage.
- If DNS is correct but HTTP/TLS fails with TCP connection errors and provider status shows CDN/edge outage, treat as edge outage.
- If only your origin is failing and DNS/CDN show healthy, treat this as origin outage and follow host recovery path.
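As a rough illustration of this flow, the sketch below checks DNS against two public resolvers and then probes HTTPS at the edge; it is a starting point under simple assumptions, not a replacement for your monitoring:
DOMAIN=yourdomain.com
# 1) Does the name resolve from multiple public resolvers?
# 2) If DNS answers, does HTTPS work at the edge?
dig +short +time=3 +tries=1 @8.8.8.8 "$DOMAIN" A > /tmp/answers_8888
dig +short +time=3 +tries=1 @1.1.1.1 "$DOMAIN" A > /tmp/answers_1111
if [ ! -s /tmp/answers_8888 ] && [ ! -s /tmp/answers_1111 ]; then
  echo "DNS failing on multiple resolvers: treat as a provider DNS outage"
elif ! curl -sSf --max-time 10 -o /dev/null "https://$DOMAIN/"; then
  echo "DNS answers but HTTPS fails: treat as a CDN/edge outage"
else
  echo "DNS and edge respond: suspect the origin or application; follow the host recovery path"
fi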
Mitigation options (15-60 minutes)
Pick the least invasive option that restores reachability. Try these in order until the site is reachable.
1. Use a secondary DNS provider for emergency failover
If you already have a secondary DNS provider (or an API-enabled standby), promote the secondary zone or change the NS delegation (and glue records, if your nameservers live inside the zone) at the registrar, depending on what the registrar supports. For most teams the fastest path is updating the zone at the secondary DNS provider and switching registrar nameserver settings, provided the registrar allows quick changes via API.
- Automate this: keep scripts that swap NS records via the registrar API and monitor propagation (see the sketch after this list).
- Reduce risk: if you plan for rapid failover, keep TTLs for critical records at 60-300 seconds during normal operations.
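A minimal sketch of that automation, assuming a registrar that exposes an HTTPS API; the endpoint and payload here are hypothetical placeholders to swap for your registrar's real API:
# Hypothetical registrar API call: repoint delegation at the secondary DNS provider
curl -s -X PUT "https://api.registrar.example/v1/domains/yourdomain.com/nameservers" \
  -H "Authorization: Bearer $REGISTRAR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"nameservers": ["ns1.secondary-dns.example", "ns2.secondary-dns.example"]}'
# Watch the delegation change at the parent zone (it appears in the authority section of the referral)
watch -n 30 'dig +noall +authority NS yourdomain.com @a.gtld-servers.net'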
2. Point traffic to origin directly via an emergency A/ALIAS record
If the CDN or edge is down and the origin can safely receive traffic from the public internet, add an A or ALIAS record pointing to the origin IP. Check Host header and virtual-host requirements so requests reach the right site.
# Confirm what public resolvers currently return before changing anything
dig +short @1.1.1.1 yourdomain.com A
# Then update DNS using your provider API to set an A record to the origin IP (see the example below)
Risks: you expose the origin IP, and you may bypass rate limiting and WAF protections.
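If your DNS zone happens to live on a provider with a Cloudflare-style API (an assumption; substitute your provider's equivalent call), the emergency record can be created with a single authenticated request:
# Create an emergency A record pointing at the origin (Cloudflare v4 API shown as one example)
curl -s -X POST "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/dns_records" \
  -H "Authorization: Bearer $DNS_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"type":"A","name":"yourdomain.com","content":"203.0.113.10","ttl":60,"proxied":false}'
Keep the TTL low so the record is easy to retract once the edge recovers.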
3. Shorten TTLs and rely on low‑TTL records
If you can anticipate switching providers, keep critical records with moderate TTLs. If you need agility, set TTL to 60-300 seconds during incident windows, then raise after stability.
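To see the TTL resolvers are currently handing out for a record, check the answer section of a plain dig:
# The second field in the answer line is the remaining TTL in seconds
dig +noall +answer yourdomain.com A @1.1.1.1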
4. Bypass failing provider with an alternate CDN or multi‑CDN routing
If you have multi‑CDN configured, shift traffic using your load balancer or DNS traffic steering. If you don't have multi‑CDN but have an alternate provider ready, update DNS records to point to the alternate provider. Use weighted routing to reduce risk during cutover. For practical notes on multi‑provider setups and cost tradeoffs, see edge caching & cost control.
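As one concrete illustration, if your traffic steering happens to run on Amazon Route 53 (an assumption; other DNS traffic managers have equivalents), a weighted record can send a small slice of traffic to the alternate CDN first:
# Send roughly 10% of lookups to the alternate CDN, assuming the primary record carries a weight of 90
aws route53 change-resource-record-sets --hosted-zone-id "$ZONE_ID" --change-batch '{
  "Changes": [{
    "Action": "UPSERT",
    "ResourceRecordSet": {
      "Name": "www.yourdomain.com",
      "Type": "CNAME",
      "SetIdentifier": "alt-cdn",
      "Weight": 10,
      "TTL": 60,
      "ResourceRecords": [{"Value": "www.yourdomain.com.alt-cdn.example.net"}]
    }
  }]
}'
Raise the weight in steps as the alternate path proves healthy.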
5. Emergency certificate and TLS workarounds
When edge certs fail, you can terminate TLS at the origin with a valid cert and use an A record to route traffic directly, or upload the same cert to the backup CDN. If the origin only has a self‑signed cert, issue a proper certificate before cutover if at all possible; asking clients to trust an interim self‑signed cert should be a last resort.
# Check certificate chain and stapling
echo | openssl s_client -connect yourdomain.com:443 -servername yourdomain.com -status
Note: OCSP and stapling errors can indicate provider side TLS issues even when certs are valid.
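If the origin does not already hold a publicly trusted certificate, an ACME client can usually issue one in minutes; a sketch using certbot in standalone mode, assuming port 80 on the origin is reachable for the HTTP-01 challenge:
# Issue a certificate directly on the origin so it can terminate TLS itself
sudo certbot certonly --standalone -d yourdomain.com -d www.yourdomain.com
# Certificates land under /etc/letsencrypt/live/yourdomain.com/ for your web server config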
Communications: templates and cadence
Communication is triage too. Keep messages short, transparent, and on a regular cadence. Publish to your status page and internal channels.
Initial external message (0-30 minutes)
We are aware of access issues affecting our site and are investigating. Early signals indicate an outage involving an external provider. We will post updates every 30 minutes until resolved. Severity: high. Impact: customers may see intermittent errors or timeouts.
Follow‑up updates every 30–60 minutes
Update: We identified the impact as a provider DNS/CDN outage. We are executing failover steps and expect to restore partial service within the hour. We will provide another update at HH:MM UTC.
Post‑recovery message
Service restored: Traffic is now routed through an alternate path. We are monitoring for stability and will publish a full incident report within 72 hours.
Internal incident update template
Incident ID: XXX
Scope: List of affected services
Timeline: Timestamps of detection, mitigation steps, current status
Next actions: Who is responsible and ETA
Always link to your status page and provide a way for customers to contact support for outages impacting SLAs.
Operational commands and checks to run
Here are practical commands for quick diagnosis. Run from different networks to isolate propagation effects.
# DNS
dig +trace yourdomain.com
dig @1.1.1.1 yourdomain.com A
dig @1.1.1.1 yourdomain.com NS
# HTTP/TLS
curl -v --resolve yourdomain.com:443:203.0.113.10 https://yourdomain.com/
# Certificate
echo | openssl s_client -connect yourdomain.com:443 -servername yourdomain.com 2>/dev/null | openssl x509 -noout -text
# Network path
mtr --report --report-cycles 10 yourdomain.com
Rollback plans and when to execute them
Rolling back changes is sometimes safer than complex failovers. Keep rollback plans for these scenarios.
- Failed deployment caused config drift: Roll back to the last known good config and keep DNS untouched until traffic stabilizes.
- Third‑party provider misconfiguration: If the provider misapplied config that you pushed, revert your changes, ask the provider to reload without your configuration, then redeploy in a staged manner.
- Certificate deployment broke TLS: Reinstall the previous cert and re-enable OCSP stapling once the provider is stable.
Always test rollback in a staging environment. For DNS rollbacks, raise TTLs after a successful rollback to reduce unnecessary churn.
Automation and APIs you should have ready
In 2026, manual UIs are too slow. Keep scripts and runbooks that leverage provider APIs, and store minimally scoped credentials for emergency use.
- Registrar API key for mutating NS records quickly.
- DNS provider API tokens for scripted record changes and health checks.
- CDN or edge provider API to purge cache, change origin, or disable features like WAF if they block traffic erroneously.
- Monitoring and incident platform APIs to automatically create incidents and post status updates.
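Tied together, the emergency path can be a single script. The sketch below is a skeleton only, and every endpoint in it is a hypothetical placeholder for your real registrar, DNS, CDN, and incident tooling APIs:
#!/usr/bin/env bash
# Skeleton emergency failover; replace the placeholder endpoints with your providers' real APIs
set -euo pipefail
DOMAIN="yourdomain.com"
# 1. Repoint DNS at the standby target (DNS provider API)
curl -fsS -X POST "https://api.dns-provider.example/v1/zones/$DOMAIN/records" \
  -H "Authorization: Bearer $DNS_API_TOKEN" \
  -d '{"type":"A","name":"yourdomain.com","content":"203.0.113.10","ttl":60}'
# 2. Purge stale objects at the surviving CDN (CDN provider API)
curl -fsS -X POST "https://api.cdn-provider.example/v1/purge" \
  -H "Authorization: Bearer $CDN_API_TOKEN" \
  -d '{"purge_everything": true}'
# 3. Open an incident so status updates start immediately (incident platform API)
curl -fsS -X POST "https://api.incident-tool.example/v1/incidents" \
  -H "Authorization: Bearer $INCIDENT_TOKEN" \
  -d '{"title":"Provider outage failover in progress","severity":"high"}'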
Hardening to prevent future outages
After recovery, implement durable changes to reduce time to restore next time.
- Multi‑provider DNS and multi‑CDN with automated failover and health checks.
- Short but realistic TTL strategy for critical records during incident windows, and longer TTLs during calm operations.
- Origin exposure policy with emergency NAT IPs or VPNs so origin can accept traffic without exposing private networks.
- Certificate resilience by ensuring ACME/issuers and alternative certs are available and keys are deployable programmatically.
- Chaos and failover drills run quarterly — simulate provider failures and measure time to restore and communication quality. See our notes on observability for offline features to instrument drills.
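For the automated health checks behind that failover, and for timing drills, even a small watchdog loop is a workable start; this sketch assumes a /healthz endpoint on the site and an emergency-failover script like the skeleton above:
# Probe the public hostname; after three consecutive failures, trigger the failover script
FAILS=0
while true; do
  if curl -sf --max-time 10 -o /dev/null "https://yourdomain.com/healthz"; then
    FAILS=0
  else
    FAILS=$((FAILS + 1))
  fi
  if [ "$FAILS" -ge 3 ]; then
    ./emergency-failover.sh && break
  fi
  sleep 30
done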
Postmortem and SRE notes
Don’t skip root cause analysis. Your postmortem should include:
- Timeline with exact timestamps and who did what.
- Why the provider outage affected you and what in your architecture amplified impact.
- Actions categorized into short term, medium term, and long term with owners and deadlines.
- Updated runbook snippets and playbooks added to runbook repository and automated tests.
Quick checklist you can print and tape to your monitor
- Confirm blast radius and provider status
- Run dig, curl, openssl s_client, traceroute
- Decide failover path: secondary DNS, alternate CDN, direct origin
- Spin up emergency cert or reapply previous cert if TLS broken
- Post initial status, update every 30 minutes
- Rollback if configuration change introduced the issue
- Publish postmortem and schedule a chaos drill
Developer notes and gotchas
- Avoid lowering TTLs broadly during normal operations unless you have tested failovers; overly low TTLs increase DNS query volume and cost and can destabilize caches.
- When switching NS records, registrar propagation can be slower than expected. Use registrar APIs where possible and plan for up to 24 hours in worst cases.
- DNS over HTTPS clients may not respect short TTLs the same way legacy resolvers do. Test with real browsers and mobile clients.
- Be careful exposing origin IPs; reputation and security controls like WAF and rate limiting must be re-enabled quickly.
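To test the DNS-over-HTTPS behaviour noted above, query a public DoH resolver directly and compare the TTL it returns with what classic resolvers report; Cloudflare's JSON endpoint is used here as one example:
# JSON DoH query; the TTL field in each answer shows what DoH clients will cache
curl -s -H 'accept: application/dns-json' \
  'https://cloudflare-dns.com/dns-query?name=yourdomain.com&type=A' | jq '.Answer'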
Real‑world example, simplified
During a January 2026 edge provider outage, several customers reported 502 and TLS errors. One hosting team used this runbook: they confirmed the provider was at fault via public monitors, switched to a secondary DNS provider via API within 20 minutes, and pointed A records at protected origin IPs while moving TLS termination to the origin. They posted status updates every 30 minutes and restored full traffic in 55 minutes. The postmortem led to a multi‑CDN contract and quarterly failover drills.
Actionable takeaways
- Prepare automation ahead of incidents: registrar and DNS API scripts, alternate CDN setup, and stored emergency certs.
- Keep communication templates ready and follow a strict cadence during incidents.
- Practice failovers quarterly so the team knows the exact steps and time to restore.
- After recovery, harden: multi‑provider strategies, origin exposure policies, and certificate redundancy.
Closing note and call to action
Third‑party outages will keep happening. The difference between a minor blip and a major incident is how prepared you are. Use this runbook, automate the painful parts, and practice until the whole team can restore service without panic. Start by running a simulated provider outage this quarter and measure time to restore.
Ready to test your runbook? Sign up for a free runbook health check and a simulated provider failover workshop at crazydomains.cloud or schedule a 1:1 with our uptime engineers to build your automated DNS and TLS failover scripts.
Related Reading
- Edge Caching & Cost Control for Real‑Time Web Apps in 2026
- Advanced Strategies: Observability for Mobile Offline Features (2026)
- Deploying Offline-First Field Apps on Free Edge Nodes — 2026 Strategies for Reliability and Cost Control