Network Forensics After an Outage: Tracing Errors from BGP to Browser

crazydomains
2026-02-09 12:00:00
11 min read

Practical playbook for post-outage forensics: collect BGP traces, CDN edge logs, tcpdump pcaps and browser NetLogs to pinpoint mass outages fast.

When a mass outage strikes, you don't have minutes—you have a collapsing timeline of customer complaints, fragmented logs, and blame-shifting between ISPs, CDNs, and application teams. This guide gives experienced engineers a step-by-step, technical playbook for collecting the precise artifacts you need—BGP traces, packet captures, CDN edge diagnostics, and browser logs—to quickly pinpoint who (or what) broke and why.

TL;DR — What to do first

  1. Stabilize and preserve evidence. Freeze rolling log rotation, start ring-buffer pcaps, capture CDN edge logs and browser HAR/NetLog files.
  2. Correlate timelines. Ensure NTP sync, normalize timestamps to UTC, and build a timeline of user-facing errors vs. infra events.
  3. Check global routing. Query route collectors and Looking Glasses for BGP withdrawals/announcements and RPKI validation alerts.
  4. Capture network traffic. Use tcpdump/tshark at edge and origin with filters tuned to your outage signature.
  5. Pull CDN diagnostics. Collect edge logs (real-time log streaming), edge request-IDs and POP-level metrics.
  6. Collect browser-level evidence. HAR/NetLog/RUM traces from affected clients to prove experienced behavior.
  7. Correlate and attribute. Match request IDs, IPs, AS paths and timestamps to find the root cause.

Why this matters in 2026

Late 2025 and early 2026 saw accelerated adoption of eBPF-based observability, wider RPKI rollouts by Tier-1 ISPs, and deeper CDN edge observability APIs. That means SREs now have more granular telemetry—if you collect it fast. At the same time, browser privacy changes and stricter telemetry limits mean client-side logs are more valuable and slightly harder to get. This walkthrough assumes modern toolchains (tcpdump/tshark, eBPF tools, RIPE/RouteViews APIs, CDN edge diagnostics, and browser NetLog/HAR capture) and shows how to bring them together.

Prepare: secure the scene and synchronize time

Before deep-dives, make sure the data you'll collect can be trusted.

  • Stop destructive rotation: Temporarily disable or change log rotation on critical services so logs remain intact for forensic review.
  • Time sync: Confirm NTP/chrony is running on all involved systems. Use `chronyc tracking` or `ntpq -p` and log the offsets. If hosts are off by more than a second, record the offsets and correct for them in your timeline. (For constrained hosts, see guidance on embedded Linux time-sync techniques.)
  • Chain of custody: Hash logs and pcaps (sha256sum), store original copies read-only (e.g., S3 with SSE), and note who accessed the evidence (see the example commands after this list).
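
A minimal sketch of the preservation step, assuming GNU coreutils and the AWS CLI are available; the bucket name, paths and log locations are illustrative:

# Hash every artifact before it leaves the host, and keep the manifest with the evidence
sha256sum /tmp/outage.pcap* /var/log/nginx/access.log* > /tmp/evidence-manifest.sha256

# Copy originals to encrypted object storage and treat them as read-only from here on
for f in /tmp/outage.pcap* /tmp/evidence-manifest.sha256; do
  aws s3 cp "$f" s3://incident-evidence/2026-02-09/ --sse AES256
done

# Record who collected what, where, and when
echo "$(date -u +%FT%TZ) collected by $(whoami) on $(hostname)" >> /tmp/evidence-manifest.log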

Step 1 — Quick global checks: BGP and DNS

Mass outages often look like application failures but are network-level: route leaks, BGP withdrawals, DNS propagation failures, or CDN control-plane issues. Start with global reachability.

Check BGP status (fast)

  1. Query public collectors: RIPE RIS, RouteViews, and BGPStream. These show whether your prefixes were announced or withdrawn globally.
  2. Use Looking Glass services for major IXPs and transit providers. They reveal how the prefix appears from different ASes.

Practical commands and APIs:

  • RIPEstat: curl "https://stat.ripe.net/data/looking-glass/data.json?resource=YOUR_PREFIX"
  • BGPStream (Python): use the API to fetch updates for your prefix during the incident window (a curl-based RIPEstat sketch follows this list).
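
A quick sketch of both checks using curl and jq; the prefix and time window are illustrative, and the exact JSON shape should be confirmed against the RIPEstat documentation:

# Snapshot how RIPE RIS collectors currently see the prefix
curl -s "https://stat.ripe.net/data/looking-glass/data.json?resource=203.0.113.0/24" | jq . > ris-looking-glass.json

# Pull announcements/withdrawals for the incident window (RIPEstat bgp-updates data call)
curl -s "https://stat.ripe.net/data/bgp-updates/data.json?resource=203.0.113.0/24&starttime=2026-02-09T10:25:00&endtime=2026-02-09T10:45:00" | jq . > ris-bgp-updates.json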

What to look for:

  • Sudden withdrawals of your /24s, or announcements from unexpected ASes (a possible leak or hijack).
  • AS path changes or prepends that coincide with the outage timeframe.
  • RPKI invalids or validation churn indicating routing issues—check if your prefixes were marked INVALID during the window.

DNS sanity checks

DNS is another common culprit for mass outages—misconfigured records, TTLs, or DNS provider control-plane failures.

  • Query from multiple vantage points: run `dig +trace example.com`, and compare answers from public resolvers (`@1.1.1.1`, `@8.8.8.8`) and your ISP's resolver (see the dig examples after this list).
  • Check authoritative NS responses and SOA serials across regions to catch replication issues.
  • Inspect CDN-provided CNAME chains—an expired DNS entry to an edge domain can route traffic to nowhere.
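
A few concrete checks, assuming example.com is the affected zone and ns1/ns2.example.com are its authoritative servers (substitute your own names):

# Walk the delegation from the root to spot broken or lame delegations
dig +trace example.com A

# Compare answers from independent public resolvers
dig @1.1.1.1 example.com A +short
dig @8.8.8.8 example.com A +short

# Compare SOA serials directly on each authoritative server to catch replication lag
dig @ns1.example.com example.com SOA +short
dig @ns2.example.com example.com SOA +short

# Follow the CDN-provided CNAME chain end to end
dig www.example.com CNAME +short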

Step 2 — Capture the network: tcpdump, tshark and packet capture best practices

When BGP and DNS checks don’t immediately show the root cause, capture packets at the edge and origin. Proper capture is critical; a poorly configured tcpdump yields either noise or nothing.

Design your capture

  • Location: capture at the CDN edge if possible, at your edge router, and at origin servers.
  • Filters: capture only relevant ports (80/443/53) and IPs to keep sizes manageable.
  • Ring buffer and segmentation: use -C and -W to limit storage and preserve rolling history.

Example tcpdump commands

tcpdump -i eth0 -s 0 -w /tmp/outage.pcap -C 200 -W 12 'tcp port 80 or tcp port 443 or port 53'

Notes:

  • -s 0 captures full frames (necessary for TLS/HTTP analysis).
  • -C 200 creates 200MB files; -W 12 keeps 12 files (ring buffer).
  • For reduced privacy exposure, use tcpflow or tshark to extract HTTP headers only (see the sketch after these notes).
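
If a capture has to be shared with a third party, a header-only extraction like the sketch below keeps payloads and cookies out of the export (the field list is illustrative; adjust to what you need):

# Emit request metadata only: timestamp, endpoints, host, method and path; no bodies or cookies
tshark -r outage.pcap -Y 'http.request' -T fields \
  -e frame.time_epoch -e ip.src -e ip.dst -e http.host -e http.request.method -e http.request.uri \
  > outage-http-requests.tsv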

Quick tshark filters to extract indicators

  • HTTP status distribution:
    tshark -r outage.pcap -Y 'http.response.code' -T fields -e http.response.code | sort | uniq -c
  • TLS alerts (handshake failures, certificate problems):
    tshark -r outage.pcap -Y 'tls.alert_message' -T fields -e ip.src -e ip.dst -e tls.alert_message.desc
  • DNS errors (NXDOMAIN/SERVFAIL):
    tshark -r outage.pcap -Y 'dns.flags.rcode != 0' -T fields -e dns.flags.rcode -e dns.qry.name

Step 3 — CDN edge diagnostics

CDNs introduce a layer between users and your origin. Modern CDNs provide a variety of edge diagnostics: request IDs, POP-level metrics, real-time log streams, and synthetic probes. Collect them in parallel.

What to pull from the CDN

  • Real-time logs: Pull edge logs (Logpush/real-time streaming) for the incident window. Look for a surge in 5xx errors (e.g., 502/504, or CDN-specific codes such as Cloudflare's 522) and for request IDs tied to failing requests.
  • POP distribution: Which Points-of-Presence saw the failures? Are failures concentrated in a region or global?
  • Control-plane alerts: Check the CDN status page and your account alerts for config changes or infra incidents.
  • Edge health checks: Were origin health checks failing? A health-check configuration mistake can pull an origin out of rotation.

CDN-specific tips

  • Cloudflare: capture CF-Ray headers and use Cloudflare's Trace debug interface; download Logpull logs for the timeframe (a Logpull sketch follows this list).
  • Fastly: use real-time log streaming and VCL log statements for request-level debugging.
  • AWS CloudFront: check real-time metrics and download real-time logs (if enabled); confirm the distribution's origin settings.
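
As one concrete example, a Cloudflare Logpull request for the incident window might look like the sketch below; the zone ID, API token and field list are illustrative, and availability depends on your plan's log retention:

# Pull edge request logs for the incident window (NDJSON, one record per line)
curl -s "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/logs/received?start=2026-02-09T10:25:00Z&end=2026-02-09T10:45:00Z&fields=RayID,EdgeStartTimestamp,ClientIP,ClientRequestHost,ClientRequestURI,EdgeResponseStatus,OriginResponseStatus" \
  -H "Authorization: Bearer $CF_API_TOKEN" > cf-edge-logs.ndjson

# Count edge status codes to see whether the edge or the origin was failing
jq -r '.EdgeResponseStatus' cf-edge-logs.ndjson | sort | uniq -c | sort -rn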

Step 4 — Browser-level evidence: HAR, NetLog and RUM

Users see the result: blank pages, timeouts, or TLS errors. Browser logs prove the user experience and often include request IDs that map to CDN logs.

Collect HAR and NetLog

  • HAR: In Chrome/Edge/Firefox, open DevTools > Network > right-click > Save all as HAR with content. HARs include resource timings and response codes.
  • NetLog (Chrome): Visit chrome://net-export and start capture. This produces a comprehensive network stack log that includes socket/TCP/TLS events—very useful for TLS handshake and DNS resolution failures.
  • RUM traces: If you run OpenTelemetry/Datadog/RUM scripts, pull RUM traces for the timeframe (they'll show real-user latencies and errors aggregated by geography and ISP).

What to extract from browser logs

  • DNS resolution times and failures (e.g., unusually long lookups or SERVFAIL responses).
  • TLS handshake failures and certificate errors.
  • Which resource(s) fail first—HTML main document or subresources (JS/CSS)?
  • Request IDs and headers that map to CDN/edge logs (e.g., CF-Ray, X-Request-ID); a jq extraction sketch follows this list.
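
A HAR file is plain JSON, so jq can pull out failing requests and their correlation headers directly; a minimal sketch (header names depend on your CDN):

# List failed or error responses with status code and URL
jq -r '.log.entries[] | select(.response.status >= 400 or .response.status == 0) | "\(.response.status) \(.request.url)"' outage.har

# Extract CDN correlation headers (here: cf-ray) for matching against edge logs
jq -r '.log.entries[] | .response.headers[] | select(.name | ascii_downcase == "cf-ray") | .value' outage.har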

Step 5 — Correlate: build a single timeline

Correlation is where root-cause emerges. Use a spreadsheet or a timeline tool (Elastic, Grafana Tempo) to line up events by UTC epoch seconds.

  1. Normalize timestamps from BGP collectors, CDN logs, server logs, pcaps and browser logs to UTC.
  2. Mark user-visible error spikes (from RUM or monitoring) and line up with BGP updates/withdrawals and CDN errors.
  3. Filter pcaps for the exact request IDs or IPs in failing client logs and trace the TCP/TLS handshake timings.

Example correlation pattern: a sudden BGP withdrawal for your prefix at 10:32:10 UTC; multiple CDN POPs mark origin unreachable at 10:32:12; browser NetLog shows repeated TCP SYN retransmits starting at 10:32:13, and RUM spikes for 504s at 10:32:15. The chain points to a routing issue.
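
To confirm a pattern like this from your own captures, pull the retransmissions for an affected client straight out of the pcap; a sketch, with an illustrative client IP:

# TCP retransmissions involving one affected client, with epoch timestamps for the timeline
tshark -r outage.pcap -Y 'ip.addr == 198.51.100.23 && tcp.analysis.retransmission' \
  -T fields -e frame.time_epoch -e ip.src -e ip.dst -e tcp.seq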

Advanced techniques

Use eBPF for high-fidelity tracing

When you need socket-level visibility without massive pcaps, eBPF tools (bcc, bpftrace, Cilium/Hubble) provide per-socket and per-process flow data at low cost.

  • Capture TCP latencies, retransmits and drops at the kernel level.
  • Trace TLS handshake durations per process to identify slow handshakes that would be invisible in application logs (low-overhead starting points are sketched after this list).
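
A couple of low-overhead starting points, assuming bpftrace and the bcc tools are installed (tool names vary slightly by distribution, e.g. tcpretrans vs. tcpretrans-bpfcc):

# Count TCP retransmits per process without capturing any packets
bpftrace -e 'kprobe:tcp_retransmit_skb { @retransmits[comm] = count(); }'

# Per-connection lifetimes, byte counts and durations (bcc)
tcplife-bpfcc

# Live trace of retransmits with remote address and TCP state (bcc)
tcpretrans-bpfcc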

RPKI and BGP security checks

Check RPKI validation for your prefixes: an INVALID state can cause your announcements to be rejected by validating routers. Use public RPKI validators, RIPEstat, or your transit provider's console for evidence; a quick check is sketched below.
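
One quick way to capture evidence of validation state, assuming the RIPEstat rpki-validation data call (the origin AS and prefix are illustrative; verify the parameter names against the current API docs):

# RPKI validation state for an origin AS / prefix pair as seen by RIPEstat
curl -s "https://stat.ripe.net/data/rpki-validation/data.json?resource=AS64500&prefix=203.0.113.0/24" | jq '.data'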

Automating forensic collection

For repeatable, fast investigations, script the collection so the first minutes of evidence gathering don't depend on memory under pressure.
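
A minimal sketch of such a script, assuming bash, tcpdump, curl and dig on the collection host; the prefix, interface, hostnames and paths are all illustrative:

#!/usr/bin/env bash
# first-minutes.sh: kick off evidence collection at the start of an incident (illustrative sketch)
set -euo pipefail

PREFIX="203.0.113.0/24"   # affected prefix (illustrative)
IFACE="eth0"              # capture interface
OUT="/var/tmp/incident-$(date -u +%Y%m%dT%H%M%SZ)"
mkdir -p "$OUT"

# 1. Ring-buffer packet capture in the background (12 x 200MB files)
tcpdump -i "$IFACE" -s 0 -w "$OUT/edge.pcap" -C 200 -W 12 \
  'tcp port 80 or tcp port 443 or port 53' &
echo $! > "$OUT/tcpdump.pid"

# 2. Snapshot global routing state for the prefix
curl -s "https://stat.ripe.net/data/looking-glass/data.json?resource=$PREFIX" > "$OUT/ris-looking-glass.json"

# 3. Snapshot DNS answers from independent resolvers
dig @1.1.1.1 example.com A +short > "$OUT/dns-1111.txt" || true
dig @8.8.8.8 example.com A +short > "$OUT/dns-8888.txt" || true

# 4. Hash the routing and DNS snapshots now; hash the pcaps once the capture is stopped
( cd "$OUT" && sha256sum ris-looking-glass.json dns-*.txt > manifest.sha256 )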

Case study: simulated timeline and root cause (hypothetical)

Scenario: At 2026-01-16 10:30 UTC, customers report site errors. Here's a concise reconstruction.

  1. 10:30:05 — RUM shows spike in 5xx from North America.
  2. 10:30:08 — RIPE RIS shows withdrawals for prefix 203.0.113.0/24 from multiple collectors (BGP withdrawals).
  3. 10:30:10 — CDN edge logs show health-check failures for origin IP 203.0.113.45 across multiple POPs (500s returned).
  4. 10:30:12 — tcpdump at origin shows no incoming TCP SYNs from several upstream ASes, consistent with traffic being dropped upstream once the route was withdrawn.
  5. 10:30:20 — Browser NetLog from an affected user shows repeated DNS lookups and TCP SYN retransmits: not a TLS error, because no connection was ever established.

Conclusion: a routing incident (likely an upstream BGP configuration error or leak) made the origin unreachable from multiple CDN POPs. The CDN marked the origin unhealthy and began serving errors. Mitigation involved announcing the prefix from a backup transit, coordinating with the transit provider to remediate the misannouncement, and publishing RPKI ROAs so that validating networks can reject future misoriginations.

Practical checklist: what to collect now

  • Ring-buffer pcaps from edge and origin (tcpdump -s0 -C -W).
  • Server access and application logs (with rotation paused).
  • CDN real-time logs and POP-level metrics, request-IDs.
  • BGP collector snapshots (RIPE RIS, RouteViews) and Looking Glass outputs for your prefixes.
  • Browser HAR/NetLog from affected users and RUM traces.
  • Hashes of all files, stored in immutable storage.

Developer notes and gotchas

  • Timestamps: Avoid misattribution—timezone mismatches are among the most common causes of false correlation.
  • Privacy: Extract only headers and non-sensitive traces when sharing with third parties (redact cookies and auth headers).
  • Storage size: Full pcaps are large—use ring buffers and short capture windows anchored to alerts.
  • Third-party pushback: CDNs and transit providers may limit access to raw BGP logs; use their diagnostic APIs and request official incident summaries if needed.

Good forensic work is not just about finding who or what broke; it is about producing irrefutable evidence that lets you fix it faster.

Future-proofing: what to implement now (2026+)

  • Stream CDN edge logs into your SIEM continuously, not only during incidents.
  • Publish and maintain RPKI ROAs for your prefixes, and monitor their validation state.
  • Adopt eBPF-based observability so kernel-level network telemetry is available on demand.
  • Keep forensic collection scripted and versioned (pcaps, BGP snapshots, CDN log pulls) so it runs with one command.
  • Run regular table-top drills to validate the collection pipeline before the next outage.

Actionable takeaways

  1. When an outage starts: capture evidence immediately (pcap, CDN logs, browser HAR/NetLog) and freeze rotations.
  2. Always query BGP collectors and Looking Glasses early—many “application outages” are routing faults.
  3. Correlate request-IDs across CDN edge logs, origin logs and browser traces to prove causation.
  4. Use eBPF for fast kernel-level debugging without terabytes of pcap files.
  5. Make forensic collection automatic—scripted captures reduce human error under pressure.

Closing: next steps and call-to-action

Network forensics after a mass outage is a multidisciplinary exercise: BGP and DNS checks, packet captures, CDN edge diagnostics, and browser logs are all pieces of the same puzzle. In 2026, the difference between a good incident response and a great one is how quickly you can collect correlated, timestamped evidence across those layers.

If you want a turnkey starting point, download our incident-playbook template that includes tcpdump scripts, a BGP-check script, a CDN log puller, and a browser HAR collection guide—pre-filled for cloud and managed WordPress infrastructures. Use it to automate your first 10 minutes of forensic collection so your team can move from firefighting to root cause and remediation.

Act now: get the playbook, integrate CDN log streaming into your SIEM, and schedule a table-top drill this quarter to validate your collection pipeline.

