A CDN for live video events faces different challenges than a CDN for VOD or general web content. The traffic pattern is concentrated: millions of viewers requesting the same content at the same time, all watching the live edge. The cache window is tiny — segments that are seconds old are the freshest content. And failure is not an option — there is no retry button for a live event.
This guide covers how to build and operate CDN infrastructure that handles live events at scale, from sports broadcasts to global premieres.
Live CDN architecture fundamentals
The live delivery chain
- Encoder/packager produces live HLS/DASH segments and manifests in real time
- Origin stores the live segments and serves the manifest
- Mid-tier cache (shield) absorbs repeated requests from edge nodes
- Edge PoPs serve viewers from cache, falling back to mid-tier on miss
- Player fetches manifest and segments, managing ABR and playback
Each layer has specific requirements for live:
- Origin must publish segments with minimal latency. A 500ms delay at the origin propagates to every viewer.
- Mid-tier must cache segments immediately and serve them to edge nodes with minimal added latency.
- Edge must handle the thundering herd: the first request for a new segment triggers a cache miss, and all subsequent requests must be held until the segment is cached (request coalescing).
Request coalescing
When a new live segment is published, hundreds of edge PoPs simultaneously request it from the mid-tier or origin. Without request coalescing, each concurrent request triggers a separate origin fetch.
Request coalescing (also called request collapsing or collapsed forwarding) ensures that only the first request for a segment goes to origin. All subsequent requests for the same segment are queued and served from the same response.
Most CDNs support this natively, but it must be explicitly enabled and tested for live content. Misconfigured coalescing causes either origin overload (no coalescing) or increased latency (overly aggressive hold times).
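To make the mechanism concrete, here is a minimal coalescing cache in Python. This is a sketch, assuming fetch_from_origin is a coroutine that performs the actual origin request; a production implementation would add TTL handling, eviction, and error-path policies.

```python
import asyncio

class SegmentCache:
    """Minimal request coalescing: one origin fetch per segment URL,
    with concurrent requesters awaiting the same in-flight task."""

    def __init__(self, fetch_from_origin):
        self._fetch = fetch_from_origin  # assumed coroutine: url -> bytes
        self._cache: dict[str, bytes] = {}
        self._inflight: dict[str, asyncio.Task] = {}

    async def get(self, url: str) -> bytes:
        if url in self._cache:            # hit: serve from cache
            return self._cache[url]
        task = self._inflight.get(url)
        if task is None:                  # first miss: start the one fetch
            task = asyncio.create_task(self._fetch_and_store(url))
            self._inflight[url] = task
        return await task                 # later misses share this result

    async def _fetch_and_store(self, url: str) -> bytes:
        try:
            body = await self._fetch(url)  # the single origin request
            self._cache[url] = body
            return body
        finally:
            self._inflight.pop(url, None)
```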
Capacity planning for live events
Estimating bandwidth requirements
Start with:
peak_bandwidth = concurrent_viewers × average_bitrate × (1 + overhead_factor)
Where:
- concurrent_viewers is the expected peak (estimate from marketing, pre-registrations, or historical data)
- average_bitrate is the weighted average across ABR rungs (typically 60-70% of the top rung, since not all viewers have enough bandwidth for max quality)
- overhead_factor accounts for manifest requests, retries, and ABR switches (typically 10-15%)
A 5-million-viewer live event with an average bitrate of 4 Mbps needs:
5,000,000 × 4 Mbps × 1.12 = ~22.4 Tbps of edge throughput
No single CDN PoP can serve that volume; the load must be distributed across hundreds of edge locations globally.
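The estimate is simple enough to script. A small sketch of the formula above, with the worked example's 12% overhead as the default:

```python
def peak_bandwidth_tbps(concurrent_viewers: int,
                        average_bitrate_mbps: float,
                        overhead_factor: float = 0.12) -> float:
    """Peak edge throughput in Tbps: viewers x bitrate x (1 + overhead)."""
    total_mbps = concurrent_viewers * average_bitrate_mbps * (1 + overhead_factor)
    return total_mbps / 1_000_000  # Mbps -> Tbps

print(peak_bandwidth_tbps(5_000_000, 4.0))  # ~22.4
```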
Geographic distribution
Estimate viewer distribution by geography. A US-focused event concentrates traffic on US edge PoPs. A global event distributes across all regions. Communicate the expected geographic distribution to your CDN provider so they can position capacity appropriately.
CDN provider coordination
For events above 1 million concurrent viewers, engage your CDN provider’s event engineering team at least 2 weeks in advance. They can:
- Pre-position cache capacity at high-traffic PoPs
- Configure origin shield regions to match your ingest location
- Set up dedicated origin connections if needed
- Provide a dedicated support contact during the event
Multi-tier caching for live
Two-tier vs three-tier
Two-tier (origin + edge): simpler to configure. Works well for events under 1 million viewers or when using a CDN with a dense edge network. The risk is origin overload from edge cache misses.
Three-tier (origin + shield + edge): the shield layer absorbs edge cache misses. Only one request per segment reaches origin (from the shield), regardless of how many edge PoPs need the segment. This is essential for large events.
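A back-of-envelope comparison makes the difference concrete. Assuming 300 edge PoPs (an illustrative figure), 4-second segments, and perfect coalescing within each tier:

```python
edge_pops = 300           # assumption for illustration
segment_duration_s = 4

# Two-tier: every edge PoP misses once per new segment.
two_tier_origin_rps = edge_pops / segment_duration_s    # 75 req/s
# Three-tier: only the shield fetches each new segment.
three_tier_origin_rps = 1 / segment_duration_s          # 0.25 req/s
print(two_tier_origin_rps, three_tier_origin_rps)
```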
Shield placement
Place the CDN shield in the same region as your origin. If your origin is in AWS us-east-1, your shield should be in a CDN PoP in the US East region. This minimises the shield-to-origin round trip.
For global events with multiple origin regions (for redundancy), use region-specific shields, each paired with its local origin.
Manifest caching strategy
Live manifests update every segment duration (2-6 seconds). Cache the manifest at the shield and edge with short TTLs: roughly half the segment duration at the edge, and half that again at the shield. For a 4-second segment duration:
- Manifest TTL at edge: 2 seconds
- Manifest TTL at shield: 1 second
This ensures viewers get a fresh manifest within one segment duration while reducing origin manifest request volume.
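Deriving both TTLs from the segment duration keeps the caching configuration consistent when packaging settings change. A sketch matching the 4-second example above:

```python
def manifest_ttls(segment_duration_s: float) -> dict[str, float]:
    """Live-manifest TTLs derived from segment duration:
    half at the edge, a quarter at the shield."""
    return {
        "edge_ttl_s": segment_duration_s / 2,
        "shield_ttl_s": segment_duration_s / 4,
    }

print(manifest_ttls(4))  # {'edge_ttl_s': 2.0, 'shield_ttl_s': 1.0}
```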
For low-latency streams, manifest caching is tighter. LL-HLS blocking playlist reloads require the CDN to hold the connection until the manifest updates. Not all CDN configurations support this — verify with your CDN provider.
Segment caching strategy
Live segments are immutable once published. Cache them with a long TTL (hours or longer). The segment will never change, so there is no freshness concern. The only reason to limit segment TTL is storage capacity at the edge, which is rarely a constraint for live content (the segment count is bounded by the DVR window).
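One way to express both the manifest and segment policies is in the origin's response headers. The extensions and TTLs below are illustrative, not a universal rule:

```python
def cache_control_for(path: str, segment_duration_s: int = 4) -> str:
    """Illustrative origin Cache-Control policy for live streaming."""
    if path.endswith((".m3u8", ".mpd")):
        # Manifests: short TTL, half the segment duration.
        return f"public, max-age={max(1, segment_duration_s // 2)}"
    if path.endswith((".ts", ".m4s", ".mp4")):
        # Segments never change once published: cache for a day.
        return "public, max-age=86400, immutable"
    return "no-store"

print(cache_control_for("/live/chunk_001.m4s"))
```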
Origin protection
The origin is the single point of failure in a live CDN architecture. Protecting it is critical.
Rate limiting
Configure origin-side rate limits per edge PoP connection. The CDN shield should be the only entity making frequent requests to origin. Direct origin requests from outside the CDN should be blocked or severely rate-limited.
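A sketch of that origin-side gate, assuming a known set of shield addresses (the IPs below are placeholders; real deployments use the CDN provider's published ranges):

```python
import time
from collections import defaultdict

SHIELD_IPS = {"203.0.113.10", "203.0.113.11"}  # placeholder addresses

class OriginGate:
    """Shield traffic passes freely; everyone else gets a strict
    per-IP token bucket."""

    def __init__(self, rate_per_s: float = 1.0, burst: int = 5):
        self.rate, self.burst = rate_per_s, burst
        self.tokens = defaultdict(lambda: float(burst))
        self.last = defaultdict(time.monotonic)

    def allow(self, client_ip: str) -> bool:
        if client_ip in SHIELD_IPS:
            return True
        now = time.monotonic()
        elapsed = now - self.last[client_ip]
        self.last[client_ip] = now
        self.tokens[client_ip] = min(self.burst,
                                     self.tokens[client_ip] + elapsed * self.rate)
        if self.tokens[client_ip] >= 1.0:
            self.tokens[client_ip] -= 1.0
            return True
        return False
```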
Health checks and failover
Run active health checks against the origin from the CDN shield. If the primary origin fails health checks, automatically failover to a secondary origin in a different region or availability zone.
Health check interval: every 5-10 seconds for live. Failover threshold: 2-3 consecutive failures. Failback: automatic once the primary origin passes health checks again.
Redundant origin
For high-value live events, run a redundant origin in a separate availability zone or region. Both origins receive the same encoder output (via redundant ingest paths). The CDN shield uses the primary origin and fails over to the secondary on health check failure.
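A minimal monitor implementing the interval, threshold, and failback policy above. The health URLs are placeholders, and where this sketch prints, a real shield would update its routing configuration:

```python
import time
import urllib.request

ORIGINS = ["https://origin-primary.example.com/health",    # placeholders
           "https://origin-secondary.example.com/health"]

def healthy(url: str, timeout: float = 2.0) -> bool:
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

def monitor(interval_s: float = 5.0, fail_threshold: int = 3) -> None:
    active, failures = 0, 0
    while True:
        if healthy(ORIGINS[0]):
            failures, active = 0, 0       # automatic failback to primary
        else:
            failures += 1
            if failures >= fail_threshold:
                active = 1                # fail over to secondary
        print(f"serving from: {ORIGINS[active]}")
        time.sleep(interval_s)
```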
Real-time monitoring during live events
Key metrics to watch
During a live event, monitor these in real time (updated every 10-30 seconds):
- Concurrent viewers — is the audience tracking your expectations?
- Edge bandwidth per region — is any region approaching its capacity ceiling?
- Cache hit ratio at edge — should be 99%+ after the first segment. Drops indicate configuration issues.
- Origin request rate — should be low and stable. Spikes indicate shield or coalescing problems.
- Segment download time (P95) — rising P95 indicates CDN throughput pressure.
- Rebuffering ratio — the most important viewer-facing metric.
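These metrics lend themselves to automated threshold alerts. The limits below are illustrative starting points for a 4-second-segment stream, not universal targets:

```python
THRESHOLDS = {
    "edge_cache_hit_ratio":   ("min", 0.99),
    "origin_requests_per_s":  ("max", 500),   # assumption: sized to origin
    "segment_download_p95_s": ("max", 2.0),
    "rebuffering_ratio":      ("max", 0.005),
}

def check(metrics: dict[str, float]) -> list[str]:
    """Return the breached thresholds for one metrics snapshot."""
    alerts = []
    for name, (kind, limit) in THRESHOLDS.items():
        value = metrics.get(name)
        if value is None:
            continue
        if (kind == "min" and value < limit) or (kind == "max" and value > limit):
            alerts.append(f"{name}={value} breaches {kind} {limit}")
    return alerts

print(check({"edge_cache_hit_ratio": 0.97, "rebuffering_ratio": 0.001}))
```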
War room protocol
For events above 500K concurrent viewers, run a war room during the event:
- Engineering team monitoring CDN, origin, and player metrics
- CDN provider on standby (dedicated Slack channel or phone bridge)
- Runbook with pre-defined actions for common scenarios:
  - CDN edge overload → shift traffic to backup CDN
  - Origin failure → verify failover and confirm secondary origin is serving
  - Regional outage → reroute affected viewers to nearest healthy region
  - Encoder failure → switch to backup encoder feed
Multi-CDN for live events
For events at scale, multi-CDN delivery provides both capacity and resilience:
- Active-active distribution across two CDNs ensures no single CDN failure affects all viewers
- DNS-based steering distributes viewers across CDNs based on geography and CDN health
- Client-side CDN switching provides the fastest failover: the player detects segment download failures and switches to an alternative CDN endpoint
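A sketch of that client-side switching logic. The hostnames are placeholders, and a real player would wire this into its segment loader and retry policy:

```python
class CdnSelector:
    """Switch to the next CDN after consecutive segment failures,
    rather than on a single transient error."""

    def __init__(self, hosts: list[str], max_failures: int = 2):
        self.hosts, self.max_failures = hosts, max_failures
        self.active, self.failures = 0, 0

    def segment_url(self, path: str) -> str:
        return f"https://{self.hosts[self.active]}{path}"

    def report_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.max_failures:
            self.active = (self.active + 1) % len(self.hosts)
            self.failures = 0

    def report_success(self) -> None:
        self.failures = 0

selector = CdnSelector(["cdn-a.example.com", "cdn-b.example.com"])
print(selector.segment_url("/live/chunk_001.m4s"))
```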
The overhead of multi-CDN is operational complexity: managing configurations, monitoring, and contracts with multiple providers. But for events where a single widespread rebuffering incident costs hundreds of thousands of dollars in advertiser guarantees, the insurance is worth it.
Post-event analysis
After every live event, conduct a retrospective:
- QoE summary: startup time distribution, rebuffering ratio, failure rate, quality distribution
- CDN performance: cache hit ratio, origin offload, segment download times per CDN per region
- Incidents: what went wrong, how it was detected, how fast it was resolved
- Capacity validation: did actual viewership match estimates? Were there capacity-related issues?
- Improvement actions: what to fix before the next event
Each live event is a learning opportunity. The data from this event improves the architecture, monitoring, and operational readiness for the next one. For broader delivery optimisation, see our video delivery performance solutions.