Accessibility in OTT is not optional. Regulatory requirements (the CVAA, enforced by the FCC, in the US; the European Accessibility Act in the EU) mandate closed captions and audio description for qualifying content. Beyond compliance, good caption and localization implementation expands your audience: captions serve viewers who are deaf or hard of hearing, but they are also used by viewers in noisy environments, viewers watching in a non-native language, and viewers who simply prefer reading along.
This guide covers the practical engineering of captions, subtitles, audio description, and multi-language support for streaming apps across connected TV platforms.
Captions vs subtitles: the technical distinction
Closed captions include all audible information: dialogue, speaker identification, sound effects (“[door slams]”), music descriptions (“[tense music]”), and other non-speech audio cues. They are designed for viewers who cannot hear the audio. In the US, closed captions are legally required for most video content that was previously broadcast on television.
Subtitles are a translation of dialogue only. They assume the viewer can hear sound effects and music but needs the dialogue in a different language. Subtitles do not typically include non-speech audio descriptions.
In practice, many streaming services use “subtitles” as a catch-all term for both. The technical implementation is the same — timed text overlaid on video — but the content and regulatory requirements differ.
Caption and subtitle formats
WebVTT
WebVTT (Web Video Text Tracks) is the standard format for web-based players and HLS delivery. It is a plain-text format with timestamps and cue text:
WEBVTT

00:00:01.000 --> 00:00:04.000
Welcome to the demonstration.

00:00:05.500 --> 00:00:08.200
[ambient music playing]

00:00:09.000 --> 00:00:12.500
Today we'll cover the basics
of adaptive streaming.
WebVTT supports:
- Positioning (top, bottom, left, right)
- Styling via CSS (the ::cue pseudo-element)
- Speaker identification
- Vertical text (for CJK languages)
WebVTT is the recommended format for OTT delivery because of its broad player and platform support.
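On browser-based TV platforms, a WebVTT sidecar can be attached to an HTML5 video element with a track element. A minimal sketch, with hypothetical file paths:

const video = document.createElement("video");
video.src = "content.mp4"; // hypothetical media URL

const track = document.createElement("track");
track.kind = "captions"; // "captions" carries non-speech cues; "subtitles" is dialogue-only
track.label = "English (CC)";
track.srclang = "en";
track.src = "captions_en.vtt"; // hypothetical sidecar path
video.appendChild(track);

// Text tracks load in the "disabled" state; enable rendering explicitly.
video.textTracks[0].mode = "showing";
document.body.appendChild(video);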
TTML (Timed Text Markup Language)
TTML is XML-based and more expressive than WebVTT. It supports fine-grained styling, layout regions, and timing models. DASH manifests typically reference TTML sidecars.
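For comparison with the WebVTT example above, a minimal TTML document carrying the same first cue looks roughly like this:

<?xml version="1.0" encoding="UTF-8"?>
<tt xmlns="http://www.w3.org/ns/ttml" xml:lang="en">
  <body>
    <div>
      <p begin="00:00:01.000" end="00:00:04.000">Welcome to the demonstration.</p>
    </div>
  </body>
</tt>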
TTML is more complex to parse and render than WebVTT. On smart TV platforms with constrained JavaScript performance, TTML parsing can be slower. Use TTML when your DASH player requires it, and WebVTT for HLS and web-based players.
EBU-TT-D
EBU-TT-D is a TTML profile used in European broadcasting. If your content originates from European broadcasters, you may receive captions in this format. Convert to WebVTT for HLS delivery and use EBU-TT-D directly for DASH where the player supports it.
608/708 embedded captions
US broadcast content often includes CEA-608 and CEA-708 captions embedded in the video stream’s SEI (Supplemental Enhancement Information) data. These are carried within the video elementary stream, not as separate sidecar files.
For OTT delivery, extract 608/708 captions during packaging and convert to WebVTT sidecars. Most packagers (Shaka Packager, AWS MediaConvert, Harmonic) handle this conversion.
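If you need to extract embedded 608 captions outside a packager, FFmpeg's lavfi closed-caption decoder is one common approach. This one-liner is illustrative and the input name is hypothetical:

ffmpeg -f lavfi -i "movie=input.ts[out0+subcc]" -map 0:s:0 captions.vtt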
Delivery architecture
Sidecar vs in-band
Sidecar delivery packages captions as separate files referenced in the manifest. The player downloads caption files independently of video segments.
- HLS: #EXT-X-MEDIA:TYPE=SUBTITLES in the manifest, pointing to WebVTT segment playlists
- DASH: an AdaptationSet with contentType="text", pointing to TTML or WebVTT segments
In-band delivery embeds captions within the video or audio segments. CEA-608/708 in the video stream is an example. Some implementations embed WebVTT cues within fMP4 segments.
Sidecar delivery is simpler, more compatible, and easier to update (you can fix a caption file without re-encoding the video). Use sidecar delivery as the default.
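At the player level, libraries surface sidecar renditions for selection. A sketch using hls.js, assuming it is your web player; the URL and track index are illustrative:

import Hls from "hls.js";

const video = document.querySelector("video") as HTMLVideoElement;
const hls = new Hls();
hls.loadSource("https://example.com/master.m3u8"); // hypothetical manifest URL
hls.attachMedia(video);

hls.on(Hls.Events.MANIFEST_PARSED, () => {
  // Each entry corresponds to an #EXT-X-MEDIA:TYPE=SUBTITLES rendition.
  hls.subtitleTracks.forEach((t, i) => console.log(i, t.lang, t.name));
  hls.subtitleTrack = 0; // select the first rendition
  hls.subtitleDisplay = true; // render cues on screen
});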
Segmented vs monolithic caption files
For VOD, you can deliver the entire caption track as a single file or segment it to match the video segments:
- Monolithic: one WebVTT file for the entire content. The player downloads it at startup. Simple, but the file can be large for long content (2+ hours).
- Segmented: WebVTT files segmented to match video segment boundaries. The player downloads caption segments alongside video segments. Better for live content and for reducing initial download size.
For live streaming, segmented captions are required since the full caption track does not exist yet. For VOD, either approach works. Segmented is preferred for consistency with the rest of the ABR delivery model.
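For reference, a segmented subtitle rendition is itself a media playlist of short WebVTT files; the segment names here are hypothetical:

#EXTM3U
#EXT-X-VERSION:3
#EXT-X-TARGETDURATION:6
#EXT-X-MEDIA-SEQUENCE:0
#EXTINF:6.000,
subs_en_0000.vtt
#EXTINF:6.000,
subs_en_0001.vtt
#EXT-X-ENDLIST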
Rendering on connected TV platforms
Platform-specific rendering behavior
Each TV platform renders captions differently:
Roku: Roku’s native Video node handles WebVTT rendering. The system caption settings (font, size, color, background) override the app’s styling. Users configure caption appearance in Roku’s system settings, and your app must respect those settings.
Samsung Tizen: when using the AVPlay API, Samsung handles caption rendering natively. When using MSE-based players, the web app renders captions via the player library’s DOM-based renderer. Be aware of Tizen’s Chromium engine differences that may affect CSS rendering of caption overlays.
Google TV: ExoPlayer (Media3) renders captions natively with system caption settings. The system provides a CaptioningManager API for reading user preferences (font scale, color, background).
LG webOS: similar to Samsung — native rendering for the system player, web-based rendering for MSE players.
Respect system caption settings
US regulations (CVAA) require that streaming apps honor the viewer’s system-level caption preferences for:
- Font size and scaling
- Font color and opacity
- Background color and opacity
- Window color
- Edge style (drop shadow, raised, depressed)
On each platform, query the system caption settings API and apply them to your caption renderer. Do not override user preferences with your own styling.
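On platforms where your app renders cues in the DOM, honoring these settings means translating the platform's preference values into CSS on the caption container. A sketch, with the preference shape normalized from whichever settings API the platform provides (the field names here are hypothetical):

// Hypothetical normalized shape of user caption preferences, filled from a
// platform-specific API (e.g., CaptioningManager on Google TV).
interface CaptionPrefs {
  fontScale: number;        // 1.0 = platform default size
  textColor: string;        // e.g., "#ffff00"
  backgroundColor: string;  // e.g., "rgba(0, 0, 0, 0.75)"
  edgeStyle: "none" | "drop-shadow" | "raised" | "depressed";
}

// Apply user preferences to a DOM-based caption overlay.
// Never hard-code app styling over these values.
function applyCaptionPrefs(el: HTMLElement, prefs: CaptionPrefs): void {
  el.style.fontSize = `${prefs.fontScale * 100}%`;
  el.style.color = prefs.textColor;
  el.style.backgroundColor = prefs.backgroundColor;
  if (prefs.edgeStyle === "drop-shadow") {
    el.style.textShadow = "0.1em 0.1em 0.2em rgba(0, 0, 0, 0.8)";
  } else {
    el.style.textShadow = "none"; // raised/depressed need platform-specific shadows
  }
}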
Audio description
Audio description (also called video description or described video) is a separate audio track that narrates visual information for viewers who are blind or have low vision. A narrator describes on-screen actions, scene changes, character expressions, and visual details during pauses in the dialogue.
Implementation
Audio description is delivered as an additional audio track in the manifest:
- HLS: #EXT-X-MEDIA:TYPE=AUDIO with CHARACTERISTICS="public.accessibility.describes-video"
- DASH: an AdaptationSet with <Role schemeIdUri="urn:mpeg:dash:role:2011" value="description"/>
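Expanded into manifest form, the DASH side might look like this minimal sketch (ids and bitrate are illustrative; segment addressing omitted):

<AdaptationSet contentType="audio" lang="en" mimeType="audio/mp4" codecs="mp4a.40.2">
  <Role schemeIdUri="urn:mpeg:dash:role:2011" value="description"/>
  <Representation id="audio-en-ad" bandwidth="128000"/>
</AdaptationSet>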
The player lets the viewer select the audio description track in place of the standard audio track. The video stream is unchanged; only the audio mix differs.
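Where the engine exposes native audio tracks (Safari and some TV browsers), the describes-video characteristic surfaces as the track's kind. A sketch, noting that audioTracks support varies by device:

// Prefer the audio description track when the viewer has AD enabled.
function enableAudioDescription(videoEl: HTMLVideoElement): boolean {
  const tracks = (videoEl as any).audioTracks; // not available on every engine
  if (!tracks) return false;
  for (let i = 0; i < tracks.length; i++) {
    // Tracks flagged CHARACTERISTICS="public.accessibility.describes-video"
    // are exposed with kind "descriptions" (or "main-desc" for a full mix).
    if (tracks[i].kind === "descriptions" || tracks[i].kind === "main-desc") {
      for (let j = 0; j < tracks.length; j++) tracks[j].enabled = false;
      tracks[i].enabled = true;
      return true;
    }
  }
  return false;
}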
Production considerations
Audio description must be authored and produced — it is not auto-generated. Professional audio description requires:
- A narrator with clear, neutral delivery
- Timing that fits between dialogue pauses
- Descriptive content that adds meaning without overlapping speech
For your video pipeline, this means receiving an additional audio track per language that includes the audio description mix, encoding it alongside other audio tracks, and referencing it in the manifest with the correct accessibility markers.
Multi-language subtitle and audio management
Content localization workflow
For each piece of content, the localization pipeline produces:
- Subtitle files for each target language (WebVTT format)
- Dubbed audio tracks for each target language (where applicable)
- Forced subtitles for foreign-language dialogue within an otherwise native-language track (e.g., subtitles for alien speech in an English movie)
Manifest structure for multi-language
An HLS manifest for a well-localized title might include:
#EXT-X-MEDIA:TYPE=AUDIO,GROUP-ID="audio",NAME="English",LANGUAGE="en",DEFAULT=YES,URI="audio_en.m3u8"
#EXT-X-MEDIA:TYPE=AUDIO,GROUP-ID="audio",NAME="Spanish",LANGUAGE="es",URI="audio_es.m3u8"
#EXT-X-MEDIA:TYPE=AUDIO,GROUP-ID="audio",NAME="English (Audio Description)",LANGUAGE="en",CHARACTERISTICS="public.accessibility.describes-video",URI="audio_en_ad.m3u8"
#EXT-X-MEDIA:TYPE=SUBTITLES,GROUP-ID="subs",NAME="English (CC)",LANGUAGE="en",CHARACTERISTICS="public.accessibility.transcribes-spoken-dialog",DEFAULT=YES,FORCED=NO,URI="subs_en.m3u8"
#EXT-X-MEDIA:TYPE=SUBTITLES,GROUP-ID="subs",NAME="Spanish",LANGUAGE="es",FORCED=NO,URI="subs_es.m3u8"
#EXT-X-MEDIA:TYPE=SUBTITLES,GROUP-ID="subs",NAME="English (Forced)",LANGUAGE="en",FORCED=YES,URI="subs_en_forced.m3u8"
Player language selection
On connected TV platforms, read the system language preference and select the matching audio and subtitle tracks by default:
- If the system language matches an available audio track, select it as default
- If the viewer’s preferred subtitle language is set, enable subtitles in that language
- Store the viewer’s language preference in your app and restore it across sessions
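A sketch of that default-selection logic, matching on the primary language subtag so that "en-US" and "en-GB" both match "en" (the helper names are hypothetical; the track lists come from your player):

interface Track { lang: string; name: string }

const primary = (tag: string) => tag.toLowerCase().split("-")[0];

function pickByLanguage(tracks: Track[], lang: string): Track | undefined {
  return tracks.find((t) => primary(t.lang) === primary(lang));
}

// Restore a stored preference first, then fall back to the system language.
function defaultAudioTrack(tracks: Track[], systemLang: string): Track | undefined {
  const stored = localStorage.getItem("preferredAudioLang"); // app-defined key
  return (stored ? pickByLanguage(tracks, stored) : undefined)
    ?? pickByLanguage(tracks, systemLang);
}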
Testing accessibility
Caption accuracy testing
- Verify caption timing matches audio (within 100 ms tolerance; see the sketch after this list)
- Verify speaker identification is present for multi-speaker content
- Verify non-speech audio descriptions are included (for closed captions)
- Test caption rendering with all system style settings (large font, yellow text on black, etc.)
- Test on actual TV hardware — caption rendering on a 55-inch screen is different from a laptop
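The timing check in the first item above can be automated against a reference transcript. A minimal sketch; the cue shape and data source are assumptions:

// Flag cues whose start time drifts more than 100 ms from a reference.
// Times are in seconds.
interface TimedCue { start: number; text: string }

function findTimingDrift(
  reference: TimedCue[],
  actual: TimedCue[],
  toleranceSec = 0.1,
): Array<{ index: number; driftSec: number }> {
  const drifted: Array<{ index: number; driftSec: number }> = [];
  actual.forEach((cue, i) => {
    const ref = reference[i];
    if (!ref) return; // cue counts differ; surface that separately in a real harness
    const drift = Math.abs(cue.start - ref.start);
    if (drift > toleranceSec) drifted.push({ index: i, driftSec: drift });
  });
  return drifted;
}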
Audio description testing
- Verify the audio description track plays correctly when selected
- Verify the narration does not overlap with dialogue
- Verify the track is correctly labeled in the player UI
Multi-language testing
- Verify all subtitle languages display correctly (especially non-Latin scripts: CJK, Arabic, Hebrew)
- Verify audio track switching works without playback interruption
- Verify forced subtitles appear automatically when foreign dialogue occurs
- Test on each target platform — font rendering for non-Latin scripts varies by device
Accessibility is an ongoing commitment, not a one-time checkbox. Each new piece of content needs captions, each platform update may change rendering behavior, and regulatory requirements evolve. Build accessibility into your release process from the start rather than retrofitting it later.