Accessibility in OTT is not optional. Regulatory requirements (the CVAA, enforced by the FCC, in the US; the European Accessibility Act in the EU) mandate closed captions and audio description for qualifying content. Beyond compliance, good caption and localization implementation expands your audience: captions serve viewers who are deaf or hard of hearing, but they are also used by viewers in noisy environments, viewers watching in a non-native language, and viewers who simply prefer reading along.
This guide covers the practical engineering of captions, subtitles, audio description, and multi-language support for streaming apps across connected TV platforms.
Captions vs subtitles: the technical distinction
Closed captions include all audible information: dialogue, speaker identification, sound effects (“[door slams]”), music descriptions (“[tense music]”), and other non-speech audio cues. They are designed for viewers who cannot hear the audio. In the US, closed captions are legally required for most video content that was previously broadcast on television.
Subtitles are a translation of dialogue only. They assume the viewer can hear sound effects and music but needs the dialogue in a different language. Subtitles do not typically include non-speech audio descriptions.
In practice, many streaming services use “subtitles” as a catch-all term for both. The technical implementation is the same — timed text overlaid on video — but the content and regulatory requirements differ.
Caption and subtitle formats
WebVTT
WebVTT (Web Video Text Tracks) is the standard format for web-based players and HLS delivery. It is a plain-text format with timestamps and cue text:
WEBVTT

00:00:01.000 --> 00:00:04.000
Welcome to the demonstration.

00:00:05.500 --> 00:00:08.200
[ambient music playing]

00:00:09.000 --> 00:00:12.500
Today we'll cover the basics
of adaptive streaming.
WebVTT supports:
- Positioning (top, bottom, left, right)
- Styling via CSS (the ::cue pseudo-element)
- Speaker identification
- Vertical text (for CJK languages)
WebVTT is the recommended format for OTT delivery because of its broad player and platform support.
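On browser-based TV platforms, a WebVTT sidecar can be attached to an HTML5 video element with a track element. A minimal sketch, with hypothetical file paths:

const video = document.createElement("video");
video.src = "content.mp4"; // hypothetical media URL

const track = document.createElement("track");
track.kind = "captions"; // "captions" carries non-speech cues; "subtitles" is dialogue-only
track.label = "English (CC)";
track.srclang = "en";
track.src = "captions_en.vtt"; // hypothetical sidecar path
video.appendChild(track);

// Text tracks load in the "disabled" state; enable rendering explicitly.
video.textTracks[0].mode = "showing";
document.body.appendChild(video);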
TTML (Timed Text Markup Language)
TTML is XML-based and more expressive than WebVTT. It supports fine-grained styling, layout regions, and timing models. DASH manifests typically reference TTML sidecars.
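For comparison with the WebVTT example above, a minimal TTML document carrying the same first cue looks roughly like this:

<?xml version="1.0" encoding="UTF-8"?>
<tt xmlns="http://www.w3.org/ns/ttml" xml:lang="en">
  <body>
    <div>
      <p begin="00:00:01.000" end="00:00:04.000">Welcome to the demonstration.</p>
    </div>
  </body>
</tt>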
TTML is more complex to parse and render than WebVTT. On smart TV platforms with constrained JavaScript performance, TTML parsing can be slower. Use TTML when your DASH player requires it, and WebVTT for HLS and web-based players.
EBU-TT-D
EBU-TT-D is a TTML profile used in European broadcasting. If your content originates from European broadcasters, you may receive captions in this format. Convert to WebVTT for HLS delivery and use EBU-TT-D directly for DASH where the player supports it.
608/708 embedded captions
US broadcast content often includes CEA-608 and CEA-708 captions embedded in the video stream’s SEI (Supplemental Enhancement Information) data. These are carried within the video elementary stream, not as separate sidecar files.
For OTT delivery, extract 608/708 captions during packaging and convert to WebVTT sidecars. Most packagers (Shaka Packager, AWS MediaConvert, Harmonic) handle this conversion.
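If you need to extract embedded 608 captions outside a packager, FFmpeg's lavfi closed-caption decoder is one common approach. This one-liner is illustrative and the input name is hypothetical:

ffmpeg -f lavfi -i "movie=input.ts[out0+subcc]" -map 0:s:0 captions.vtt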
Delivery architecture
Sidecar vs in-band
Sidecar delivery packages captions as separate files referenced in the manifest. The player downloads caption files independently of video segments.
- HLS: #EXT-X-MEDIA:TYPE=SUBTITLES in the manifest, pointing to WebVTT segment playlists
- DASH: an AdaptationSet with contentType="text", pointing to TTML or WebVTT segments
In-band delivery embeds captions within the video or audio segments. CEA-608/708 in the video stream is an example. Some implementations embed WebVTT cues within fMP4 segments.
Sidecar delivery is simpler, more compatible, and easier to update (you can fix a caption file without re-encoding the video). Use sidecar delivery as the default.
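At the player level, libraries surface sidecar renditions for selection. A sketch using hls.js, assuming it is your web player; the URL and track index are illustrative:

import Hls from "hls.js";

const video = document.querySelector("video") as HTMLVideoElement;
const hls = new Hls();
hls.loadSource("https://example.com/master.m3u8"); // hypothetical manifest URL
hls.attachMedia(video);

hls.on(Hls.Events.MANIFEST_PARSED, () => {
  // Each entry corresponds to an #EXT-X-MEDIA:TYPE=SUBTITLES rendition.
  hls.subtitleTracks.forEach((t, i) => console.log(i, t.lang, t.name));
  hls.subtitleTrack = 0; // select the first rendition
  hls.subtitleDisplay = true; // render cues on screen
});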
Segmented vs monolithic caption files
For VOD, you can deliver the entire caption track as a single file or segment it to match the video segments:
- Monolithic: one WebVTT file for the entire content. The player downloads it at startup. Simple, but the file can be large for long content (2+ hours).
- Segmented: WebVTT files segmented to match video segment boundaries. The player downloads caption segments alongside video segments. Better for live content and for reducing initial download size.
For live streaming, segmented captions are required since the full caption track does not exist yet. For VOD, either approach works. Segmented is preferred for consistency with the rest of the ABR delivery model.
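For reference, a segmented subtitle rendition is itself a media playlist of short WebVTT files; the segment names here are hypothetical:

#EXTM3U
#EXT-X-VERSION:3
#EXT-X-TARGETDURATION:6
#EXT-X-MEDIA-SEQUENCE:0
#EXTINF:6.000,
subs_en_0000.vtt
#EXTINF:6.000,
subs_en_0001.vtt
#EXT-X-ENDLIST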
Rendering on connected TV platforms
Platform-specific rendering behavior
Each TV platform renders captions differently:
Roku: Roku’s native Video node handles WebVTT rendering. The system caption settings (font, size, color, background) override the app’s styling. Users configure caption appearance in Roku’s system settings, and your app must respect those settings.
Samsung Tizen: when using the AVPlay API, Samsung handles caption rendering natively. When using MSE-based players, the web app renders captions via the player library’s DOM-based renderer. Be aware of Tizen’s Chromium engine differences that may affect CSS rendering of caption overlays.
Google TV: ExoPlayer (Media3) renders captions natively with system caption settings. The system provides a CaptioningManager API for reading user preferences (font scale, color, background).
LG webOS: similar to Samsung — native rendering for the system player, web-based rendering for MSE players.
Respect system caption settings
US regulations (CVAA) require that streaming apps honor the viewer’s system-level caption preferences for:
- Font size and scaling
- Font color and opacity
- Background color and opacity
- Window color
- Edge style (drop shadow, raised, depressed)
On each platform, query the system caption settings API and apply them to your caption renderer. Do not override user preferences with your own styling.
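On platforms where your app renders cues in the DOM, honoring these settings means translating the platform's preference values into CSS on the caption container. A sketch, with the preference shape normalized from whichever settings API the platform provides (the field names here are hypothetical):

// Hypothetical normalized shape of user caption preferences, filled from a
// platform-specific API (e.g., CaptioningManager on Google TV).
interface CaptionPrefs {
  fontScale: number;        // 1.0 = platform default size
  textColor: string;        // e.g., "#ffff00"
  backgroundColor: string;  // e.g., "rgba(0, 0, 0, 0.75)"
  edgeStyle: "none" | "drop-shadow" | "raised" | "depressed";
}

// Apply user preferences to a DOM-based caption overlay.
// Never hard-code app styling over these values.
function applyCaptionPrefs(el: HTMLElement, prefs: CaptionPrefs): void {
  el.style.fontSize = `${prefs.fontScale * 100}%`;
  el.style.color = prefs.textColor;
  el.style.backgroundColor = prefs.backgroundColor;
  if (prefs.edgeStyle === "drop-shadow") {
    el.style.textShadow = "0.1em 0.1em 0.2em rgba(0, 0, 0, 0.8)";
  } else {
    el.style.textShadow = "none"; // raised/depressed need platform-specific shadows
  }
}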
Audio description
Audio description (also called video description or described video) is a separate audio track that narrates visual information for viewers who are blind or have low vision. A narrator describes on-screen actions, scene changes, character expressions, and visual details during pauses in the dialogue.
Implementation
Audio description is delivered as an additional audio track in the manifest:
- HLS: #EXT-X-MEDIA:TYPE=AUDIO with CHARACTERISTICS="public.accessibility.describes-video"
- DASH: an AdaptationSet with <Role schemeIdUri="urn:mpeg:dash:role:2011" value="description"/>
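Expanded into manifest form, the DASH side might look like this minimal sketch (ids and bitrate are illustrative; segment addressing omitted):

<AdaptationSet contentType="audio" lang="en" mimeType="audio/mp4" codecs="mp4a.40.2">
  <Role schemeIdUri="urn:mpeg:dash:role:2011" value="description"/>
  <Representation id="audio-en-ad" bandwidth="128000"/>
</AdaptationSet>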
The player lets the viewer select the audio description track in place of the standard audio track. The video stream is unchanged; only the audio mix differs.
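Where the engine exposes native audio tracks (Safari and some TV browsers), the describes-video characteristic surfaces as the track's kind. A sketch, noting that audioTracks support varies by device:

// Prefer the audio description track when the viewer has AD enabled.
function enableAudioDescription(videoEl: HTMLVideoElement): boolean {
  const tracks = (videoEl as any).audioTracks; // not available on every engine
  if (!tracks) return false;
  for (let i = 0; i < tracks.length; i++) {
    // Tracks flagged CHARACTERISTICS="public.accessibility.describes-video"
    // are exposed with kind "descriptions" (or "main-desc" for a full mix).
    if (tracks[i].kind === "descriptions" || tracks[i].kind === "main-desc") {
      for (let j = 0; j < tracks.length; j++) tracks[j].enabled = false;
      tracks[i].enabled = true;
      return true;
    }
  }
  return false;
}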
Production considerations
Audio description must be authored and produced — it is not auto-generated. Professional audio description requires:
- A narrator with clear, neutral delivery
- Timing that fits between dialogue pauses
- Descriptive content that adds meaning without overlapping speech
For your video pipeline, this means receiving an additional audio track per language that includes the audio description mix, encoding it alongside other audio tracks, and referencing it in the manifest with the correct accessibility markers.
Multi-language subtitle and audio management
Content localization workflow
For each piece of content, the localization pipeline produces:
- Subtitle files for each target language (WebVTT format)
- Dubbed audio tracks for each target language (where applicable)
- Forced subtitles for foreign-language dialogue within an otherwise native-language track (e.g., subtitles for alien speech in an English movie)
Manifest structure for multi-language
An HLS manifest for a well-localized title might include:
#EXT-X-MEDIA:TYPE=AUDIO,GROUP-ID="audio",NAME="English",LANGUAGE="en",DEFAULT=YES,URI="audio_en.m3u8"
#EXT-X-MEDIA:TYPE=AUDIO,GROUP-ID="audio",NAME="Spanish",LANGUAGE="es",URI="audio_es.m3u8"
#EXT-X-MEDIA:TYPE=AUDIO,GROUP-ID="audio",NAME="English (Audio Description)",LANGUAGE="en",CHARACTERISTICS="public.accessibility.describes-video",URI="audio_en_ad.m3u8"
#EXT-X-MEDIA:TYPE=SUBTITLES,GROUP-ID="subs",NAME="English (CC)",LANGUAGE="en",CHARACTERISTICS="public.accessibility.transcribes-spoken-dialog",DEFAULT=YES,FORCED=NO,URI="subs_en.m3u8"
#EXT-X-MEDIA:TYPE=SUBTITLES,GROUP-ID="subs",NAME="Spanish",LANGUAGE="es",FORCED=NO,URI="subs_es.m3u8"
#EXT-X-MEDIA:TYPE=SUBTITLES,GROUP-ID="subs",NAME="English (Forced)",LANGUAGE="en",FORCED=YES,URI="subs_en_forced.m3u8"
Player language selection
On connected TV platforms, read the system language preference and select the matching audio and subtitle tracks by default:
- If the system language matches an available audio track, select it as default
- If the viewer’s preferred subtitle language is set, enable subtitles in that language
- Store the viewer’s language preference in your app and restore it across sessions
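A sketch of that default-selection logic, matching on the primary language subtag so that "en-US" and "en-GB" both match "en" (the helper names are hypothetical; the track lists come from your player):

interface Track { lang: string; name: string }

const primary = (tag: string) => tag.toLowerCase().split("-")[0];

function pickByLanguage(tracks: Track[], lang: string): Track | undefined {
  return tracks.find((t) => primary(t.lang) === primary(lang));
}

// Restore a stored preference first, then fall back to the system language.
function defaultAudioTrack(tracks: Track[], systemLang: string): Track | undefined {
  const stored = localStorage.getItem("preferredAudioLang"); // app-defined key
  return (stored ? pickByLanguage(tracks, stored) : undefined)
    ?? pickByLanguage(tracks, systemLang);
}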
Testing accessibility
Caption accuracy testing
- Verify caption timing matches audio (within 100 ms tolerance; see the sketch after this list)
- Verify speaker identification is present for multi-speaker content
- Verify non-speech audio descriptions are included (for closed captions)
- Test caption rendering with all system style settings (large font, yellow text on black, etc.)
- Test on actual TV hardware — caption rendering on a 55-inch screen is different from a laptop
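The timing check in the first item above can be automated against a reference transcript. A minimal sketch; the cue shape and data source are assumptions:

// Flag cues whose start time drifts more than 100 ms from a reference.
// Times are in seconds.
interface TimedCue { start: number; text: string }

function findTimingDrift(
  reference: TimedCue[],
  actual: TimedCue[],
  toleranceSec = 0.1,
): Array<{ index: number; driftSec: number }> {
  const drifted: Array<{ index: number; driftSec: number }> = [];
  actual.forEach((cue, i) => {
    const ref = reference[i];
    if (!ref) return; // cue counts differ; surface that separately in a real harness
    const drift = Math.abs(cue.start - ref.start);
    if (drift > toleranceSec) drifted.push({ index: i, driftSec: drift });
  });
  return drifted;
}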
Audio description testing
- Verify the audio description track plays correctly when selected
- Verify the narration does not overlap with dialogue
- Verify the track is correctly labeled in the player UI
Multi-language testing
- Verify all subtitle languages display correctly (especially non-Latin scripts: CJK, Arabic, Hebrew)
- Verify audio track switching works without playback interruption
- Verify forced subtitles appear automatically when foreign dialogue occurs
- Test on each target platform — font rendering for non-Latin scripts varies by device
Accessibility is an ongoing commitment, not a one-time checkbox. Each new piece of content needs captions, each platform update may change rendering behavior, and regulatory requirements evolve. Build accessibility into your release process from the start rather than retrofitting it later.