Video is unindexed by default
Most organizations sit on hundreds or thousands of hours of recorded video — sermons, lectures, training material, webinars, internal meetings, customer calls, podcast episodes, recorded courses. The footage is valuable. Almost none of it is searchable. The closest thing to an index is the file name and, if the team has been diligent, a paragraph of description.
When somebody wants to find "the part where we discussed the new pricing model," the actual workflow is: ask a colleague who might remember, scrub through likely files, or give up. The information exists. The retrieval system does not.
Keyword search over transcripts helps a little. If the speaker said "pricing model" verbatim, you find it. If they said "how we charge for the product," you do not — even though that is plainly the same topic. Keyword search is matching strings. The user is asking about meaning.
The architecture that closes this gap is semantic search over transcripts, built on top of the same RAG pattern that powers document Q&A. The pieces are well known individually. What matters is how they fit together for video, and where the implementation details quietly decide whether the system is useful or merely impressive in a demo.
The pipeline
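The end-to-end shape is straightforward enough to draw on a napkin:
Ingestion: video upload → transcription → chunking → embedding → vector store
Query: user question → query embedding → filtered search → reranking → reasoning model → moments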
Two flows, sharing the same store. The ingestion flow runs once per video, asynchronously. The query flow runs once per user question, synchronously. Everything else is decisions about how each step is implemented.
Ingestion: from video to indexed chunks
Transcription. A video is uploaded to S3. An event triggers a transcription job — Rev AI, Amazon Transcribe, AssemblyAI, or Whisper running on a GPU instance, depending on language support and accuracy requirements. The transcript comes back as a sequence of segments, each carrying text and start/end timestamps. The timestamps are the single most important piece of metadata in the entire pipeline, and they survive every transformation downstream.
Chunking. The raw transcript is too granular for retrieval (individual segments are often only a few seconds long) and the whole transcript is too coarse. The chunker groups segments into coherent units — typically 30 to 90 seconds of speech, or roughly 200 to 500 tokens — aligned on natural pause points, speaker changes, or topic shifts where the transcription model emits them. Each chunk preserves the start timestamp of its first segment and the end timestamp of its last. The chunk is the unit of retrieval; the timestamp range is what makes it useful.
A chunk record in the store looks like this:
{
  "videoId": "vid_4218",
  "chunkId": "vid_4218_c_017",
  "text": "Prayer is not only personal but corporate. When we gather...",
  "start": 720,
  "end": 786,
  "speaker": "main",
  "topicHint": null,
  "createdAt": "2026-05-12T14:08:00Z"
}
topicHint is left null at ingestion time — topics get inferred at query time, not at indexing time, so the index does not have to be rebuilt every time the topic taxonomy changes.
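The chunking step described above can be sketched in a few lines. This is a minimal illustration, not the production chunker: the `Segment` and `Chunk` types, the `pause_gap` threshold, and the rule "cut on a pause once the chunk is at least 30 seconds long, and always cut before exceeding 90 seconds" are assumptions chosen to match the ranges given in the text. Speaker changes and topic shifts are left out for brevity.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Segment:
    text: str
    start: float  # seconds from the beginning of the video
    end: float


@dataclass
class Chunk:
    text: str
    start: float  # start of the first segment in the chunk
    end: float    # end of the last segment in the chunk


def chunk_segments(
    segments: Iterable[Segment],
    min_seconds: float = 30.0,
    max_seconds: float = 90.0,
    pause_gap: float = 0.8,  # hypothetical threshold for a "natural pause"
) -> List[Chunk]:
    chunks: List[Chunk] = []
    current: List[Segment] = []

    def flush() -> None:
        # Emit the buffered segments as one chunk, keeping the timestamp range.
        if current:
            chunks.append(Chunk(
                text=" ".join(s.text for s in current),
                start=current[0].start,
                end=current[-1].end,
            ))
            current.clear()

    for seg in segments:
        if current:
            duration = current[-1].end - current[0].start
            gap = seg.start - current[-1].end
            would_exceed = seg.end - current[0].start > max_seconds
            natural_break = duration >= min_seconds and gap >= pause_gap
            if would_exceed or natural_break:
                flush()
        current.append(seg)
    flush()
    return chunks
```

Whatever the grouping rule, each chunk's `start` and `end` come straight from the segments it contains, so the timestamps are never recomputed or approximated.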
Embedding. Each chunk's text is embedded using Titan Embed or Cohere Embed v3 via Bedrock. Cohere Embed v3 has better multilingual performance, which matters for clients operating across English, French, Portuguese, or Arabic content. Titan is cheaper. The choice is a per-deployment decision based on language mix and budget. Either way, the output is a vector — typically 1024 dimensions — that captures the semantic content of the chunk.
Vector storage. The vector goes into OpenSearch, indexed alongside the chunk's metadata. We use OpenSearch instead of a dedicated vector database for three reasons: it sits naturally in the AWS ecosystem alongside Bedrock and S3, it supports hybrid search (semantic plus keyword) in a single query, and the operational story — backups, replicas, monitoring — is the same story the team already runs for any other Elasticsearch-shaped workload. For deployments under a few hundred thousand chunks, pgvector inside the existing Postgres works fine and saves the OpenSearch operating cost.
The ingestion side is event-driven end to end. S3 upload triggers Lambda, Lambda enqueues a transcription job, transcription completion triggers chunking, chunking triggers embedding, embedding writes to OpenSearch. Each stage has its own dead-letter queue. A video that fails at one stage does not block any other video; the failed item lands in a queue an operator can inspect.
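As a rough sketch of the first hop in that chain, the function below turns an S3 upload notification into transcription job messages. The job message shape (`stage`, `mediaUri`) and the function names are hypothetical, and the actual enqueue call is left as a comment so the example runs without AWS credentials. The one detail taken from S3 itself is that object keys arrive URL-encoded in the event and need decoding.

```python
from typing import Any, Dict, List
from urllib.parse import unquote_plus


def transcription_jobs_from_s3_event(event: Dict[str, Any]) -> List[Dict[str, str]]:
    """Build one job message per uploaded object in an S3 event notification."""
    jobs = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        bucket = s3["bucket"]["name"]
        # Keys in S3 notifications are URL-encoded (spaces arrive as "+").
        key = unquote_plus(s3["object"]["key"])
        jobs.append({
            "stage": "transcribe",  # hypothetical stage label
            "bucket": bucket,
            "key": key,
            "mediaUri": f"s3://{bucket}/{key}",
        })
    return jobs


def handler(event: Dict[str, Any], context: Any = None) -> Dict[str, Any]:
    """Lambda-style entry point (sketch)."""
    jobs = transcription_jobs_from_s3_event(event)
    # A real deployment would send each job to the transcription queue here,
    # with a dead-letter queue configured on that queue for failed items.
    return {"enqueued": len(jobs), "jobs": jobs}
```

Keeping the parsing separate from the enqueue call makes each stage easy to test in isolation, which fits the one-queue-per-stage design.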
Retrieval: from a question to relevant moments
When a user asks "show me where we discussed corporate prayer," the system never sends the full transcript to a model. The flow is:
- The query is embedded with the same model used at ingestion. (Same model — different embedding spaces are not comparable.)
- OpenSearch runs a knn search against the chunk index, optionally combined with a BM25 keyword search for hybrid retrieval. Hybrid search catches the cases where semantic search drifts (a name, a verbatim phrase, an identifier) and where keyword search fails (paraphrased questions, conceptual queries).
- Metadata filters narrow the candidate space before vector math runs: video ID if the user is searching inside one video, date range, speaker, language, access permissions. Filtering before search is roughly the difference between a query that returns in 80 ms and one that returns in 800 ms.
- The top 20 candidates are reranked. A reranking model (Cohere Rerank via Bedrock) scores each candidate against the original query using cross-attention rather than embedding similarity alone. The top 5 after reranking are what the reasoning layer sees.
- Claude receives the user's question plus the top 5 chunks plus an instruction to identify which chunks actually contain the requested moment. The output is a structured list of moments — chunk IDs, timestamps, and a one-sentence summary of each.
The crucial property is that the timestamps survive the whole journey. The vector store had them. The retrieval returned them. The reasoning layer was instructed to preserve them. The result the user sees is a list of clickable clips, not a paragraph of prose about what the video "probably" contains.
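To make the filtered search step concrete, here is a sketch of a request body for an OpenSearch k-NN query with the metadata filter placed inside the `knn` clause. The field names `embedding` and `allowedGroups` are assumptions for illustration (`videoId`, `start`, and `end` come from the chunk record shown earlier), and whether a filter is accepted inside the `knn` clause depends on the k-NN engine and OpenSearch version in use. The function only builds the dictionary; sending it to a cluster is left out.

```python
from typing import Any, Dict, List, Optional


def build_knn_query(
    query_vector: List[float],
    allowed_groups: List[str],
    video_id: Optional[str] = None,
    k: int = 20,
) -> Dict[str, Any]:
    """Build a filtered k-NN search body that returns the top-k candidate chunks."""
    # Permission filter is always present; other filters are optional.
    filters: List[Dict[str, Any]] = [{"terms": {"allowedGroups": allowed_groups}}]
    if video_id is not None:
        filters.append({"term": {"videoId": video_id}})

    return {
        "size": k,
        # Return the timestamps with every hit so they survive retrieval.
        "_source": ["videoId", "chunkId", "text", "start", "end"],
        "query": {
            "knn": {
                "embedding": {
                    "vector": query_vector,
                    "k": k,
                    "filter": {"bool": {"must": filters}},
                }
            }
        },
    }
```

The 20 hits returned by a query like this are the candidates handed to the reranker, which then keeps the top 5 for the reasoning layer.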
The output the user actually wants
Semantic retrieval is the engine. The product is moments — discrete, playable spans of video with a topic label and a summary. The schema is small enough to embed in a single response:
{
  "query": "corporate prayer",
  "moments": [
    { "videoId": "vid_4218", "start": 720, "end": 786,
      "summary": "Opening framing: prayer as a corporate, not just personal, act." },
    { "videoId": "vid_4218", "start": 1145, "end": 1212,
      "summary": "Example of corporate prayer during the early church." }
  ]
}
That structure is what the frontend renders into a clip list. It is what an editing tool reads to assemble a highlight reel. It is what an analytics pipeline aggregates to produce "topics covered this month." The same retrieval pipeline serves search, summarization, and curation, because the unit of work — the moment — is consistent.
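A consumer of that response might parse and render it along these lines. This is a sketch only: the `Moment` class, the validation rule, and the label format are assumptions, not part of the schema above.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Moment:
    video_id: str
    start: int  # seconds
    end: int    # seconds
    summary: str


def format_timestamp(seconds: int) -> str:
    """Render seconds as M:SS, or H:MM:SS for moments past the first hour."""
    hours, rem = divmod(int(seconds), 3600)
    minutes, secs = divmod(rem, 60)
    if hours:
        return f"{hours}:{minutes:02d}:{secs:02d}"
    return f"{minutes}:{secs:02d}"


def parse_moments(payload: str) -> List[Moment]:
    """Parse the response JSON and reject moments with an empty or inverted span."""
    data = json.loads(payload)
    moments = []
    for item in data["moments"]:
        if item["end"] <= item["start"]:
            raise ValueError(f"invalid span in {item['videoId']}")
        moments.append(
            Moment(item["videoId"], item["start"], item["end"], item["summary"])
        )
    return moments


def clip_label(moment: Moment) -> str:
    """One line of a clip list, as a frontend might display it."""
    span = f"{format_timestamp(moment.start)}-{format_timestamp(moment.end)}"
    return f"{moment.video_id} {span}: {moment.summary}"
```

Validating the span at the boundary is cheap insurance: a reasoning model that returns `end` before `start` should fail loudly rather than produce an unplayable clip.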
What separates the working systems from the impressive demos
A pipeline that runs on a single 45-minute video in a notebook is not the same pipeline that runs on twenty thousand hours in production. The differences are not glamorous, but they decide whether the system stays useful.
Chunk size matters more than embedding choice. Teams obsess over which embedding model to pick. The data on production retrieval quality is consistent: chunk strategy outweighs embedding choice by a wide margin. Get the chunks right first.
Reranking is not optional. Initial vector retrieval is fast and approximate. Reranking adds 150 to 400 ms of latency and routinely lifts top-5 precision by 15 to 25 percent. The cost is small, the accuracy gain is large, the architecture is straightforward — there is no good reason to skip it once the corpus grows.
Timestamps drift. Transcription models occasionally misalign their segments by one or two seconds. For a clip extraction product, that is the difference between starting on the speaker's word and starting mid-syllable. A "find mark" step that re-aligns the timestamp to the nearest sentence boundary in the original transcript closes this gap, and it is worth building.
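A "find mark" step of this kind can be quite small. The sketch below snaps a timestamp to the nearest sentence boundary from a sorted list of boundary times. The function name and the `max_shift` guard (which leaves the timestamp alone when no boundary is close enough) are assumptions, and how sentence boundaries are derived from the transcript is out of scope here.

```python
from bisect import bisect_left
from typing import Sequence


def snap_to_boundary(
    timestamp: float,
    boundaries: Sequence[float],  # sentence start times, sorted ascending
    max_shift: float = 2.0,       # hypothetical limit on how far to move
) -> float:
    """Return the nearest sentence boundary, or the input if none is close."""
    if not boundaries:
        return timestamp
    i = bisect_left(boundaries, timestamp)
    # Only the neighbours on either side of the insertion point can be nearest.
    candidates = boundaries[max(i - 1, 0): i + 1]
    nearest = min(candidates, key=lambda b: abs(b - timestamp))
    return nearest if abs(nearest - timestamp) <= max_shift else timestamp
```

Capping the shift keeps a sparse or faulty boundary list from dragging a clip far away from the moment the retrieval step actually found.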
Cost lives in embedding, not in retrieval. Indexing twenty thousand hours of video produces millions of chunks and millions of embedding calls. Retrieval costs are negligible by comparison — a vector search query is fractions of a cent. The deployment plan needs an honest ingestion budget; the running budget is mostly the cost of the reasoning model.
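A back-of-the-envelope estimate helps size that ingestion budget. The sketch below assumes non-overlapping chunks, one embedding call per chunk, and the chunk lengths and token counts quoted earlier. Prices are deliberately left out because they vary by model and change over time.

```python
import math


def estimate_chunk_count(hours: float, chunk_seconds: float) -> int:
    """Number of non-overlapping chunks (one embedding call each)."""
    return math.ceil(hours * 3600 / chunk_seconds)


def estimate_embedded_tokens(chunks: int, tokens_per_chunk: int) -> int:
    """Total tokens sent to the embedding model during ingestion."""
    return chunks * tokens_per_chunk


if __name__ == "__main__":
    for seconds in (30, 60, 90):
        count = estimate_chunk_count(20_000, seconds)
        print(f"{seconds}s chunks: {count:,} embedding calls")
```

For twenty thousand hours this gives 2,400,000 chunks at 30 seconds, 1,200,000 at 60 seconds, and 800,000 at 90 seconds, so the chunk length chosen during ingestion moves the embedding bill by a factor of three.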
Permissions are not a layer you bolt on. Every chunk carries its source video's access metadata. The retrieval query enforces it as a filter. The reasoning layer never sees content the user does not have permission to see. This has to be in the index from day one — retrofitting permissions onto an existing vector store is painful and error-prone.
What the same architecture unlocks downstream
Once a transcript is chunked, embedded, and indexed, the same pipeline does more than search.
Topic extraction. A topic is a query in disguise. "Find the parts about pricing" is a search; "what topics did this video cover" is a clustering problem solved over the same chunks.
Automatic summarization. Summarize each chunk, then summarize the summaries, hierarchically. Each step is the reasoning model receiving a narrow, high-quality window of retrieved content.
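The hierarchical pattern can be sketched independently of any particular model. In the example below, `summarize` is a placeholder for a call to the reasoning model, and `fan_in` (how many summaries are merged per step) is an assumed tuning parameter.

```python
from typing import Callable, List


def hierarchical_summary(
    texts: List[str],
    summarize: Callable[[List[str]], str],  # placeholder for a model call
    fan_in: int = 4,
) -> str:
    """Summarize each chunk, then repeatedly summarize groups of summaries."""
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2 so each level shrinks")
    if not texts:
        return ""
    # Level 0: one summary per chunk.
    level = [summarize([t]) for t in texts]
    # Higher levels: merge fan_in summaries at a time until one remains.
    while len(level) > 1:
        level = [
            summarize(level[i:i + fan_in])
            for i in range(0, len(level), fan_in)
        ]
    return level[0]
```

Because each call sees at most `fan_in` short inputs, the context handed to the model stays narrow no matter how long the video is.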
Document-driven moment extraction. Upload a study guide or a lecture syllabus. The reasoning model extracts the questions or topics from the document, the retrieval layer finds the matching moments in the video corpus, and the system returns a mapping from document section to video timestamp range. The same primitives — chunks, embeddings, retrieval, reasoning — compose into a product feature that would otherwise be its own multi-month project.
Cross-video search. Once every video in the corpus is indexed against the same embedding space, the user can ask one question across the entire library. The retrieval layer does not care whether the answer is in one video or fifty.
That is the leverage of the architecture: ingest once, index once, and every new feature is a different way of composing retrieval and reasoning over the same chunks. The transcript stops being a thing you scroll through. It becomes a thing the system understands well enough to act on. That is the difference between video as a file and video as intelligence.
