Turning Uploaded Documents into Searchable Video Intelligence

The user has the questions. They just do not know how to phrase them.

A user opens a search box over a video library. The cursor blinks. They have a vague sense of what they want but no clean query for it. They have a study guide that lists fourteen questions. They have a syllabus with eight topics per week. They have a training agenda with twenty-three procedures. The structure of what they actually need lives in a document. The search box does not know that.

So they type "prayer" or "logistics" or "compliance" and get a flat list of moments. Mapping those results back onto the document is work they still have to do themselves. The system gave them search. The user wanted alignment.

The feature that closes this gap is conceptually simple and operationally rich. The user uploads the document. The system reads it, extracts the structure — topics, questions, sections — and runs each one through the same semantic retrieval pipeline that powers search over the video transcript index. The output is a mapping from document section to a list of video moments. The user gets their study guide back, annotated with where each idea lives in the footage.

This is the pattern we deploy for clients who run faith-based, educational, and corporate training video platforms. The primitive is not new — it composes retrieval and reasoning over an existing chunk index — but the user experience it produces is a category shift from generic search.

The user does not have to know what to search for

The reason document-driven retrieval is more powerful than free-text search is not technical. It is cognitive.

Free-text search assumes the user can articulate what they want. In practice, most users cannot. They know what they are working on — a curriculum, a small group lesson, an onboarding plan — but the exact phrase that will retrieve the right moment is hidden from them. They search for "leadership" and the speaker said "stewardship." They search for "rollout plan" and the relevant segment talks about "go-live phasing." Keyword search fails. Even semantic search needs a half-decent query to start from.

A document supplies the queries. Fourteen questions in a study guide become fourteen queries. Eight topics in a syllabus week become eight queries. The user does not have to think about what to search. They hand the system the artifact that already encodes the structure of their work, and the system asks the questions on their behalf.

This shifts the entire interaction. The user is no longer composing search terms — they are uploading intent.

The pipeline

The flow is short. Every step uses primitives that already exist in a working semantic video pipeline.

Document upload → Preprocessing (OCR if scanned) → LLM topic extraction
                                                            ↓
                                              Structured topic list
                                                            ↓
                              For each topic: embed → vector search → rerank
                                                            ↓
                                                   Candidate chunks
                                                            ↓
                                              Reasoning over chunks
                                                            ↓
                                                  Moments per topic
                                                            ↓
                                          Persist Document → Topics → Moments

Each stage has a narrow job. The document is parsed once, the topics are extracted once, the retrieval runs once per topic, and the result is a tree the frontend can render directly.
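As a rough sketch of how the stages compose, here is the flow with each stage passed in as a plain callable. Every name below (align_document, preprocess, extract_topics, find_moments, and the three dataclasses) is hypothetical and chosen for illustration; none of it is a real library API.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Moment:
    video_id: str
    start: int  # seconds from the start of the video
    end: int
    summary: str


@dataclass
class Topic:
    topic_id: str
    section: str
    query: str
    moments: list[Moment] = field(default_factory=list)


@dataclass
class AlignedDocument:
    document_id: str
    title: str
    topics: list[Topic]


def align_document(
    document_id: str,
    raw_bytes: bytes,
    preprocess: Callable[[bytes], str],
    extract_topics: Callable[[str], tuple[str, list[Topic]]],
    find_moments: Callable[[str], list[Moment]],
) -> AlignedDocument:
    """Run the stages in order; each stage is injected, so this is only the wiring."""
    text = preprocess(raw_bytes)          # document parsed once
    title, topics = extract_topics(text)  # topics extracted once
    for topic in topics:                  # retrieval runs once per topic
        topic.moments = find_moments(topic.query)
    return AlignedDocument(document_id, title, topics)
```

Because the stages are injected, the same wiring can be exercised with stubs in tests and with the real OCR, model, and index clients in production.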

Preprocessing: getting the document into clean text

A document arrives as a PDF, a Word file, an EPUB, or a scanned image. The preprocessor normalizes all of these to clean text plus a light structural outline — headings, lists, numbered questions, section breaks.

For native digital documents this is mostly extraction. For scanned material it is OCR, using Textract, Tesseract, or another OCR engine depending on language coverage. The output of this stage is two things: the plain text of the document, and a list of structural anchors (page numbers, heading positions) that survive the rest of the pipeline. The anchors matter later: they let the UI link a topic back to the exact spot in the original document, the same way timestamps anchor moments back to video.

The preprocessor is also where format quirks get neutralized. A study guide with handwritten annotations gets the annotations dropped. A syllabus with a table of contents gets the table of contents collapsed. A training manual with embedded diagrams gets the diagrams replaced with their captions. None of this is glamorous. All of it decides whether the next stage gets a clean input or a noisy one.
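To make the idea of structural anchors concrete, here is a minimal sketch that scans already-extracted page text for numbered items and heading-like lines. It assumes the earlier extraction or OCR step yields one string per page. The Anchor type, find_anchors, and the heading heuristic are all illustrative assumptions, and the heuristic is deliberately crude: a wrapped body line with no terminal punctuation would be misread as a heading.

```python
import re
from dataclasses import dataclass

# Matches "1. text" or "1) text" at the start of a line.
NUMBERED = re.compile(r"^\s*(\d+)[.)]\s+(\S.*)$")


@dataclass
class Anchor:
    page: int  # 1-based page number
    line: int  # 1-based line number within the page
    kind: str  # "heading" or "numbered_item"
    text: str


def looks_like_heading(line: str) -> bool:
    """Crude heuristic: short, capitalized, and not ending like a sentence."""
    stripped = line.strip()
    return (
        0 < len(stripped) <= 60
        and stripped[0].isupper()
        and stripped[-1] not in ".?!,;"
    )


def find_anchors(pages: list[str]) -> list[Anchor]:
    anchors = []
    for page_no, page in enumerate(pages, start=1):
        for line_no, line in enumerate(page.splitlines(), start=1):
            match = NUMBERED.match(line)
            if match:
                anchors.append(
                    Anchor(page_no, line_no, "numbered_item", match.group(2).strip())
                )
            elif looks_like_heading(line):
                anchors.append(Anchor(page_no, line_no, "heading", line.strip()))
    return anchors
```

In practice the extraction model does the hard structural work. A pass like this only has to produce positions that the later stages can point back to.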

Topic extraction: the document becomes a query plan

The cleaned text is sent to a reasoning model — Claude on Bedrock, in our deployments — with an extraction prompt that asks for a structured topic tree. The instruction is specific: identify the discrete units the user would want to find in a video, preserve the document's own hierarchy, and emit a JSON structure with stable identifiers.

The model returns something like this:

{
  "documentId": "doc_8821",
  "title": "Sermon study guide — Week 6: Prayer in community",
  "topics": [
    {
      "topicId": "t_001",
      "section": "Opening discussion",
      "query": "What does it mean to pray together as a community rather than alone?",
      "anchors": { "page": 1, "heading": "Opening discussion" }
    },
    {
      "topicId": "t_002",
      "section": "Scripture focus",
      "query": "How does Acts 2:42 describe the early church's practice of prayer?",
      "anchors": { "page": 1, "heading": "Scripture focus" }
    },
    {
      "topicId": "t_003",
      "section": "Application",
      "query": "How can a small group structure regular times of corporate prayer?",
      "anchors": { "page": 2, "heading": "Application" }
    }
  ]
}

Three properties of this output matter.

Each topic carries a natural-language query, not just a label. The query is the form the retrieval layer wants. "Application" alone is a poor search term. "How can a small group structure regular times of corporate prayer" is a query the vector index can do something useful with. The extraction model's main job is rewriting document fragments into retrievable questions.

Each topic carries an anchor back to the document. When the user reads their study guide later, they expect to click "Application" and see the moments. The anchor is what makes that link possible.

Each topic carries a stable identifier. This matters once results are persisted — the user can edit the document, re-extract topics, and the system can diff old topics against new ones without losing the moments already attached to unchanged sections.

The extraction model is also doing implicit error correction. A poorly structured document — bullet points with no headings, mixed numbering schemes, paragraphs that contain three separate questions — gets normalized into a clean topic list. The reasoning layer is paying its rent here, doing structural work that no regex or document parser can do reliably.
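Model output should still be checked before it drives retrieval. The sketch below parses the JSON shape shown above and enforces the three properties: a non-empty query, an anchor, and a unique identifier. The field names come from the example; the ExtractedTopic type, the parse_topic_tree function, and the specific checks are assumptions for illustration.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtractedTopic:
    topic_id: str
    section: str
    query: str
    page: int
    heading: str


def parse_topic_tree(raw: str) -> list[ExtractedTopic]:
    """Parse the model's JSON and reject output that breaks the contract."""
    data = json.loads(raw)
    topics = []
    seen_ids = set()
    for item in data["topics"]:
        topic_id = item["topicId"]
        if topic_id in seen_ids:
            raise ValueError(f"duplicate topicId: {topic_id}")
        seen_ids.add(topic_id)

        query = item["query"].strip()
        if not query:
            raise ValueError(f"topic {topic_id} has an empty query")

        anchors = item["anchors"]  # a missing anchor raises KeyError
        topics.append(
            ExtractedTopic(
                topic_id=topic_id,
                section=item["section"],
                query=query,
                page=int(anchors["page"]),
                heading=anchors["heading"],
            )
        )
    return topics
```

A failed check is a reasonable trigger for retrying the extraction call rather than persisting a malformed topic list.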

Retrieval: each topic runs through the same pipeline as a user query

Once the topic list exists, the rest of the system is already built. Each topic's query is embedded with the same model used at video ingestion. The embedding is run as a vector search against the OpenSearch chunk index. Metadata filters narrow the search to the video corpus the user has selected: a single sermon series, a course's lectures, a meeting room's recordings. The top candidates are reranked. The reasoning model identifies which chunks contain the moment that answers the topic, and the timestamps are anchored to transcript content so the resulting clips actually start where the speaker starts.

Each topic produces its own list of moments. Most topics produce two to five. Some produce zero. Some produce a dozen — and that is fine. The user is not asking for a single answer per topic; they are asking for every place this idea shows up.
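The retrieval step can be illustrated with a toy in-memory version. This is a stand-in, not the production path: a real deployment would call the embedding model and the OpenSearch index, then rerank and reason over the candidates. The Chunk type, the retrieve function, and the min_score and top_k values are assumptions picked for the example.

```python
import math
from dataclasses import dataclass


@dataclass
class Chunk:
    video_id: str
    corpus: str
    start: int  # seconds
    end: int
    vector: list[float]


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(
    query_vector: list[float],
    chunks: list[Chunk],
    corpus: str,
    min_score: float = 0.5,
    top_k: int = 5,
) -> list[tuple[float, Chunk]]:
    """Filter by corpus, score by cosine similarity, drop weak hits, keep the best."""
    scored = [(cosine(query_vector, c.vector), c) for c in chunks if c.corpus == corpus]
    kept = [(score, c) for score, c in scored if score >= min_score]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return kept[:top_k]
```

The relevance threshold is what allows a topic to come back empty. Without it, every topic would return its nearest neighbors no matter how weak the match.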

The data model: a Document is a tree of Topics, each Topic is a list of Moments

The persistence shape is small enough to fit in a single response. Once retrieval completes, the result looks like this:

{
  "documentId": "doc_8821",
  "title": "Sermon study guide — Week 6: Prayer in community",
  "sourceCorpus": "sermon_archive_2024",
  "createdAt": "2026-05-14T11:42:00Z",
  "topics": [
    {
      "topicId": "t_001",
      "section": "Opening discussion",
      "query": "What does it mean to pray together as a community rather than alone?",
      "moments": [
        { "videoId": "vid_4218", "start": 720, "end": 786,
          "summary": "Opening framing: prayer as a corporate, not just personal, act." },
        { "videoId": "vid_4231", "start": 442, "end": 511,
          "summary": "Distinction between private devotion and gathered prayer." }
      ]
    },
    {
      "topicId": "t_002",
      "section": "Scripture focus",
      "query": "How does Acts 2:42 describe the early church's practice of prayer?",
      "moments": [
        { "videoId": "vid_4218", "start": 1145, "end": 1212,
          "summary": "Direct exposition of Acts 2:42, focused on devotion to prayer." }
      ]
    },
    {
      "topicId": "t_003",
      "section": "Application",
      "query": "How can a small group structure regular times of corporate prayer?",
      "moments": []
    }
  ]
}

The third topic returned no moments. That is a real signal — the video corpus does not cover this part of the study guide — and the UI surfaces it honestly. The user sees the gap, which is itself useful information: the small group leader knows they need to bring outside material to that section.

Stored, this becomes three tables: Documents, Topics (foreign key to Document), and Moments (foreign key to Topic, plus the video ID and the timestamp range). The schema mirrors the tree exactly. Updates are cheap: re-extracting topics after a document edit creates new Topic rows only for the sections that changed, and re-running retrieval for a single topic does not touch the rest.
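One way the three tables could look, sketched with SQLite from the Python standard library. Table names, column names, and the helper functions are assumptions. So is the composite key on topics, chosen here because identifiers like t_001 appear to be unique only within a document.

```python
import sqlite3

SCHEMA = """
CREATE TABLE documents (
    document_id   TEXT PRIMARY KEY,
    title         TEXT NOT NULL,
    source_corpus TEXT NOT NULL,
    created_at    TEXT NOT NULL
);
CREATE TABLE topics (
    topic_id    TEXT NOT NULL,
    document_id TEXT NOT NULL REFERENCES documents(document_id) ON DELETE CASCADE,
    section     TEXT NOT NULL,
    query       TEXT NOT NULL,
    PRIMARY KEY (document_id, topic_id)
);
CREATE TABLE moments (
    moment_id   INTEGER PRIMARY KEY,
    document_id TEXT NOT NULL,
    topic_id    TEXT NOT NULL,
    video_id    TEXT NOT NULL,
    start_sec   INTEGER NOT NULL,
    end_sec     INTEGER NOT NULL CHECK (end_sec > start_sec),
    summary     TEXT NOT NULL,
    FOREIGN KEY (document_id, topic_id)
        REFERENCES topics(document_id, topic_id) ON DELETE CASCADE
);
"""


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves this off by default
    conn.executescript(SCHEMA)
    return conn


def replace_moments(conn, document_id, topic_id, moments):
    """Swap the moments of one topic; rows for other topics are not touched."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "DELETE FROM moments WHERE document_id = ? AND topic_id = ?",
            (document_id, topic_id),
        )
        conn.executemany(
            "INSERT INTO moments "
            "(document_id, topic_id, video_id, start_sec, end_sec, summary) "
            "VALUES (?, ?, ?, ?, ?, ?)",
            [
                (document_id, topic_id, m["videoId"], m["start"], m["end"], m["summary"])
                for m in moments
            ],
        )
```

A topic with no moments is simply a Topic row with no child rows, so the coverage gap is stored without any special marker.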

The use cases compose

The same pattern serves several products without changing any of the plumbing.

Sermon study guides map onto a sermon archive. Each question in the guide becomes a query against the church's recorded teaching, and small group leaders get a ready-made list of clips to play during discussion. Scripture references in the guide become an extra filter — the retrieval layer prefers moments where the speaker quotes the cited passage.

Lecture syllabi map onto a course's recordings. A student reading week six's topics can jump directly to the segments of the recorded lectures where each concept was taught. The professor uploads the syllabus once, the index is built once, and every cohort of students gets the same alignment automatically.

Training manuals map onto a recorded training corpus. Each procedure in the manual becomes a query for the segment where that procedure was demonstrated. New hires read a step, watch the relevant 90 seconds, and move on.

Meeting agendas map onto recorded meetings. The agenda for last Tuesday's planning session becomes a structured index of when each item was discussed. Someone who missed the meeting gets a navigation tree, not a two-hour recording.

The architecture does not change between these. The extraction prompt is mildly different per document type — a syllabus is parsed differently than a study guide — but the retrieval pipeline, the chunk index, the moment schema, and the timestamp anchoring are identical. Each new use case is a different document type composed against the same primitives.

The edge cases

A real implementation lives in its edge cases.

Documents with no clear structure. A free-form prose document — an essay, a long-form question, an unstructured reflection — has no headings or numbered list to anchor topics against. The extraction model is allowed to invent structure: it returns a flat list of topics with synthetic section labels and an explicit structureInferred: true flag. The UI displays this state differently, signaling that the topic boundaries are the system's interpretation rather than the author's.

Topics not covered in the video corpus. A study guide asks a question the sermon never addressed. The retrieval returns nothing above the relevance threshold. The system persists the topic with an empty moments list, the UI shows the gap, and an analytics signal goes to the platform owner: "twelve percent of uploaded study guides reference topics not in your corpus." That is a content planning signal, not a bug.
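The analytics figure is a simple ratio. A minimal sketch, assuming each persisted document has the shape shown earlier (a topics list whose entries each carry a moments list); the function name is hypothetical:

```python
def coverage_gap_rate(documents: list[dict]) -> float:
    """Share of documents with at least one topic that returned no moments."""
    if not documents:
        return 0.0
    with_gaps = sum(
        1 for doc in documents if any(not topic["moments"] for topic in doc["topics"])
    )
    return with_gaps / len(documents)
```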

Multiple matches per topic. A topic about "prayer" surfaces twenty-two candidate moments across the corpus. Ranking matters here. The pipeline keeps the top three to five by relevance score, but exposes a "show more" path that re-queries with a higher candidate ceiling. The user does not have to scroll through all twenty-two; the system makes the cut and lets the user widen it on demand.
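A sketch of the cut and the "show more" path, assuming the search backend is passed in as a callable that takes a query and a candidate ceiling. The function name, the hasMore flag, and the default numbers are illustrative assumptions.

```python
from typing import Callable

Scored = tuple[float, str]  # (relevance score, moment id)


def select_moments(
    search: Callable[[str, int], list[Scored]],
    query: str,
    limit: int = 5,
    ceiling: int = 20,
) -> dict:
    """Fetch up to `ceiling` candidates, return the best `limit` plus a flag."""
    candidates = search(query, ceiling)
    ranked = sorted(candidates, key=lambda pair: pair[0], reverse=True)
    return {"moments": ranked[:limit], "hasMore": len(ranked) > limit}
```

The first call uses the defaults. When the user asks for more, the same function is called again with a larger limit and ceiling.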

Documents that span multiple corpora. A semester syllabus references both the recorded lectures and a separate library of guest talks. The retrieval pipeline runs each topic against both indexes and merges results, with the source clearly labeled on each moment. The user does not need to know in advance which corpus holds a result. They care that the syllabus is now navigable.
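Merging can be as small as the sketch below. It assumes both indexes were built with the same embedding model and reranker, so that scores are comparable across corpora; if they are not, the scores would need normalizing first. The function name and the sourceCorpus field on each moment are assumptions.

```python
def merge_corpora(results_by_corpus: dict[str, list[dict]], limit: int = 5) -> list[dict]:
    """Label each moment with its corpus, then rank all moments together."""
    merged = []
    for corpus, moments in results_by_corpus.items():
        for moment in moments:
            merged.append({**moment, "sourceCorpus": corpus})
    merged.sort(key=lambda m: m["score"], reverse=True)
    return merged[:limit]
```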

Re-extraction after document edits. A user updates their study guide. The system re-runs extraction, diffs the new topic list against the old one, and only re-queries the changed topics. Stable identifiers on topics make this safe. Moments attached to unchanged topics survive the edit.
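The diff needs some rule for deciding that a new topic is the same as an old one. The sketch below matches topics by a fingerprint of the normalized query text, which is an assumption made for this example. Under this rule a reworded question counts as changed and is re-queried. The function names and the shape of the return value are hypothetical.

```python
import hashlib


def fingerprint(query: str) -> str:
    """Hash of the query with case and whitespace differences removed."""
    normalized = " ".join(query.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:16]


def diff_topics(old: list[dict], new: list[dict]) -> dict:
    """Split re-extracted topics into unchanged, to-query, and removed."""
    old_by_fp = {fingerprint(t["query"]): t for t in old}
    unchanged, to_query = [], []
    for topic in new:
        previous = old_by_fp.pop(fingerprint(topic["query"]), None)
        if previous is not None:
            # Reuse the old identifier so attached moments stay linked.
            unchanged.append({**topic, "topicId": previous["topicId"]})
        else:
            to_query.append(topic)
    removed = list(old_by_fp.values())
    return {"unchanged": unchanged, "toQuery": to_query, "removed": removed}
```

Only the toQuery list goes back through retrieval. Topics in removed can be deleted along with their moments, or kept for history, depending on the product.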

What this unlocks

A user who uploads a document to a video platform is doing the system a favor. They are handing over a curated, structured map of what they care about. The platform that treats that document as a query plan — extract topics, retrieve moments, persist the alignment — turns its video corpus into something the user can navigate by their own thinking, not by the platform's search box.

The architecture is not new. It is the same retrieve-first, reason-second pattern, composed against the same transcript chunk index, with the same timestamp anchoring on the way out. What changes is the entry point. Instead of a query, the system accepts a document. Instead of a list of moments, it returns a tree.

The product feels different from search because it is different from search. Search makes the user do the indexing in their head. Document-driven retrieval lets the user hand the indexing problem to the system. The user reads their own study guide, and every section is already wired to the moments. That is the difference between a search box and intelligence.