Building AI-Assisted Video Moments from Natural Language

The output that ends the demo

A product manager sits down to test the new AI video feature. They type "show me every moment discussing pricing" into the search box. The system thinks for two seconds and returns a paragraph: The video discusses pricing in several sections, including an early framing of the topic and a later comparison with competitors. The speaker explains the rationale behind the pricing model and addresses common objections.

The paragraph is accurate. It is also useless. The user cannot press play on a paragraph.

This is the failure that distinguishes AI video features that ship from the ones that get rebuilt. The retrieval layer can be flawless. The reasoning model can produce a beautifully written summary. If the output is prose rather than a list of clickable clip ranges, the product does not work. The user came to find moments, not to read about them.

The fix is the data shape of the response. The unit of output is not a sentence. It is a moment — a discrete, playable span of video with a start time, an end time, and a small amount of metadata. Everything else in the architecture serves that primitive.

What a moment is

A moment, in our model, is one of two things. A mark is a single timestamp: the user (or the system) flagged a point of interest at second 1432. A clip is a range: second 720 to second 786 covers the opening of the pricing discussion. Both are moments. The difference is whether the moment has a duration.

Moments also carry a generation type. A user-generated moment came from someone explicitly marking it — a pause, a highlight, a manual selection in the UI. An AI-generated moment came from a query that produced a retrieval result. The system treats both as first-class entities. A user can edit an AI-generated moment. An AI-generated moment can be promoted into a curated playlist alongside user-generated ones. The downstream tooling — analytics, highlight reels, course outlines — does not care which kind of moment it is reading.

This sounds like a small modeling decision. It is structural. Once moments are first-class entities, the rest of the system stops treating AI output as ephemeral text and starts treating it as data the user can act on.
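The two kinds of moment and the generation type could be modeled roughly as follows. This is a minimal sketch, not the platform's actual types: the names Mark, Clip, GenerationType, and durationSeconds are hypothetical, and only momentType, start, and end correspond to fields named in this article.

```typescript
// Hypothetical type names. The article fixes the concepts, not these identifiers.
type GenerationType = "user" | "ai";

interface Mark {
  momentType: "mark";
  start: number; // seconds from the start of the video
  generationType: GenerationType;
}

interface Clip {
  momentType: "clip";
  start: number; // seconds
  end: number; // seconds, expected to be greater than start
  generationType: GenerationType;
}

type Moment = Mark | Clip;

// Duration is what separates the two kinds: a mark has none.
function durationSeconds(m: Moment): number {
  return m.momentType === "clip" ? m.end - m.start : 0;
}

const flagged: Moment = { momentType: "mark", start: 1432, generationType: "user" };
const pricing: Moment = { momentType: "clip", start: 720, end: 786, generationType: "ai" };
```

Modeling the two kinds as a discriminated union means downstream code can read any moment and only branch on momentType when duration matters.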

The pipeline

Generating a moment from a natural-language query takes six steps end to end. None of them involves the model writing prose.

Query → Embedding → Semantic retrieval → Moment construction
                                                ↓
                                       Topic structuring
                                                ↓
                                          Persistence

The retrieval step is the same retrieval pattern that powers the rest of the platform — we have written about it in Retrieve First, Reason Second. The query is embedded with the same model used at ingestion. A vector search over the chunk index returns the top candidates, narrowed by metadata filters and reranked for precision. The output of retrieval is a small list of chunks with their timestamps intact.

The moment construction step is where the architecture diverges from a generic RAG pipeline. The reasoning model does not receive an instruction to "answer the question." It receives an instruction to produce a list of moments in a specific JSON schema, with one moment per distinct span of relevant content within the retrieved chunks. The schema is fixed. The model fills it in.
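To make the difference from an "answer the question" prompt concrete, here is one way the instruction might be assembled. It is an illustrative sketch only: buildMomentPrompt, RetrievedChunk, the exact wording, and the rule that timestamps must stay inside the chunk ranges are all assumptions, not the production prompt.

```typescript
// Hypothetical shape of a retrieved chunk with its timestamps intact.
interface RetrievedChunk {
  start: number; // seconds
  end: number; // seconds
  text: string;
}

// Builds an instruction that asks for moments in a fixed JSON shape, not for an answer.
function buildMomentPrompt(query: string, videoId: string, chunks: RetrievedChunk[]): string {
  const transcript = chunks.map((c) => `[${c.start}-${c.end}] ${c.text}`).join("\n");
  return [
    "Return only valid JSON. Do not write prose outside the JSON.",
    `Shape: {"videoId": "${videoId}", "moments": [{"momentType": "mark" | "clip", ` +
      `"start": number, "end": number, "summary": string, "contentAnchor": string, "confidence": number}]}`,
    "Emit one moment per distinct span of content relevant to the query.",
    // Assumption: constraining timestamps to the retrieved ranges keeps the model grounded.
    "Every start and end must fall inside one of the chunk ranges below.",
    `Query: ${query}`,
    "Chunks:",
    transcript,
  ].join("\n");
}
```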

Why structured output beats free text

The argument for structured output is operational, not aesthetic.

Free text is unparseable in the general case. A model that returns "the pricing discussion happens around 12 minutes in and again near the end" is producing a sentence a human can read. It is producing nothing a frontend can render as a clip list. Parsing prose back into structured data is a fragile second pass: sometimes the model says "around 12 minutes," sometimes "12:04," sometimes "the twelfth minute," and the regex you wrote last week breaks on the new variant.

Structured output makes the contract between the model and the system explicit. The model is told the exact shape of the response. The model returns that shape. The frontend renders it. If the model fails to produce valid JSON, the system catches the parse error and retries — or surfaces a clear failure state — rather than rendering broken prose.

On Bedrock, this is enforced in two ways. First, the prompt includes the schema and an instruction to return only valid JSON conforming to it. Second, the response is parsed and validated against a Zod schema on the server before it leaves the API. If validation fails, the request is retried with a small temperature reduction. If retries fail, the moment is not generated and the user sees a graceful empty state.
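The parse, validate, and retry loop might look like the sketch below. Everything named here is an assumption for illustration: callModel stands in for whatever function invokes the model on Bedrock, validate stands in for the Zod check, and the attempt count, starting temperature, and step size are placeholder values.

```typescript
// Hypothetical stand-ins: callModel wraps the real model invocation,
// validate wraps the schema check and returns null when the shape is wrong.
type ModelCall = (temperature: number) => Promise<string>;
type Validator<T> = (value: unknown) => T | null;

async function generateWithRetry<T>(
  callModel: ModelCall,
  validate: Validator<T>,
  maxAttempts = 3, // assumed value
  startTemperature = 0.4, // assumed value
  temperatureStep = 0.1, // assumed size of the "small temperature reduction"
): Promise<T | null> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const temperature = Math.max(0, startTemperature - attempt * temperatureStep);
    const raw = await callModel(temperature);
    let parsed: unknown;
    try {
      parsed = JSON.parse(raw);
    } catch {
      continue; // not JSON at all: retry at a lower temperature
    }
    const valid = validate(parsed);
    if (valid !== null) return valid;
  }
  return null; // retries exhausted: the caller renders the empty state
}
```

Returning null rather than throwing keeps the failure path explicit: the API layer decides what the empty state looks like, and no unvalidated text ever reaches the frontend.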

The schema itself is small enough to write once and live with:

{
  "videoId": "vid_4218",
  "moments": [
    {
      "momentType": "clip",
      "start": 720,
      "end": 786,
      "summary": "Opening framing of the pricing discussion.",
      "contentAnchor": "Let's talk about how we think about pricing.",
      "confidence": 0.92,
      "topicId": "pricing_intro"
    },
    {
      "momentType": "clip",
      "start": 1145,
      "end": 1212,
      "summary": "Comparison with the previous pricing structure.",
      "contentAnchor": "The old model worked like this.",
      "confidence": 0.88,
      "topicId": "pricing_comparison"
    }
  ]
}

momentType distinguishes marks from clips. contentAnchor is the phrase used to re-derive the timestamp from the original transcript — the technique we cover in The Hidden Complexity of Timestamp Accuracy. topicId is the cluster the moment belongs to. confidence is a float between zero and one that travels with the moment all the way to the UI.
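A server-side check of that shape could be sketched as below. The article's server uses Zod; to keep this example dependency-free, a hand-written type guard stands in for it, so treat it as an approximation of the idea rather than the real validation code. The names MomentDto, MomentResponse, isMoment, and parseMomentResponse are hypothetical, as is the rule that a clip's end must come after its start.

```typescript
interface MomentDto {
  momentType: "mark" | "clip";
  start: number;
  end?: number; // assumption: marks may omit end
  summary: string;
  contentAnchor: string;
  confidence: number;
  topicId: string;
}

interface MomentResponse {
  videoId: string;
  moments: MomentDto[];
}

function isRecord(v: unknown): v is Record<string, unknown> {
  return typeof v === "object" && v !== null && !Array.isArray(v);
}

function isMoment(v: unknown): v is MomentDto {
  if (!isRecord(v)) return false;
  const { momentType, start, end, summary, contentAnchor, confidence, topicId } = v;
  if (momentType !== "mark" && momentType !== "clip") return false;
  if (typeof start !== "number" || !Number.isFinite(start) || start < 0) return false;
  // Assumption: a clip needs an end strictly after its start.
  if (momentType === "clip" && (typeof end !== "number" || end <= start)) return false;
  if (typeof summary !== "string" || typeof contentAnchor !== "string") return false;
  if (typeof topicId !== "string") return false;
  // Confidence is a float between zero and one.
  return typeof confidence === "number" && confidence >= 0 && confidence <= 1;
}

// Returns the validated response, or null if the text is not JSON or has the wrong shape.
function parseMomentResponse(raw: string): MomentResponse | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null;
  }
  if (!isRecord(data)) return null;
  const videoId = data.videoId;
  const moments = data.moments;
  if (typeof videoId !== "string") return null;
  if (!Array.isArray(moments) || !moments.every(isMoment)) return null;
  return { videoId, moments };
}
```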

Topic structuring: when one query returns scattered moments

A query like "pricing" rarely returns one cohesive block of video. It returns three to twelve moments scattered across the runtime. Some are the speaker introducing pricing. Some are a comparison with competitors. Some are objection handling. Some are a tangential mention that the retrieval layer surfaced because it matched on embedding similarity but is not really the same topic.

A flat list of twelve moments is overwhelming. The user looks at it, scrolls, gives up. The interface needs structure.

The topic structuring step is a second reasoning pass. The model receives the list of moments and is asked to cluster them into topics — typically two to five clusters per query. Each cluster gets a topic label and a brief description. The moments inherit their topic ID from the cluster they were assigned to. The output the user sees is a grouped list: "Pricing introduction" (two moments), "Pricing comparison" (three moments), "Customer objections" (one moment).
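Once the model has assigned topic IDs, turning the flat list into the grouped view is a small transformation. The sketch below assumes hypothetical shapes (Topic, TopicMoment, TopicGroup) and silently drops moments whose topic ID is unknown, which is a simplification rather than a recommendation.

```typescript
interface TopicMoment {
  topicId: string;
  start: number;
  end: number;
}

interface Topic {
  topicId: string;
  label: string;
}

interface TopicGroup {
  label: string;
  moments: TopicMoment[];
}

// Groups moments under their topic label, in topic order, sorted by start time.
function groupByTopic(topics: Topic[], moments: TopicMoment[]): TopicGroup[] {
  const byId = new Map<string, TopicGroup>();
  for (const t of topics) byId.set(t.topicId, { label: t.label, moments: [] });
  for (const m of moments) {
    const group = byId.get(m.topicId);
    if (group !== undefined) group.moments.push(m); // unknown topic IDs are dropped in this sketch
  }
  return Array.from(byId.values())
    .filter((g) => g.moments.length > 0) // empty clusters are not shown
    .map((g) => ({ label: g.label, moments: g.moments.slice().sort((a, b) => a.start - b.start) }));
}
```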

The reasoning model is the right tool for this because the clustering needs to respect semantic meaning, not just text similarity. A moment that talks about "what we charge" belongs in the same cluster as one that talks about "the price point," even though the surface text shares no keywords. An embedding-based clustering would also work, and we use it as a fallback when the model output cannot be parsed. The model-driven version produces cleaner topic labels, though, because it writes them itself rather than borrowing the term nearest a cluster's centroid.

The topic step is also where near-duplicate detection happens. Two moments that overlap by more than seventy percent, or that the model identifies as covering the same conceptual content, collapse into one. The merged moment takes the wider time range and the higher confidence score. The user sees one clip, not two near-identical ones competing for attention.
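The time-overlap half of that rule can be sketched as a single pass over clips sorted by start time. Two things here are assumptions: overlap is measured as shared time divided by the shorter clip's duration (the seventy percent could be defined differently), and each clip is compared only with the most recently merged one. The model-identified "same conceptual content" case is not covered.

```typescript
interface ClipMoment {
  start: number;
  end: number;
  confidence: number;
}

// Assumption: overlap is the shared time relative to the shorter of the two clips.
function overlapRatio(a: ClipMoment, b: ClipMoment): number {
  const shared = Math.min(a.end, b.end) - Math.max(a.start, b.start);
  const shorter = Math.min(a.end - a.start, b.end - b.start);
  return shared <= 0 || shorter <= 0 ? 0 : shared / shorter;
}

// Collapses near-duplicates: the survivor takes the wider range and the higher confidence.
function mergeNearDuplicates(clips: ClipMoment[], threshold = 0.7): ClipMoment[] {
  const sorted = clips.slice().sort((a, b) => a.start - b.start);
  const merged: ClipMoment[] = [];
  for (const clip of sorted) {
    const last = merged[merged.length - 1];
    if (last !== undefined && overlapRatio(last, clip) > threshold) {
      last.end = Math.max(last.end, clip.end); // start is already the minimum because of the sort
      last.confidence = Math.max(last.confidence, clip.confidence);
    } else {
      merged.push({ ...clip }); // copy so the input array is not mutated
    }
  }
  return merged;
}
```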

Persistence: the moment becomes an entity

Once moments are constructed and clustered, they get persisted. This is the step most pipelines skip — they treat AI output as transient and re-generate it every time the user runs the same query. That is wasteful and inconsistent. The same query at noon and the same query at midnight should return the same moments. They should also be editable.

A moment in the database carries the fields above plus provenance: which query produced it, which model version, which embedding model, when it was generated. The user can edit any moment — trim the range, fix the summary, reassign the topic — and the edit is preserved. Subsequent queries that surface the same chunk return the user's edited version, not a fresh AI generation. The system learns from corrections without retraining.

The persistence model is roughly three tables. Moments holds the moment entities. Topics holds the clusters. A foreign key joins them. A third table tracks which queries produced which moments, so the system can show the user "this clip was generated from your search for 'pricing'" and recover the context if they want to refine the query.
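As a rough illustration, the rows and the "edited version wins" rule might be typed like this. Column names are hypothetical, and sourceChunkId in particular is an assumption: the article says later queries that surface the same chunk return the edited moment, which implies some link from a moment back to its chunk, but does not say how it is stored.

```typescript
interface TopicRow {
  topicId: string;
  label: string;
  description: string;
}

interface MomentRow {
  momentId: string;
  videoId: string;
  topicId: string; // foreign key into the topics table
  sourceChunkId: string; // assumption: each moment remembers the chunk it came from
  start: number;
  end: number | null; // null for marks
  summary: string;
  edited: boolean;
  modelVersion: string; // provenance
  embeddingModel: string; // provenance
  generatedAt: string; // ISO 8601 timestamp
}

// Link table: which query produced which moment.
interface QueryMomentRow {
  queryText: string;
  momentId: string;
}

// For each freshly generated moment, return the stored one for the same chunk if it exists,
// preferring an edited row when several are stored.
function preferStored(fresh: MomentRow[], stored: MomentRow[]): MomentRow[] {
  const byChunk = new Map<string, MomentRow>();
  for (const row of stored) {
    const existing = byChunk.get(row.sourceChunkId);
    if (existing === undefined || (row.edited && !existing.edited)) {
      byChunk.set(row.sourceChunkId, row);
    }
  }
  return fresh.map((row) => byChunk.get(row.sourceChunkId) ?? row);
}
```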

The persistence step also enables the features that justify the whole architecture: highlight reel generation, course outline extraction, automatic chaptering. All of them read moments. None of them care whether the moment came from a query, a manual mark, or an edit.

UX challenges that decide whether the feature is loved or tolerated

Generating moments is the engineering problem. Presenting them is the product problem. Four patterns matter more than the rest.

Confidence indicators. Not every moment is equally certain. The retrieval might have surfaced a chunk that matched on embedding but is a stretch semantically. The model might have produced a moment with a content anchor that does not align cleanly to a transcript segment. The system knows when this happens. The UI should too. A confidence score of 0.92 is rendered without comment. A score of 0.7 is rendered with a subtle visual marker — a lighter outline, a "less certain" badge, a slight de-emphasis. The user is told what the system thinks, not just what the system found.
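In code, the mapping from score to treatment can be a single threshold check. The cutoff of 0.8 below is an assumed value chosen only because it sits between the two scores mentioned above; the tier names are hypothetical.

```typescript
type ConfidenceTier = "plain" | "less-certain";

// Assumption: one cutoff separates "render without comment" from "render with a marker".
function confidenceTier(confidence: number, threshold = 0.8): ConfidenceTier {
  return confidence >= threshold ? "plain" : "less-certain";
}
```

The frontend would then map "less-certain" to whichever visual treatment the design calls for, such as a lighter outline or a badge.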

Overlapping moments. A clip from 720 to 786 and another from 760 to 840 overlap by 26 seconds. They are not the same moment, but they are not entirely distinct either. Showing them as two separate items in the list creates the impression that the system found two moments where there is really one and a half. The UI groups overlapping moments visually — a tighter spacing, a bracket that spans both, a tooltip that explains the relationship. The user sees the structure of the retrieval, not a flat list of arbitrary cuts.
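Finding which moments to bracket together is an interval-grouping pass. The sketch below is one plausible way to do it, with hypothetical names; it treats clips that merely touch (one ends exactly where the next starts) as separate, which is an assumption.

```typescript
interface Span {
  start: number;
  end: number;
}

// Groups moments whose time ranges overlap, so the UI can draw one bracket per group.
function groupOverlapping<T extends Span>(moments: T[]): T[][] {
  const sorted = moments.slice().sort((a, b) => a.start - b.start);
  const groups: T[][] = [];
  let groupEnd = -Infinity; // latest end time seen in the current group
  for (const m of sorted) {
    const current = groups[groups.length - 1];
    if (current !== undefined && m.start < groupEnd) {
      current.push(m);
      groupEnd = Math.max(groupEnd, m.end);
    } else {
      groups.push([m]);
      groupEnd = m.end;
    }
  }
  return groups;
}
```

With the clips from the example, 720 to 786 and 760 to 840 land in one group, and a clip from 1145 to 1212 would stand alone.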

Near-duplicate detection that survives editing. When a user edits the summary of a moment, the system needs to keep the moment distinct from a near-duplicate it might otherwise collapse. Edited moments carry an "edited" flag that suppresses automatic deduplication. The system trusts the user's edit over its own clustering.
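One way to honor the flag is to set edited moments aside before deduplication runs and add them back afterwards. This is a sketch under that assumption; dedupeRespectingEdits is a hypothetical helper, and the dedupe argument stands in for whatever near-duplicate routine the system uses.

```typescript
interface Editable {
  start: number;
  edited: boolean;
}

// Edited moments bypass deduplication entirely; only untouched ones are candidates.
function dedupeRespectingEdits<T extends Editable>(
  moments: T[],
  dedupe: (candidates: T[]) => T[],
): T[] {
  const kept = moments.filter((m) => m.edited);
  const candidates = moments.filter((m) => !m.edited);
  return [...kept, ...dedupe(candidates)].sort((a, b) => a.start - b.start);
}
```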

What to show when retrieval is uncertain. A query that returns no moments above a confidence threshold should not return an empty list with no explanation. It should return a structured uncertainty state: "The video does not appear to discuss pricing in detail. The closest mentions are these two short moments, both with low confidence." The user understands what the system did and did not find. They do not assume the feature is broken.
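A structured uncertainty state can be expressed as a second variant of the search result rather than as an empty list. The sketch below is illustrative: the type names, the 0.6 threshold, the limit of two closest moments, and the message wording are all assumptions.

```typescript
interface Scored {
  start: number;
  end: number;
  confidence: number;
}

type SearchResult =
  | { status: "found"; moments: Scored[] }
  | { status: "uncertain"; message: string; closest: Scored[] };

// If nothing clears the threshold, return an explanation plus the best low-confidence candidates.
function buildResult(topic: string, moments: Scored[], threshold = 0.6, maxClosest = 2): SearchResult {
  if (moments.some((m) => m.confidence >= threshold)) {
    return { status: "found", moments };
  }
  const closest = moments
    .slice()
    .sort((a, b) => b.confidence - a.confidence)
    .slice(0, maxClosest);
  const message =
    closest.length > 0
      ? `The video does not appear to discuss ${topic} in detail. The closest mentions are shown below with low confidence.`
      : `The video does not appear to discuss ${topic}.`;
  return { status: "uncertain", message, closest };
}
```

Because the result is a discriminated union, the frontend has to handle the uncertain case explicitly instead of rendering a blank list by accident.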

The principle that compounds

Treating the moment as the unit of work changes what the rest of the system does. Search returns moments. Editing modifies moments. Analytics aggregate moments. Highlight reels assemble moments. Course outlines map document sections to moments. The retrieval pipeline ingests video once, indexes once, and then every feature is a different way of producing or consuming moments over the same chunks.

That is the payoff. The unit of work is small enough to model precisely, structured enough to render reliably, and persistent enough to learn from. The model writes JSON, not prose. The frontend renders a clip list, not a paragraph. The user clicks play, and the video starts on the speaker's word, ends at the end of a sentence, and lasts exactly long enough to cover the topic they asked about.

That is what an AI video feature looks like when it ships. Not a summary of what the video probably contains. A list of moments the user can press play on.