Building AI-Powered Topic Generation for Long-Form Video

The two-hour podcast with no chapters

A user uploads a 118-minute conversation between two researchers. The transcript is 22,000 words. The video has no chapters, no show notes, no description longer than a sentence. Somebody asks the obvious question: "What is this episode actually about?"

A naive answer is a single paragraph summary. A slightly better answer is a list of timestamps every ten minutes. Neither is what the user wants. The user wants topics — the actual themes the conversation covered, in the order they were covered, with the ability to jump to each one. They want a table of contents that the producers never built.

Long-form video has this property by default. Sermons, lectures, podcasts, recorded courses, training sessions, panel discussions, founder interviews. The content is structured in the speaker's head and unstructured on disk. Generating that structure after the fact is a topic extraction problem, and topic extraction over a long, drifting, self-referential talk is not the same problem as topic extraction over a tidy article.

This is what we solve. The pattern we deploy uses semantic clustering of transcript chunks, LLM-assisted naming, and a clean separation between the moment layer and the topic layer. The result is a structured topic list the user can navigate and the system can search.

Why long-form video resists naive chaptering

Articles have headings. Books have chapters. Long-form video has neither, and the workarounds that approximate them break in predictable ways.

Talks drift. A speaker starts on the value of prayer, moves into a story about their childhood, returns to prayer, then connects it to leadership. A chapter every ten minutes catches none of these transitions. A chapter every paragraph break, if the transcription model emits them, catches too many. The boundaries the user cares about are semantic, not temporal.

Topics return. The most useful structural insight about a long talk is often "the speaker raised this idea at minute 12 and came back to it at minute 87." A linear chapter list cannot express this, because it forces every topic into a single contiguous range. The result is either a chapter list that lies (collapsing two appearances into one block) or a fragmented one that names the same topic twice without acknowledging the connection.

Speakers self-reference. "As I said earlier," "going back to the point about pricing," "this connects to what we discussed twenty minutes ago." These are signals the system should use, not ignore. They are also exactly the signals a timestamp-based chaptering pass throws away.

Topics are not summaries of contiguous text. A topic is a coherent theme that may span discontinuous regions of the video. The data model has to allow this. Most chaptering implementations assume continuity, and the assumption is wrong.

The starting point for a system that handles long-form video is to accept that topics are an abstraction layer above timestamps. Build that abstraction, and the rest of the architecture follows.

The pipeline

The end-to-end shape is short:

Transcript chunks → Embeddings → Semantic clustering → LLM-assisted naming
                                                                ↓
                                              Topic entities + grouped Moments
                                                                ↓
                                                       Topic summaries

The first two stages are already done if the system is built on the retrieval architecture from From Transcript to Intelligence. The chunks and embeddings exist. Topic generation reuses them. There is no second indexing pass.

Clustering. The chunk embeddings from the existing index get clustered. We use HDBSCAN over the chunk vectors — density-based clustering picks up clusters of varying size and tolerates noise points, which matches how real talks are shaped. Some topics span fifteen minutes, others span ninety seconds. Some chunks belong to no topic in particular (transitions, asides, banter), and HDBSCAN labels these as noise rather than forcing them into a cluster. K-means would have to be told the number of topics in advance, and that number is exactly what the system is trying to discover.

The clustering operates on chunks, not on raw transcript segments. The chunk is the right granularity — 30 to 90 seconds of coherent speech, which is roughly the unit at which topics shift. Segment-level clustering would be too noisy; clustering the whole transcript would defeat the purpose.

Output of clustering. The result is a list of cluster IDs, each carrying the chunk IDs assigned to it, plus a noise bucket. The clustering step does not name topics. It identifies which chunks belong together.
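
A minimal sketch of this step, assuming scikit-learn 1.3 or later for its HDBSCAN class. The function names, the min_cluster_size value, the cluster ID format, and the choice to L2-normalize vectors (so that Euclidean distance tracks cosine distance) are illustrative assumptions, not the settings of any particular deployment.

```python
import math
from collections import defaultdict


def l2_normalize(vec):
    """Scale a vector to unit length so Euclidean distance tracks cosine distance."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def cluster_chunks(vectors, min_cluster_size=3):
    """Return one label per chunk vector; -1 marks noise (assumed parameters)."""
    # Imported lazily so group_labels below works without scikit-learn installed.
    from sklearn.cluster import HDBSCAN

    unit = [l2_normalize(v) for v in vectors]
    return HDBSCAN(min_cluster_size=min_cluster_size).fit_predict(unit).tolist()


def group_labels(chunk_ids, labels):
    """Turn per-chunk labels into cluster records plus a noise bucket."""
    members = defaultdict(list)
    noise = []
    for chunk_id, label in zip(chunk_ids, labels):
        if label == -1:
            noise.append(chunk_id)
        else:
            members[label].append(chunk_id)
    clusters = [
        {"clusterId": f"c_{label + 1:03d}", "chunkIds": ids, "size": len(ids)}
        for label, ids in sorted(members.items())
    ]
    return {"clusters": clusters, "noise": noise}
```

Nothing here names anything. The function only records which chunks travel together and which ones belong nowhere.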

A clustering output looks like this:

{
  "videoId": "vid_4218",
  "clusters": [
    {
      "clusterId": "c_001",
      "chunkIds": ["vid_4218_c_003", "vid_4218_c_004", "vid_4218_c_005",
                   "vid_4218_c_028", "vid_4218_c_029"],
      "spans": [{ "start": 184, "end": 412 }, { "start": 1620, "end": 1745 }],
      "size": 5
    },
    {
      "clusterId": "c_002",
      "chunkIds": ["vid_4218_c_011", "vid_4218_c_012", "vid_4218_c_013"],
      "spans": [{ "start": 720, "end": 1080 }],
      "size": 3
    }
  ],
  "noise": ["vid_4218_c_001", "vid_4218_c_017", "vid_4218_c_034"]
}

The spans field already captures the property a chapter list cannot: cluster c_001 appears twice in the video, once between minutes 3 and 7, once between minutes 27 and 29. The data model carries this naturally because a topic is a set of chunks, not a range.

LLM-assisted naming. Clustering produces unnamed groups. Naming is a reasoning task. For each cluster, we send the model the representative chunks (the medoid of the cluster, plus a few of its closest neighbors) and ask for a short topic name and a one-line description. The model never sees the whole transcript. It sees a narrow, dense window — the highest-similarity chunks for that cluster — and produces a name grounded in that material.
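
Picking the representative chunks can be sketched as follows. This is one plausible reading of "medoid plus closest neighbors", assuming cosine distance over non-zero embedding vectors; the function names and the default k are hypothetical.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def representative_chunks(vectors, k=3):
    """Return indices of the medoid followed by its k nearest cluster members."""
    n = len(vectors)
    dist = [[cosine_distance(vectors[i], vectors[j]) for j in range(n)] for i in range(n)]
    # The medoid is the member with the smallest total distance to all others.
    medoid = min(range(n), key=lambda i: sum(dist[i]))
    neighbors = sorted(
        (j for j in range(n) if j != medoid), key=lambda j: dist[medoid][j]
    )
    return [medoid] + neighbors[:k]
```

The pairwise matrix is quadratic in cluster size, which is acceptable for clusters of a few dozen chunks. Only the text of the returned chunks goes into the naming prompt.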

The prompt is strict about format. It returns a JSON object with name, description, and an optional confidence field that lets downstream code filter out weakly named clusters. Loose prompts at this stage produce topic names like "discussion of various points," which is the kind of output that survives a demo and embarrasses the team in production.
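
On the receiving side, strictness means rejecting anything that does not match the contract. A minimal sketch, assuming the model reply arrives as a raw string; the confidence threshold and the list of generic names are made-up values for illustration.

```python
import json

# Hypothetical deny-list of names too vague to ship.
GENERIC_NAMES = {"discussion of various points", "miscellaneous", "general discussion"}


def parse_topic_name(raw, min_confidence=0.6):
    """Validate the model's JSON reply; return None if it should be discarded."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    name = data.get("name")
    description = data.get("description")
    if not isinstance(name, str) or not isinstance(description, str):
        return None
    name = name.strip()
    if not name or name.lower() in GENERIC_NAMES:
        return None
    confidence = data.get("confidence")  # optional field
    if isinstance(confidence, (int, float)) and confidence < min_confidence:
        return None
    return {"name": name, "description": description.strip(), "confidence": confidence}
```

A cluster whose reply comes back as None can be retried or left unnamed. Either is better than shipping a placeholder name.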

Grouping into Moments. A topic is not a list of raw chunks for the user. It is a list of moments, where each moment is a contiguous span of video belonging to that topic. The moment construction step walks the cluster's chunks in timestamp order, merges adjacent chunks into a single moment, and emits a new moment whenever the gap between consecutive chunks exceeds a threshold (typically 60 seconds). The output is the Moment entity described in AI-Assisted Video Moments — a topic's chunk set, rendered into the playable units the UI actually consumes.
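
The merge rule is small enough to sketch directly. This assumes each chunk is a dict with start and end in seconds; the function name and the dict shape are illustrative, and the 60-second default simply mirrors the typical threshold mentioned above.

```python
def build_moments(chunks, max_gap=60.0):
    """Merge a cluster's chunks into contiguous moments.

    chunks: iterable of dicts with "start" and "end" in seconds.
    max_gap: a gap longer than this starts a new moment.
    """
    moments = []
    for chunk in sorted(chunks, key=lambda c: c["start"]):
        if moments and chunk["start"] - moments[-1]["end"] <= max_gap:
            # Close enough to the previous chunk: extend the current moment.
            moments[-1]["end"] = max(moments[-1]["end"], chunk["end"])
        else:
            moments.append({"start": chunk["start"], "end": chunk["end"]})
    return moments
```

Fed five chunks that sit in two separate stretches of a video, it returns two moments, one per stretch.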

The Topic entity

The data model is small. A Topic owns a name, a summary, an ordered list of Moments, and metadata about its provenance.

{
  "topicId": "tpc_4218_002",
  "videoId": "vid_4218",
  "name": "Corporate prayer",
  "description": "Framing prayer as a corporate, not just individual, act.",
  "summary": "The speaker argues that prayer in scripture is rarely depicted as a solitary act. The early church gathered to pray; major decisions in Acts are prefaced by communal prayer. The implication for modern congregations is that personal devotion, while necessary, is incomplete without the corporate dimension.",
  "moments": [
    { "momentId": "m_4218_011", "start": 720, "end": 786 },
    { "momentId": "m_4218_012", "start": 1145, "end": 1212 },
    { "momentId": "m_4218_029", "start": 4980, "end": 5102 }
  ],
  "confidence": 0.87,
  "generatedAt": "2026-05-10T19:14:00Z",
  "modelVersion": "claude-3-5-sonnet-20250214"
}

Three properties of this schema are doing real work.

The moments field is an ordered list, not a single range. The third moment at 4980 to 5102 is the speaker returning to the topic roughly an hour and ten minutes after first raising it. A timestamp-based chapter list could not represent this. A topic-based model represents it naturally.

The summary field is generated from the moments, not from the raw transcript. The summarization step receives the moment summaries (which were generated when the moments were created) and produces a topic-level summary on top of them. This is a deliberately hierarchical pattern — the same one described in the cross-video search section of From Transcript to Intelligence — and it keeps the input to the reasoning model narrow at every stage.

The modelVersion field is provenance. When the team upgrades the naming or summarization model, the system can re-run topic generation on existing videos and know which records were produced by which model. This is the difference between an AI feature that ages well and one that quietly degrades because nobody knows which outputs are stale.
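
With provenance on every record, finding stale output is a one-line filter. A sketch, assuming topics are available as dicts shaped like the example above; the function name is hypothetical.

```python
def stale_topic_ids(topics, current_model_version):
    """Return IDs of topics produced by a model other than the current one."""
    return [
        topic["topicId"]
        for topic in topics
        if topic.get("modelVersion") != current_model_version
    ]
```

Records with no modelVersion at all count as stale, which is the safe default when provenance is missing.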

Topic summaries are summaries of moments, not of transcripts

The most common mistake at this stage is to summarize the raw transcript chunks for each topic. It produces summaries that sound right and miss the point.

The cluster's raw chunks contain the topic, but they also contain noise — the speaker's verbal tics, false starts, side comments, the listener's laughter, the host interjecting. A summary of the chunks gives weight to all of it. A summary of the moments — which were already produced by the moment-extraction step and already filtered for content — concentrates on what the speaker actually said about the topic.

The hierarchy is the architecture:

Chunks (raw)
   ↓
Moments (chunks filtered + summarized)
   ↓
Topics (moments grouped by cluster + named + summarized over the moment summaries)

Each layer is a narrower, denser view of the one below it. The reasoning model never sees raw chunks during topic summarization. It sees the moment summaries that already passed through one round of distillation. The summary that emerges is correspondingly tighter.
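
In code, the rule shows up as a function signature that has no way to receive raw chunks. A sketch of the prompt assembly, assuming each moment dict carries a summary field written at moment-creation time; the prompt wording and function name are illustrative.

```python
def build_topic_summary_prompt(topic_name, moments):
    """Assemble summarizer input from moment summaries only, never raw chunks."""
    ordered = sorted(moments, key=lambda m: m["start"])
    lines = [f"Topic: {topic_name}", "Moment summaries, in playback order:"]
    for i, m in enumerate(ordered, start=1):
        lines.append(f"{i}. [{m['start']}s-{m['end']}s] {m['summary']}")
    lines.append(
        "Write one paragraph summarizing what the speaker said about this topic."
    )
    return "\n".join(lines)
```

The returned string is what goes to the reasoning model. Its size grows with the number of moments, not with the length of the transcript.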

Failure modes

Three failure modes show up in production. Each one has a fix that is structural, not a prompt tweak.

Topic sprawl. The clustering produces 40 topics from a 90-minute lecture. Most are tiny — two or three chunks each — and the user is buried in a topic list longer than the chapter list of a book. Topic sprawl usually means the HDBSCAN parameters are too permissive (min_cluster_size set too low) or the embeddings are picking up superficial similarity (filler phrases, common transitions). The fix is a minimum cluster size enforced at the data model layer — clusters below the threshold get folded into the noise bucket and rendered as ungrouped moments rather than promoted to topics.
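
The fold is a filter over the clustering output. A sketch, assuming the clusters/noise shape shown earlier; the threshold of four chunks is an arbitrary example value, not a recommendation.

```python
def fold_small_clusters(clustering, min_size=4):
    """Demote clusters below min_size into the noise bucket."""
    kept = []
    noise = list(clustering["noise"])
    for cluster in clustering["clusters"]:
        if len(cluster["chunkIds"]) >= min_size:
            kept.append(cluster)
        else:
            noise.extend(cluster["chunkIds"])
    # Return a new dict so the original clustering output stays untouched.
    return {**clustering, "clusters": kept, "noise": sorted(noise)}
```

Because the check lives outside the clustering call, it keeps holding when somebody retunes the HDBSCAN parameters later.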

Topic collapse. The opposite failure. The clustering produces three topics from a 90-minute lecture that actually covered seven. This usually means the embeddings are too coarse (for example, a general-purpose model such as Titan at default settings, with no domain adaptation), so chunks about distinct sub-themes look nearly identical in vector space. The fix is two-pass clustering: cluster once, then recluster the largest cluster's chunks against each other to find sub-topics. Hierarchical structure surfaces naturally when the system is willing to look for it.
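
The second pass can be sketched independently of the clustering algorithm. Here cluster_fn is a hypothetical callable that takes chunk IDs and returns one label per chunk, with -1 for noise (for instance, a tighter HDBSCAN run restricted to those chunks). The sub-cluster ID format is also an assumption.

```python
def recluster_largest(clusters, cluster_fn):
    """Split the largest cluster with a second clustering pass.

    clusters: dict mapping cluster ID -> list of chunk IDs.
    Returns (new_clusters, leftover), where leftover holds chunks the
    second pass labeled as noise.
    """
    if not clusters:
        return {}, []
    largest = max(clusters, key=lambda cid: len(clusters[cid]))
    chunk_ids = clusters[largest]
    labels = cluster_fn(chunk_ids)
    sub, leftover = {}, []
    for chunk_id, label in zip(chunk_ids, labels):
        if label == -1:
            leftover.append(chunk_id)
        else:
            sub.setdefault(label, []).append(chunk_id)
    if len(sub) < 2:
        # No finer structure found: keep the first-pass result as it was.
        return dict(clusters), []
    result = {cid: ids for cid, ids in clusters.items() if cid != largest}
    for n, (_, ids) in enumerate(sorted(sub.items()), start=1):
        result[f"{largest}.{n}"] = ids
    return result, leftover
```

If the second pass finds fewer than two sub-clusters, the function leaves the first pass alone, so a cluster that really is one large topic is not split artificially.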

Topic drift between similar talks. A speaker gives the same 30-minute talk twice on the same tour, with small variations. The first run produces a topic named "Pricing models for small businesses." The second run, on similar but not identical content, produces "How to charge for software." Same topic. Different names. The system cannot link them.

The fix is a canonical topic store, populated over time and consulted at naming time. After the LLM proposes a topic name, the system embeds the name plus the cluster's representative chunks and runs a nearest-neighbor search against the existing topic store. If a sufficiently similar canonical topic exists, the new topic inherits its name and links to it. If not, the new topic gets added to the store as a fresh canonical entry. Over months, the topic taxonomy stabilizes without anybody curating it by hand.
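
A minimal sketch of the lookup, assuming the store is an in-memory list and the name-plus-chunks embedding has already been computed. The 0.85 threshold and the linear scan are placeholders; a production store would sit behind a vector index.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def resolve_canonical(name, vector, store, threshold=0.85):
    """Link to an existing canonical topic or register a new one.

    store: list of {"name", "vector"} dicts, mutated in place.
    Returns (canonical_name, created).
    """
    best, best_sim = None, -1.0
    for entry in store:
        sim = cosine_similarity(vector, entry["vector"])
        if sim > best_sim:
            best, best_sim = entry, sim
    if best is not None and best_sim >= threshold:
        return best["name"], False
    store.append({"name": name, "vector": vector})
    return name, True
```

In the two-talks scenario, the second run's "How to charge for software" resolves to the existing "Pricing models for small businesses" entry, provided the two embeddings clear the threshold.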

Why this beats timestamp-based chaptering

Three properties make this approach more useful than the chapter-every-10-minutes pattern that many video platforms ship.

Topics can be discontinuous. A speaker who returns to an idea later in the talk shows up as a topic with multiple moments, not as two unrelated chapters.

Topics carry semantic identity. The system can search by topic — "find every video in the library that discussed corporate prayer" — which is a vector search against the topic store, not a fragile keyword match against chapter titles.

Topics survive editing. If the underlying chunks change (because the transcript was re-processed, or the moment extraction was improved), rerunning the pipeline regenerates the topics and their moments from the new chunks. Hand-placed chapter timestamps would have to be re-marked. Topics regenerate automatically.

Where this lives in the product

A topic list is the table of contents the producer never built. The same data structure feeds three different surfaces.

Chapter list in the player. Topics render as a navigable side panel. Clicking a topic shows its moments. Clicking a moment seeks the player. The list is generated, not authored, and the user does not need to know.

Search-by-topic across the library. A user looks for "every podcast episode that covered remote work culture." The query embeds, runs against the topic store, and returns a list of topic IDs across the entire library. Each topic links back to its moments, which link back to their videos. The user gets a cross-library result set without anybody having tagged anything.

Automated show notes. The topic list, with its summaries, is already structured prose. A small reasoning step formats it into a show-notes document. The producer reviews and ships it. What used to take an editor 45 minutes per episode takes five.

Topics are the abstraction layer that makes long-form video usable. Moments are what the user plays; topics are how the user navigates. Build them as separate entities, generate them from clusters rather than from timestamps, and summarize them from the moments below them rather than from the transcript above them. The result is a product that turns a two-hour talk into a thing the user can actually find their way around — which is what every team that ships long-form video discovers, eventually, they were supposed to be building the whole time.