Scaling and Cost Optimization for AI Video Pipelines

The cost surprise

A team builds a video search feature for a client. The pilot runs on twenty hours of footage. The query path costs fractions of a cent per request. Somebody multiplies that number by expected query volume, adds a comfortable margin, and ships a budget to finance.

The system goes to production. The corpus is not twenty hours. It is eight thousand hours of archived material the client wants searchable from day one, plus two hundred new hours every month. The retrieval bill comes in roughly where the team predicted. The ingestion bill (transcription plus embedding), which nobody modeled carefully, lands at five to ten times the entire approved budget.

This is the most common cost mistake we see in video AI work. The team optimized for the wrong axis. Retrieval is a per-query cost, and per-query costs are visible — every search shows up in logs. Ingestion is a per-asset cost, and per-asset costs are invisible until somebody runs the full corpus through them. The architecture we walked through in From Transcript to Intelligence tells you how the pipeline works on one video. This article is about what changes when there are tens of thousands.

Where the cost actually lives

The math is simple once you write it down.

A one-hour video produces roughly 8,000 to 12,000 words of transcript, which chunks into 100 to 200 retrieval units of 200 to 500 tokens each. Embedding those chunks costs roughly one to three cents per hour of video on Titan, and three to eight cents per hour on Cohere Embed v3. Transcription is separate and depends on provider — call it ten to thirty cents per hour for production-grade accuracy.

Multiply by ten thousand hours.

Transcription: $1,000 to $3,000. Embedding: $100 to $800. Storage: small. Reasoning on a query: a fraction of a cent. Vector search on a query: an even smaller fraction of a cent.

The shape of the bill is now visible. Ingestion is a one-time spike in the thousands of dollars per ten thousand hours of corpus. Retrieval is a recurring cost of a fraction of a cent per query. A team that budgeted for the retrieval cost and got blindsided by the ingestion cost was reading the wrong number on the right page.
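The arithmetic above is easy to keep in a small script so it can be rerun whenever the corpus estimate changes. This is a minimal sketch: the rate ranges are the ones quoted in this article rather than current price-list values, and the names (RateCentsPerHour, ingestion_cost_usd, corpus_hours_after) are ours, chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RateCentsPerHour:
    """A low/high price range in cents per hour of video."""
    low: int
    high: int


# Ranges quoted in the text above; real prices depend on provider and date.
TRANSCRIPTION = RateCentsPerHour(10, 30)
EMBEDDING_TITAN = RateCentsPerHour(1, 3)
EMBEDDING_COHERE = RateCentsPerHour(3, 8)


def ingestion_cost_usd(corpus_hours: int, rate: RateCentsPerHour) -> tuple:
    """Return the (low, high) one-time ingestion cost in dollars."""
    return (corpus_hours * rate.low / 100, corpus_hours * rate.high / 100)


def corpus_hours_after(backlog_hours: int, monthly_hours: int, months: int) -> int:
    """Hours to ingest: the day-one archive plus new uploads since launch."""
    return backlog_hours + monthly_hours * months


if __name__ == "__main__":
    # The scenario from the opening: 8,000 archived hours plus 200 per month.
    hours = corpus_hours_after(8_000, 200, 12)
    for name, rate in [("transcription", TRANSCRIPTION),
                       ("embedding (Titan)", EMBEDDING_TITAN),
                       ("embedding (Cohere)", EMBEDDING_COHERE)]:
        low, high = ingestion_cost_usd(hours, rate)
        print(f"{name}: ${low:,.0f} to ${high:,.0f} for {hours:,} hours")
```

Running the same function against the twenty-hour pilot shows why the pilot never raised the question: the same rates produce a number too small to notice.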

The optimization work follows from this. The places where money is at stake, in order: transcription provider choice, embedding model choice, chunk count, reasoning model on the query path, and — distantly — everything else.

Ingestion that survives ten thousand hours

The ingestion pipeline we deploy is event-driven end to end. The shape:

S3 upload → Transcription job → Chunking → Embedding → Vector write
     ↓             ↓                 ↓           ↓           ↓
   SQS           SQS               SQS         SQS         DLQ

Each stage has its own SQS queue. Each stage has its own dead-letter queue. The whole thing scales because Lambda handles the orchestration and each stage scales independently of the others.

Three things go wrong at this volume that do not go wrong on twenty hours of pilot data.

Individual videos fail, and one failure must not block anything else. A 47-minute file is corrupted at minute 31. The transcription provider returns an error. Without per-stage queues, that one failure cascades: the worker holds the slot, the queue backs up, throughput collapses. With per-stage queues and DLQs, the failed item lands in a queue an operator can inspect, and the other nine hundred ninety-nine videos in the batch are unaffected.

Throttling becomes a structural concern, not an exception. Bedrock has rate limits. Transcription providers have rate limits. At ten thousand hours of ingestion, you will hit them. The pipeline must back off with jitter (the same pattern we covered in Production AI Pipelines on AWS), and the embedding stage in particular must batch: sending one chunk per API call burns through the request-rate quota while leaving most of the token throughput unused.
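For readers who have not seen the backoff pattern, here is a minimal sketch of exponential backoff with "full jitter". ThrottledError is a hypothetical stand-in for whatever throttling exception your provider SDK raises, and the base and cap values are illustrative defaults, not recommendations from any provider.

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical stand-in for a provider's throttling exception."""


def call_with_backoff(fn, max_attempts=6, base=0.5, cap=30.0, sleep=time.sleep):
    """Call fn(), retrying on throttling with exponential backoff and full jitter.

    The delay before retry n is drawn uniformly from [0, min(cap, base * 2**n)],
    so concurrent workers do not retry in lockstep.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except ThrottledError:
            if attempt == max_attempts - 1:
                # Out of retries: re-raise so the message can go to the DLQ.
                raise
            sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```

The sleep function is injectable so the retry logic can be exercised in tests without real waiting.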

Reruns happen. Something will change — the chunker, the embedding model, the metadata schema — and you will need to reprocess. The ingestion pipeline must be idempotent. Each video carries a content hash; each chunk carries a deterministic ID derived from videoId + chunker version + chunk index. Re-running the pipeline over an already-processed video produces the same IDs, so writes are upserts, not duplicates.
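The ID scheme can be sketched in a few lines. The choice of SHA-256, the separator, and the 32-character truncation are our assumptions for illustration; the only property that matters is that the same (videoId, chunker version, chunk index) always yields the same ID.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Hash of the raw video bytes, used to detect already-seen content."""
    return hashlib.sha256(data).hexdigest()


def chunk_id(video_id: str, chunker_version: str, chunk_index: int) -> str:
    """Deterministic chunk ID: same inputs always produce the same ID."""
    key = f"{video_id}|{chunker_version}|{chunk_index}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:32]


def upsert_chunks(store: dict, video_id: str, chunker_version: str, texts: list) -> None:
    """Write chunks keyed by deterministic ID; a rerun overwrites in place."""
    for index, text in enumerate(texts):
        store[chunk_id(video_id, chunker_version, index)] = text
```

Here a plain dict stands in for the vector store. Running upsert_chunks twice over the same video leaves the store the same size, which is the behavior the vector write stage needs.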

Embedding economics

Embedding is the second-largest line on the ingestion bill after transcription, and it is where small choices compound.

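Chunk size. A chunker that produces 300-token chunks generates roughly 50% more chunks than one that produces 450-token chunks. That is a 50% increase in embedding calls, vectors stored, and index size for the same hour of video. Token-priced embedding APIs bill on input tokens, so the embedding charge itself grows only by whatever extra overlap the smaller chunks introduce; the recurring cost shows up in storage and index memory. Chunking is not just a retrieval-quality decision. It is a cost decision.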

The retrieval-quality sweet spot for video transcripts sits between 200 and 500 tokens. Going smaller hurts the bill more than it helps precision. Going larger erodes retrieval quality because chunks start covering multiple topics.

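Batching. Whether an embedding call accepts more than one input depends on the model. Cohere Embed on Bedrock accepts an array of texts per request (up to 96 at the time of writing); Titan text embedding models take a single input per synchronous call, so throughput there comes from parallel requests. Where arrays are supported, sending one chunk per call pays the per-request overhead on every chunk and hits request-rate limits faster. Sending 25 to 100 chunks per call is usually the practical maximum. The cost per chunk does not change, but throughput goes up roughly an order of magnitude, which means a ten-thousand-hour reprocess completes overnight instead of over a week.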
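The batching loop itself is small. In this sketch, embed_batch is a hypothetical callable that wraps whichever embedding API you use and returns one vector per input text; the default batch size of 50 is simply a value inside the 25 to 100 range mentioned above and should be set to your provider's actual limit.

```python
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_all(chunks: Sequence[str],
              embed_batch: Callable[[Sequence[str]], List[List[float]]],
              batch_size: int = 50) -> List[List[float]]:
    """Embed every chunk using one API call per batch instead of per chunk."""
    vectors: List[List[float]] = []
    for batch in batched(chunks, batch_size):
        result = embed_batch(batch)  # hypothetical wrapper around the provider call
        if len(result) != len(batch):
            raise RuntimeError("provider returned a different number of vectors")
        vectors.extend(result)
    return vectors
```

In a real pipeline the embed_batch call is also the natural place to apply backoff, since a throttled batch should be retried as a unit.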

Model choice. Titan Embed is cheaper than Cohere Embed v3, by a factor of roughly two to three. For English-only corpora where retrieval quality is good enough, Titan is the right default. Cohere wins on multilingual content, where its retrieval precision is meaningfully higher and worth the cost. The wrong decision here, multiplied across a ten-thousand-hour corpus, is the difference between a $200 ingestion job and a $700 one.

Caching. If a video gets re-uploaded — same content, different filename — the content hash catches it before anything else runs. If a chunker version is bumped but the embedding model is unchanged, re-embedding is unnecessary for chunks whose text did not change. These are not exotic optimizations. They are bookkeeping that the pipeline needs to do once and then never think about again.
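One way to get the second behavior is to key the embedding cache on the model and the chunk text rather than on the chunk ID, so a chunker bump that leaves a chunk's text unchanged still hits the cache. The sketch below assumes an in-memory dict and a hypothetical embed_fn(model_id, text); a production version would back the cache with a persistent store.

```python
import hashlib


def embedding_cache_key(model_id: str, text: str) -> str:
    """Key depends only on the embedding model and the exact chunk text."""
    digest = hashlib.sha256()
    digest.update(model_id.encode("utf-8"))
    digest.update(b"\x00")  # separator so ("ab", "c") differs from ("a", "bc")
    digest.update(text.encode("utf-8"))
    return digest.hexdigest()


class EmbeddingCache:
    """Calls embed_fn only for (model, text) pairs it has not seen before."""

    def __init__(self, embed_fn):
        self._embed_fn = embed_fn  # hypothetical: embed_fn(model_id, text) -> vector
        self._store = {}
        self.misses = 0

    def get(self, model_id: str, text: str):
        key = embedding_cache_key(model_id, text)
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._embed_fn(model_id, text)
        return self._store[key]
```

Changing the model ID changes every key, which is the desired outcome: a new embedding model invalidates the whole cache, a new chunker version does not.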

Vector index sizing

A pilot OpenSearch index with 50,000 vectors behaves nothing like a production index with 10 million. The configuration that worked on the pilot will quietly stop working.

Shard count. OpenSearch's default shard count is too low for a vector index that will grow into the millions. We size shards so each one holds 1 to 5 million vectors. Beyond that, query latency starts climbing because the HNSW graph per shard gets too large. Below that, you are paying for shard overhead that is not buying you anything.

Replica strategy. One replica per shard is the production minimum. Replicas serve reads, so they directly affect query throughput. For a query-heavy deployment, two replicas per shard is often the right answer. The cost is storage and memory; the gain is parallelism on the read path.

HNSW parameters. The m parameter (graph connectivity) and ef_construction (index build quality) trade build cost for query quality. The defaults are conservative. For a large corpus where retrieval quality matters, increasing m from 16 to 32 and ef_construction from 100 to 200 measurably improves recall — at the cost of slower indexing and larger memory footprint. Worth doing once, on a corpus that will be queried for years.
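Shard count, replicas, and the HNSW parameters all end up in the index creation body. The sketch below builds that body as a plain dict. The field name embedding, the faiss engine, the l2 space type, and the 3-million-vectors-per-shard target are our assumptions for illustration; check the method definition against the k-NN documentation for your OpenSearch version before using it.

```python
import math


def chunk_index_body(expected_vectors: int, dimension: int,
                     vectors_per_shard: int = 3_000_000, replicas: int = 1,
                     m: int = 32, ef_construction: int = 200) -> dict:
    """Build an index body sized for the corpus you expect, not the pilot."""
    shards = max(1, math.ceil(expected_vectors / vectors_per_shard))
    return {
        "settings": {
            "index": {
                "knn": True,
                "number_of_shards": shards,
                "number_of_replicas": replicas,
            }
        },
        "mappings": {
            "properties": {
                "embedding": {  # hypothetical field name
                    "type": "knn_vector",
                    "dimension": dimension,
                    "method": {
                        "name": "hnsw",
                        "engine": "faiss",   # assumption; lucene is another option
                        "space_type": "l2",  # assumption; match your embedding model
                        "parameters": {"m": m, "ef_construction": ef_construction},
                    },
                }
            }
        },
    }
```

The primary shard count is fixed when the index is created, which is one more reason to size from the eighteen-month corpus estimate rather than the current one.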

Memory sizing. This is where teams burn money quietly. OpenSearch holds the HNSW graph in memory for low-latency search. If the index does not fit, query latency degrades sharply. The rule of thumb: budget roughly 4 bytes per dimension per vector for float32 vectors, plus about 10% overhead and a few bytes per graph link. Ten million 1024-dimension vectors is roughly 45 to 50 GB of memory just for the graph; storing vectors at 16-bit precision, where the engine supports it, halves that to roughly 20 to 25 GB. Right-sizing the cluster up front is cheaper than discovering you under-provisioned three months in.
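The estimate depends heavily on how many bytes each dimension is stored in, so it is worth computing rather than remembering. The sketch below follows the shape of the estimate in the OpenSearch k-NN documentation for HNSW, 1.1 × (bytes per dimension × dimension + 8 × m) bytes per vector, with bytes per dimension exposed as a parameter (4 for float32, 2 for 16-bit storage). Treat the result as a planning number, not a guarantee.

```python
def hnsw_graph_gb(vector_count: int, dimension: int, m: int = 16,
                  bytes_per_dimension: int = 4, overhead: float = 1.1) -> float:
    """Approximate native memory for an HNSW graph, in decimal gigabytes."""
    bytes_per_vector = overhead * (bytes_per_dimension * dimension + 8 * m)
    return vector_count * bytes_per_vector / 1e9


if __name__ == "__main__":
    for label, width in [("float32", 4), ("16-bit", 2)]:
        gb = hnsw_graph_gb(10_000_000, 1024, m=16, bytes_per_dimension=width)
        print(f"10M x 1024-d vectors, {label}: about {gb:.1f} GB")
```

Replicas multiply this figure, since each replica holds its own copy of the graph in memory.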

Retrieval latency at scale

The single biggest performance win at scale is not in vector search. It is in what runs before vector search.

Metadata filtering before kNN. A query that searches "all chunks in the corpus" is doing vector math against millions of candidates. A query that searches "chunks from this user's accessible videos, in this date range, in this language" might be doing vector math against ten thousand candidates. The latency difference is roughly an order of magnitude. The architecture decision is to make filtering cheap by indexing the right metadata fields with the right types, and to make sure every query carries the most selective filter the use case allows.

This is also the security boundary. As we noted in From Transcript to Intelligence, permissions live on the chunk and are enforced as filters at query time. At scale, that filter does double duty: it enforces access and it dramatically narrows the search space. A query whose filter reduces the candidate set by 100x can return in 50 ms instead of 500 ms.
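A filtered kNN query can be sketched as a request body with the filter placed inside the knn clause, which recent OpenSearch versions support for the Lucene and Faiss engines. The field names here (embedding, video_id, language, recorded_at) are hypothetical; substitute whatever your chunk schema uses.

```python
def filtered_knn_query(query_vector, k, allowed_video_ids, language,
                       date_from, date_to) -> dict:
    """Build a kNN search body whose filter doubles as the access check."""
    return {
        "size": k,
        "query": {
            "knn": {
                "embedding": {  # hypothetical vector field name
                    "vector": list(query_vector),
                    "k": k,
                    "filter": {
                        "bool": {
                            "filter": [
                                # Access control: only videos this user may see.
                                {"terms": {"video_id": list(allowed_video_ids)}},
                                {"term": {"language": language}},
                                {"range": {"recorded_at": {"gte": date_from,
                                                           "lte": date_to}}},
                            ]
                        }
                    },
                }
            }
        },
    }
```

Building the body in one function makes it harder for a code path to issue an unfiltered query by accident, which matters when the filter is also the permission check.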

Reranking budget. Reranking adds 150 to 400 ms of latency. At scale, this is the single largest contributor to query latency after the model call. It is also the largest contributor to retrieval quality. The right answer is to keep reranking on for queries that matter, and to skip it for cheap traversal queries (autocomplete, related-items lookups) where the precision gain is not worth the latency.

Hybrid search overhead. Combining BM25 and kNN in one query adds modest latency, usually under 50 ms, and routinely improves recall on edge cases that pure semantic search misses (proper nouns, identifiers, verbatim phrases). The overhead is small enough that hybrid is the right default once the corpus is large enough for those misses to become visible.

Hot data and cold data

A production video corpus has a long tail. Some videos get queried daily. Most videos get queried once a quarter, if at all.

Treating hot and cold data identically is wasteful. The pattern we deploy:

Hot tier lives in the primary OpenSearch cluster, fully replicated, fully in-memory.
Warm tier lives in the same cluster but on a lower-resource node group with fewer replicas. Queries against warm chunks are slightly slower but cost less to host.
Cold tier lives in S3 as serialized vector files, optionally with a small Postgres index for metadata lookup. A query that needs cold data triggers a rehydration job, which is acceptable because cold queries are rare.

The classifier for which tier a video belongs to is not exotic. Last query time, query frequency over the last 30 days, age of the asset. Videos get demoted to warm after no queries for 90 days, demoted to cold after 365. Promotions happen automatically when a cold video is queried more than twice in a week.
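The demotion and promotion rules fit in one function. This sketch uses the thresholds stated above; two details are our assumptions because the rules do not specify them: a promoted cold video goes straight to hot, and a warm video queried within the last 90 days also returns to hot.

```python
from datetime import datetime, timedelta

WARM_AFTER = timedelta(days=90)
COLD_AFTER = timedelta(days=365)


def classify_tier(current_tier: str, last_queried_at: datetime,
                  queries_last_7_days: int, now: datetime) -> str:
    """Return 'hot', 'warm', or 'cold' for a video."""
    idle = now - last_queried_at
    if idle >= COLD_AFTER:
        return "cold"
    if current_tier == "cold":
        # Promotion needs more than two queries in a week (assumed: to hot).
        return "hot" if queries_last_7_days > 2 else "cold"
    if idle >= WARM_AFTER:
        return "warm"
    return "hot"
```

A single stray query against a cold video triggers rehydration for that query but does not move the video, which keeps one-off lookups from churning the hot tier.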

The savings are meaningful. A corpus of 10,000 hours where 90% of queries hit 10% of the videos can run on a cluster sized for the hot tier plus a small warm overflow — often a third to half the cost of provisioning for the full corpus at hot-tier latency.

Reprocessing without downtime

The embedding model will change. The chunker will change. The metadata schema will change. At small scale, you take downtime, you reprocess, and you move on. At ten thousand hours, downtime is not an option.

The pattern is a versioned index alias.

The live system queries an alias, say chunks-current. Behind that alias is an actual index, chunks-v3. When a new embedding model arrives, ingestion starts writing to chunks-v4 in parallel: new uploads are written to both v3, so they stay searchable, and v4. The backlog of existing videos gets reprocessed into v4 by a backfill job that runs as a background workload, throttled to leave headroom for live ingestion.

When chunks-v4 is fully populated and validated against a sample of canonical queries, the alias swap is atomic: chunks-current now points at v4, and the old index is kept for a rollback window and eventually deleted.
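The swap is atomic because both actions travel in a single request to the _aliases endpoint. A minimal sketch of the request body, plus a simple validation gate, follows. The recall threshold and the shape of the validation results are assumptions for illustration, not part of any OpenSearch API.

```python
def alias_swap_actions(alias: str, old_index: str, new_index: str) -> dict:
    """Body for POST /_aliases: remove and add are applied as one atomic step."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


def ready_to_swap(recall_by_query: dict, min_recall: float = 0.9) -> bool:
    """Hypothetical gate: every canonical query must meet the recall floor."""
    return bool(recall_by_query) and all(
        recall >= min_recall for recall in recall_by_query.values()
    )


if __name__ == "__main__":
    results = {"quarterly earnings call": 0.95, "safety briefing": 0.92}
    if ready_to_swap(results):
        # With opensearch-py this body would be passed to
        # client.indices.update_aliases(body=...).
        print(alias_swap_actions("chunks-current", "chunks-v3", "chunks-v4"))
```

Rollback is the same call with the index names reversed, which is why the old index is worth keeping for a while.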

This is operationally identical to a blue-green deployment for vector indexes. The cost is the period of double-indexing. The gain is that the user-facing system never goes down.

The same pattern handles chunker changes (write new chunks alongside old, swap when done) and schema changes (add the new field to incoming writes, backfill historical records, then update the read path).

Cost monitoring that matters

A scaled video pipeline needs cost visibility at three resolutions.

Per-video. When a video is ingested, the pipeline records audio minutes transcribed, embedding tokens and calls, and storage footprint. The result is a unit-economics number ("this video cost $0.18 to ingest") that lets the team flag anomalies (a video that cost ten times the median, probably because it triggered a retry loop) and that lets product and finance reason about pricing.
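A sketch of the per-video number and the anomaly check follows. The rates passed in the example are placeholders chosen to reproduce the $0.18 figure, not quoted prices, and the ten-times-median rule is the one described above.

```python
from statistics import median


def ingest_cost_usd(audio_hours: float, embedding_tokens: int,
                    transcription_usd_per_hour: float,
                    embedding_usd_per_1k_tokens: float) -> float:
    """Unit cost of ingesting one video, rounded to a hundredth of a cent."""
    transcription = audio_hours * transcription_usd_per_hour
    embedding = embedding_tokens / 1000 * embedding_usd_per_1k_tokens
    return round(transcription + embedding, 4)


def flag_anomalies(cost_by_video: dict, factor: float = 10.0) -> list:
    """Return IDs of videos that cost more than `factor` times the median."""
    if not cost_by_video:
        return []
    typical = median(cost_by_video.values())
    return sorted(vid for vid, cost in cost_by_video.items()
                  if cost > factor * typical)


if __name__ == "__main__":
    # Placeholder rates: $0.15 per audio hour, $0.002 per 1,000 embedding tokens.
    print(ingest_cost_usd(1.0, 15_000, 0.15, 0.002))
    print(flag_anomalies({"a": 0.18, "b": 0.20, "c": 0.17, "d": 2.50}))
```

Using the median rather than the mean keeps one runaway video from raising the baseline it is being compared against.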

Per-tenant. For multi-tenant deployments, every query, every ingestion job, every storage byte is tagged with a tenant ID. Cost rolls up by tenant for invoicing, for capacity planning, and for spotting tenants whose usage patterns are about to break a pricing model.

Per-query. Every query records its model calls, token counts, retrieval candidate count, and total latency. The aggregate is the per-query cost over time. The interesting signal is variance — a query class whose cost suddenly doubles is either a regression in retrieval (returning too many candidates) or an abuse pattern (a tenant running automated queries you did not plan for).

CloudWatch alarms fire on three thresholds: daily total cost (budget guardrail), per-tenant cost variance (catches abuse and runaway usage), and per-query latency P99 (catches retrieval degradation before users complain).

The principle that holds at any scale

The same architecture that runs the retrieve-first pipeline on one video runs it on ten thousand hours. What changes is where the cost lives, where the latency lives, and where the failure modes hide. The model on the query path stays roughly the same price per request. Everything upstream of the query — ingestion, embedding, indexing, reprocessing — becomes the engineering problem.

Teams that ship video AI features successfully at scale share a handful of habits. They model the ingestion cost honestly before they ship the budget. They size the index for the corpus they will have in eighteen months, not the corpus they have today. They build reprocessing into the architecture from day one because the embedding model will change. And they monitor cost at the resolution that lets them catch the surprise before finance does.

The retrieval bill is the bill you see. The ingestion bill is the bill that decides whether the project survives.