The cleanup lifecycle policies cannot do
The previous article in this series covered how to keep S3 storage bounded going forward with lifecycle policies — expiring noncurrent versions, transitioning cold data, cleaning up incomplete uploads. The policies are powerful. They have one limitation that becomes obvious the first time you point them at a bucket with a long history: they only act on objects as they age past the configured threshold, on the daily cadence S3 runs lifecycle evaluation.
For a bucket carrying years of accumulated history — millions or billions of objects, terabytes of versions older than anyone remembers, prefixes that nobody owns anymore — lifecycle policies eventually clean it up. "Eventually" can mean weeks or months at large scale, depending on the volume and the rules.
When the business needs the cleanup done in a defined window — before a compliance audit, before a migration, before the next quarterly cost review — lifecycle policies alone are not the right tool. S3 Batch Operations is.
What S3 Batch Operations actually is
S3 Batch Operations is an execution engine for actions across very large object sets. It is not a discovery tool. It does not scan the bucket and decide what to do. It takes two inputs: a manifest (a list of object keys, optionally with version IDs) and an operation (copy, restore, replace or delete tags, invoke a Lambda function, and a few others). It then performs the operation on every object in the manifest, at scale, in parallel, with retry handling and a structured completion report. There is no built-in delete operation: deletes are typically carried out by a Lambda function the job invokes, or by tagging objects so that a lifecycle rule expires them.
The strength of the model is the separation: you decide what to act on; Batch Operations acts on it. The manifest can come from anywhere — a SQL query against an S3 Inventory report, a script that walks the bucket, a curated CSV from a compliance review. The operation runs against exactly that set, no more.
This is also where teams unfamiliar with the tool sometimes lose track. The decisions about what to delete have to be made before Batch Operations runs. Once the job starts, it does the thing. There is no "review what would have happened" mode during execution.
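As a rough illustration of the two inputs, the sketch below assembles the request one might pass to boto3's `s3control.create_job`. The account ID, ARNs, ETag, description, and report prefix are all placeholders, and the operation shown is a Lambda invocation. Treat it as one possible shape under those assumptions, not a definitive job definition.

```python
import uuid


def build_job_request(account_id, role_arn, manifest_arn, manifest_etag,
                      lambda_arn, report_bucket_arn):
    """Assemble keyword arguments for s3control.create_job (all inputs hypothetical)."""
    return {
        "AccountId": account_id,
        "ConfirmationRequired": True,  # job waits for explicit confirmation before running
        "Priority": 10,
        "RoleArn": role_arn,
        "ClientRequestToken": str(uuid.uuid4()),  # idempotency token
        "Description": "cleanup: noncurrent versions past retention",
        # Input 1: the manifest, i.e. exactly which objects to act on.
        "Manifest": {
            "Spec": {
                "Format": "S3BatchOperations_CSV_20180820",
                "Fields": ["Bucket", "Key", "VersionId"],
            },
            "Location": {"ObjectArn": manifest_arn, "ETag": manifest_etag},
        },
        # Input 2: the operation applied to every manifest row.
        "Operation": {"LambdaInvoke": {"FunctionArn": lambda_arn}},
        "Report": {
            "Enabled": True,
            "Bucket": report_bucket_arn,
            "Format": "Report_CSV_20180820",
            "Prefix": "batch-reports",
            "ReportScope": "AllTasks",
        },
    }
```

With boto3 installed and credentials configured, `boto3.client("s3control").create_job(**request)` would submit the job built this way.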
The end-to-end cleanup flow
A real enterprise cleanup, done correctly, has four phases. Each one matters; skipping any of them is how teams accidentally delete data they needed.
Phase 1: Inventory
S3 Inventory is the discovery layer. It runs on a schedule (daily or weekly) and writes a structured report of every object in the bucket — including all versions — to a destination bucket. The report includes object keys, version IDs, storage classes, sizes, last modified dates, ETags, and any metadata you configure.
For a cleanup, the configuration that matters is IncludedObjectVersions: All. Without this, the report covers only current versions. The historical accumulation you are trying to clean up is invisible.
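A minimal sketch of such a configuration, written as the dictionary boto3's `put_bucket_inventory_configuration` expects. The configuration ID, destination ARN, account ID, output format, and prefix are placeholder choices for illustration only.

```python
def build_inventory_config(config_id, destination_bucket_arn, account_id):
    """Inventory configuration covering every object version (names are placeholders)."""
    return {
        "Id": config_id,
        "IsEnabled": True,
        "IncludedObjectVersions": "All",  # "Current" would hide noncurrent versions
        "Schedule": {"Frequency": "Daily"},  # or "Weekly"
        "OptionalFields": ["Size", "LastModifiedDate", "StorageClass", "ETag"],
        "Destination": {
            "S3BucketDestination": {
                "AccountId": account_id,
                "Bucket": destination_bucket_arn,
                "Format": "Parquet",  # CSV and ORC are the other options
                "Prefix": "inventory",
            }
        },
    }
```

It would be applied with something like `boto3.client("s3").put_bucket_inventory_configuration(Bucket=source_bucket, Id=config_id, InventoryConfiguration=config)`.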
The inventory report is the single source of truth for the rest of the process. Every subsequent decision — what to delete, what to retain, what to migrate — references the inventory rows.
Phase 2: Analysis
This is the phase that most cleanup projects underinvest in, and it is the phase that decides whether the project succeeds.
Load the inventory into a queryable form. For small to medium inventories, S3 Select against the inventory file in place works, where it is still enabled for your account (AWS has closed S3 Select to new customers). For larger inventories, Athena over the inventory bucket gives you SQL access at scale. The queries that matter:
- Total bytes broken down by prefix, storage class, age
- Noncurrent versions older than the retention threshold
- Objects that are current but have not been accessed in a defined window (requires S3 Server Access Logs or S3 Storage Lens)
- Prefixes with no owner — applications that were retired, projects that ended
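- Multipart upload fragments older than 7 days (incomplete uploads do not appear in S3 Inventory; find them with ListMultipartUploads or S3 Storage Lens)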
Each category is a candidate for cleanup, and each has a different decision-maker behind it. Noncurrent versions older than the retention policy are an engineering decision. Old data from a retired application is a business decision. Cold data from an active application is a finance and product decision. Get the right person to sign off on each category before the manifest is built.
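The first two queries in the list can be sketched in plain Python over inventory rows that have already been loaded. The row layout (`key`, `version_id`, `size`, `storage_class`, `is_latest`, `last_modified`) is an assumed shape for illustration, and the sample rows are made up. One caveat: the inventory records when a version was written, not when it became noncurrent, so filtering on last-modified date only approximates the age a lifecycle rule would measure.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone


def bytes_by_prefix_and_class(rows):
    """Sum object sizes per (top-level prefix, storage class)."""
    totals = defaultdict(int)
    for row in rows:
        prefix = row["key"].split("/", 1)[0] if "/" in row["key"] else "(root)"
        totals[(prefix, row["storage_class"])] += row["size"]
    return dict(totals)


def stale_noncurrent(rows, now, retention_days):
    """Return noncurrent versions last modified before the retention cutoff."""
    cutoff = now - timedelta(days=retention_days)
    return [r for r in rows if not r["is_latest"] and r["last_modified"] < cutoff]


# Made-up sample rows standing in for a loaded inventory report.
now = datetime(2024, 6, 1, tzinfo=timezone.utc)
rows = [
    {"key": "logs/a.gz", "version_id": "v1", "size": 100, "storage_class": "STANDARD",
     "is_latest": False, "last_modified": now - timedelta(days=800)},
    {"key": "logs/a.gz", "version_id": "v2", "size": 120, "storage_class": "STANDARD",
     "is_latest": True, "last_modified": now - timedelta(days=10)},
    {"key": "app/db.bak", "version_id": "v1", "size": 500, "storage_class": "GLACIER",
     "is_latest": False, "last_modified": now - timedelta(days=30)},
]
print(bytes_by_prefix_and_class(rows))
print([r["version_id"] for r in stale_noncurrent(rows, now, retention_days=365)])
```

At real inventory sizes the same logic would normally live in an Athena query; the Python version is only meant to make the selection rules explicit.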
Phase 3: Manifest construction
The manifest is a CSV containing the bucket name and URL-encoded key for each object and, for version-specific deletes, the version ID as a third column. The format is simple. The selection logic is the entire decision.
The pattern we use is to generate the manifest as the output of an Athena query against the inventory. The query encodes the cleanup decision. Saving the query alongside the manifest preserves the rationale — when an auditor or a future engineer asks "why did we delete these specific objects in this specific job," the answer is in the SQL.
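For cases where the manifest is written by a script rather than taken directly from query output, a minimal sketch looks like this. The bucket and key are invented, and treating `/` as safe during URL-encoding is an assumption worth checking against a known-good manifest. Keys copied from a CSV inventory report are already URL-encoded and should not be encoded a second time.

```python
import csv
import io
from urllib.parse import quote


def write_manifest(rows, out):
    """Write Bucket,Key,VersionId rows; keys are URL-encoded."""
    writer = csv.writer(out, lineterminator="\n")
    for bucket, key, version_id in rows:
        writer.writerow([bucket, quote(key, safe="/"), version_id])


# Hypothetical single-row manifest written to an in-memory buffer.
buf = io.StringIO()
write_manifest([("example-bucket", "logs/2019/app log.gz", "v1")], buf)
print(buf.getvalue())  # example-bucket,logs/2019/app%20log.gz,v1
```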
Before submitting the manifest to Batch Operations, two safety checks:
- Sample the manifest. Pick a few hundred objects at random and verify they match what the query was supposed to find. The cost of validation is minutes; the cost of finding out after the fact is significantly higher.
- Set up dry-run validation if possible. For destructive operations, configure the Batch Operations job to use a Lambda invocation as a first pass — a Lambda that logs each object it would have deleted to a separate report, without actually deleting. Run the dry-run, audit the report, then submit the real job.
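The dry-run check could be implemented with a handler along these lines. It follows the request and response shape Batch Operations uses for Lambda invocations, touches no S3 API, and writes what it would have deleted into `resultString`, which ends up in the job's completion report. The `DRY_RUN` message format is our own invention.

```python
from urllib.parse import unquote_plus


def handler(event, context):
    """Dry-run handler: reports what would be deleted without calling S3."""
    results = []
    for task in event["tasks"]:
        key = unquote_plus(task["s3Key"])  # keys arrive URL-encoded
        version = task.get("s3VersionId") or "null"
        results.append({
            "taskId": task["taskId"],
            "resultCode": "Succeeded",
            "resultString": f"DRY_RUN would delete {key} version {version}",
        })
    return {
        "invocationSchemaVersion": event["invocationSchemaVersion"],
        "treatMissingKeysAs": "PermanentFailure",
        "invocationId": event["invocationId"],
        "results": results,
    }
```

Keeping the dry-run and the real handler structurally identical, differing only in whether the delete call is made, reduces the chance that the audited behavior and the executed behavior diverge.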
Phase 4: Execution and validation
Submit the job. Batch Operations works through the manifest in parallel; you set a job priority rather than a concurrency level. For very large jobs, throughput is limited by S3's request rate per prefix. In a bucket whose keys are spread across many prefixes, parallelism is high. In a single-prefix bucket, throughput is capped by the per-prefix limit (3,500 PUT, COPY, POST, or DELETE requests per second per prefix as of writing; the 5,500 figure applies to GET and HEAD requests).
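A back-of-the-envelope estimate helps set expectations for the execution window. The sketch below assumes the request rate is the only constraint and uses S3's documented baseline of 3,500 write-class requests per second per prefix. Real jobs will be slower, so read the result as a lower bound.

```python
def estimated_hours(object_count, active_prefixes, per_prefix_rate):
    """Lower bound on run time if request rate is the only limit."""
    total_rate = active_prefixes * per_prefix_rate
    return object_count / total_rate / 3600


# 500 million deletes at 3,500 requests/second/prefix
print(round(estimated_hours(500_000_000, 1, 3500), 1))   # single prefix: 39.7
print(round(estimated_hours(500_000_000, 20, 3500), 1))  # 20 prefixes: 2.0
```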
The job emits a completion report listing every object processed, the action taken, and any errors. Errors are the most important part: failed deletes (typically due to MFA Delete being enabled on the bucket, Object Lock retention, or permissions issues) need to be addressed individually. The report is the audit trail.
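Triage of the report can be scripted. The sketch below groups failed tasks by error code. The column order is an assumption to verify against your own report files, and the two sample rows are invented.

```python
import csv
import io
from collections import Counter

# Assumed column order for the CSV report; verify against your own report files.
COLUMNS = ["Bucket", "Key", "VersionId", "TaskStatus", "ErrorCode",
           "HTTPStatusCode", "ResultMessage"]


def summarize_failures(report_file):
    """Count failed tasks by error code and collect their rows."""
    reader = csv.DictReader(report_file, fieldnames=COLUMNS)
    failed = [row for row in reader if row["TaskStatus"].lower() != "succeeded"]
    return Counter(row["ErrorCode"] for row in failed), failed


# Invented sample rows for illustration.
sample = io.StringIO(
    "example-bucket,logs/a.gz,v1,succeeded,,200,Successful\n"
    "example-bucket,legal/b.pdf,v7,failed,AccessDenied,403,Access Denied\n"
)
counts, failed = summarize_failures(sample)
print(counts)
print([row["Key"] for row in failed])
```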
Validate after execution by re-running the inventory and confirming the bucket size dropped by the expected amount. This sounds obvious. It is the step most often skipped, and the step that catches the cases where the job ran but the savings did not materialize because the wrong objects were selected.
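The comparison itself is simple arithmetic. This sketch assumes you have total bytes from the inventory before and after the job, plus the byte total of the manifest rows. The 5% tolerance is an arbitrary placeholder to absorb new writes that land between the two inventory runs.

```python
def validate_cleanup(bytes_before, bytes_after, expected_freed, tolerance=0.05):
    """Compare the observed drop in inventory bytes with the manifest's expected total."""
    actual_freed = bytes_before - bytes_after
    shortfall = expected_freed - actual_freed
    ok = abs(shortfall) <= tolerance * expected_freed
    return {"actual_freed": actual_freed, "shortfall": shortfall, "ok": ok}


print(validate_cleanup(10_000, 6_100, 4_000))  # close to the expected drop
print(validate_cleanup(10_000, 9_000, 4_000))  # job ran, savings did not materialize
```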
The patterns that hold up at enterprise scale
Three patterns we deploy on every large cleanup engagement.
Tier the cleanup. A bucket with terabytes of bloat does not need to be cleaned up in one job. The first job targets the highest-value, lowest-risk category — typically multipart upload fragments and noncurrent versions older than two years. The savings show up immediately. The next job tackles the next tier. The phasing builds confidence with finance and security stakeholders, and any mistake in an early job is contained.
Keep a recovery window before deletion. For destructive operations on large data sets, the pattern that buys you the most safety is to first copy the candidates to a separate "quarantine" bucket whose lifecycle policy expires them automatically after 30 or 60 days. Then delete from the source. If something turns out to have been needed, the recovery window is long enough for the team to find out and restore it. After the window closes, the quarantine bucket cleans itself up via lifecycle policy. Total cost: a month or two of duplicated storage on the targeted subset. Total value: a documented escape hatch.
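The quarantine bucket's self-cleaning behavior comes from an ordinary lifecycle configuration. A minimal sketch, in the dictionary form boto3's `put_bucket_lifecycle_configuration` accepts, with the rule ID and the secondary settings chosen for illustration:

```python
def quarantine_lifecycle(days):
    """Lifecycle configuration that empties a quarantine bucket after `days`."""
    return {
        "Rules": [{
            "ID": f"expire-quarantine-after-{days}d",
            "Status": "Enabled",
            "Filter": {"Prefix": ""},  # apply to every object
            "Expiration": {"Days": days},
            # Only relevant if the quarantine bucket is versioned.
            "NoncurrentVersionExpiration": {"NoncurrentDays": 1},
            "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
        }]
    }
```

It would be applied with something like `boto3.client("s3").put_bucket_lifecycle_configuration(Bucket=quarantine_bucket, LifecycleConfiguration=quarantine_lifecycle(60))`.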
Document the manifest with the rationale. A Batch Operations job that deletes 500 million objects is going to come up in an audit two years from now. The audit team will ask why. The right answer is "here is the SQL query that built the manifest, the inventory it ran against, and the sign-off from the data owner." That trio takes minutes to assemble at the time and is impossible to reconstruct after the fact.
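One lightweight way to capture that trio is a small record stored next to the manifest. The field names, the sample query, the inventory location, and the approver below are all hypothetical. The point is only that the manifest's hash ties the record to the exact file that was submitted.

```python
import hashlib
import json


def audit_record(manifest_bytes, query_sql, inventory_uri, approver, approved_on):
    """Bundle the three artifacts an auditor is likely to ask for."""
    return {
        "manifest_sha256": hashlib.sha256(manifest_bytes).hexdigest(),
        "manifest_rows": len(manifest_bytes.splitlines()),
        "query_sql": query_sql,
        "inventory_uri": inventory_uri,
        "sign_off": {"approver": approver, "date": approved_on},
    }


# Hypothetical inputs for illustration.
manifest = b"example-bucket,logs/a.gz,v1\nexample-bucket,logs/b.gz,v4\n"
record = audit_record(
    manifest,
    "SELECT bucket, key, version_id FROM inventory WHERE is_latest = false",
    "s3://inventory-dest/inventory/2024-06-01/",
    "data-owner@example.com",
    "2024-06-03",
)
print(json.dumps(record, indent=2))
```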
When Batch Operations is not the right tool
For ongoing maintenance — keeping the bucket clean rather than catching up — lifecycle policies are the right tool. Batch Operations is for the one-time catch-up and for periodic large-scale reorganizations.
For migrations between storage classes, Batch Operations does work and is sometimes used. For most cases, a transition rule in a lifecycle policy is simpler.
For copying objects across regions or accounts, S3 Replication (Cross-Region Replication or Same-Region Replication) is usually the better choice. Batch Operations supports copy as an operation, but the replication features are more purpose-built for migration use cases.
The three-layer model
After three articles, the structure of a coherent S3 cost strategy is:
- S3 Inventory is the visibility layer. You cannot manage what you cannot see.
- Lifecycle Policies are the automation layer. They keep new and aging data managed in the background, every day.
- Batch Operations is the execution layer. It handles the once-or-twice-a-year jobs that lifecycle policies cannot do at speed.
Each layer fills a gap the others leave. Most teams have none of them. The teams that have all three have S3 bills that are explainable to finance, defensible to audit, and stable to plan against. That is what the work is for. Storage is cheap when it is managed. The bill that surprises you is always the one that ran on autopilot for too long.
