Why Vector DB Backups Are Different (and Harder)
Vector databases store two distinct kinds of data stitched together: the original payload (text, metadata, IDs) and the high-dimensional index that makes similarity search fast. A usable backup has to capture both and, critically, they have to come back up in sync. If your index points to vector IDs whose payloads you lost, your retrieval silently returns nulls; if you restore payloads without rebuilding the index, your queries time out. Every operational footgun in vector backup comes from this split-brain problem.
The second hard part is the index itself. Unlike a Postgres B-tree, an HNSW or IVF index is a graph structure that cannot be naively appended to a WAL — it gets compacted, merged, and rewritten. Restoring a vector index from raw data means either replaying the WAL and rebuilding the graph (slow for large corpora) or backing up the on-disk index files (fast restore, but the files are not portable across versions or hardware).
I learned this the hard way running a 90-million-vector Weaviate cluster for a customer support RAG system. The first version of our backup plan was a daily tar of the persistent volume. It worked — until we had to restore after a bad schema migration. The restore took 14 hours because the index files were tied to the exact server hardware and Weaviate minor version. After that incident, I rebuilt the backup strategy around per-engine native tooling, and the restore time dropped to 47 minutes. This guide is the playbook I wish I had started with.
What a Production Backup Plan Actually Has to Cover
Before picking a tool, write down the four things your SRE will ask for:
- RPO (Recovery Point Objective) — How much data loss is acceptable. For most RAG systems this is small (a few hours of newly-ingested documents) because the source documents live in another system and can be re-embedded. For some compliance workloads RPO must be zero.
- RTO (Recovery Time Objective) — How fast you need to be back online. A 6-hour RTO gives you time to rebuild an index; a 30-minute RTO forces you to keep a warm standby.
- Index portability — Do you need to restore to the same version, or across versions? Same hardware, or can the restored cluster run on different nodes?
- Multi-tenancy / namespace scope — Are you backing up a single tenant, a namespace, or the whole cluster? Most production incidents are per-tenant, not whole-cluster.
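One way to make the RTO bullet concrete is to compare it against measured rebuild and restore times. The sketch below is illustrative only: the function name, the three strategy labels, and the sample timings are assumptions, not part of any engine's tooling.

```python
def pick_strategy(rto_minutes: float, rebuild_minutes: float, restore_minutes: float) -> str:
    """Pick the simplest recovery path that still fits inside the RTO.

    rebuild_minutes: measured time to re-embed and re-index from source documents.
    restore_minutes: measured time to restore from the most recent backup.
    """
    if rebuild_minutes <= rto_minutes:
        # Source documents are the backup; no backup artifacts needed for RTO.
        return "rebuild-from-source"
    if restore_minutes <= rto_minutes:
        return "restore-from-backup"
    # Neither path is fast enough, so a second live copy is required.
    return "warm-standby"


if __name__ == "__main__":
    # Hypothetical timings: 270 min to rebuild, 47 min to restore.
    for rto in (360, 60, 30):
        print(rto, pick_strategy(rto, rebuild_minutes=270, restore_minutes=47))
```

With those sample timings, a 6-hour RTO allows a rebuild, a 1-hour RTO needs a backup restore, and a 30-minute RTO pushes you to a warm standby.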
Once you have those, the database choice follows. Below is the production-grade picture for the five vector databases I see most often: Pinecone, Weaviate, Qdrant, Milvus, and Chroma.
Pinecone: Serverless Means Backups Are Someone Else's Problem (Mostly)
Pinecone serverless is a managed-only product. There is no pg_dump equivalent and no on-disk files you can copy. Pinecone handles durability, replication, and point-in-time recovery internally. From a backup perspective, what you actually have control over is:
- Bulk export via list + fetch: Pinecone lets you iterate vectors in pages and fetch them, but this is a slow API path (a few hundred vectors per second per pod). For a 10-million-vector index, a full export takes hours and is billed against your read quota.
- Collections and metadata exports: Pinecone's describe_index output and the namespace-level metadata are easier to capture, but the vectors themselves still have to come through the data plane.
- Pod-based Pinecone: If you are on the older pod-based product, the underlying pods run in Pinecone's infrastructure; backup is still a Pinecone-side responsibility. Collections give you a static copy of a pod-based index, but that copy stays inside Pinecone. There is no customer-side snapshot path.
The honest answer: if you are on Pinecone serverless, your "backup" is the source documents and the embedding generation code. If Pinecone deletes your namespace (account suspension, region failure, name collision), you rebuild the index from the source corpus, not from a backup. For most RAG workloads this is the right tradeoff because re-embedding a few million documents is a one-time cost, not a daily operational tax. It is only the wrong answer if you have compliance requirements that forbid re-deriving an index from outside data, or if the source documents are not kept in a stable form.
Concrete numbers from my last Pinecone customer: A 12-million-vector support knowledge base could be fully re-embedded in 4.5 hours on a single A10G using their existing bge-large-en-v1.5 embedding model, at a cost of about $14 in GPU time. The RPO was effectively zero (Pinecone handles durability) and the RTO was 5 hours end-to-end including validation. That customer chose to keep the source corpus in S3 and treat Pinecone as a derived cache, which is the operational pattern I recommend for any Pinecone deployment that does not have a hard data-residency requirement.
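The figures above imply a throughput and a unit cost that are handy for projecting other corpus sizes. This is a back-of-the-envelope sketch: the function names are made up for illustration, and it assumes throughput scales linearly with corpus size, which may not hold for your model or hardware.

```python
def reembed_rates(vectors: int, hours: float, gpu_cost_usd: float) -> dict:
    """Derive throughput and unit cost from one observed re-embedding run."""
    seconds = hours * 3600
    return {
        "vectors_per_second": vectors / seconds,
        "usd_per_gpu_hour": gpu_cost_usd / hours,
        "usd_per_million_vectors": gpu_cost_usd / (vectors / 1_000_000),
    }


def projected_hours(target_vectors: int, vectors_per_second: float) -> float:
    """Project wall-clock hours for a different corpus at the same throughput."""
    return target_vectors / vectors_per_second / 3600


# The run described above: 12M vectors, 4.5 hours, about $14 of GPU time.
rates = reembed_rates(12_000_000, 4.5, 14.0)

if __name__ == "__main__":
    print(rates)
    print(projected_hours(50_000_000, rates["vectors_per_second"]))
```

On those inputs the run works out to roughly 740 vectors per second and a little over $1 per million vectors; a 50-million-vector corpus at the same rate would take about 18.75 hours on one GPU.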
Weaviate: Native Cloud-Native Backups, Watch the Multi-Tenancy Edge
Weaviate has the most mature native backup story of the open-source engines. Since v1.18 the backup system has been first-class:
- Single-command backup and restore: POST /v1/backups/{backend} to start a backup, POST /v1/backups/{backend}/{id}/restore to restore. The backup includes both the index and the payload store, so you do not have to orchestrate two separate restores.
- Cloud-native storage backends: S3, GCS, Azure Blob, and a local filesystem target for dev. Backups are portable across backends, which is the killer feature when you are migrating regions or doing a cloud-to-on-prem recovery drill.
- Incremental backups: Since v1.27, backups are incremental by default: only changed objects are uploaded. This is the single biggest operational improvement in the last year. On our 90-million-vector cluster, a full daily backup used to take 11 hours, and an incremental backup takes 18-25 minutes.
- Per-collection or whole-instance scope: You choose. For a multi-tenant cluster, per-collection backups are the right granularity: restore one tenant without affecting the others.
The two things that will bite you if you do not read the docs:
- Version compatibility. A backup made on Weaviate v1.23.12 or older will corrupt on restore to a newer version. You must be on v1.23.13 or higher before restoring. The error is a generic "restore failed" with no clear pointer to the version, so the first time you hit it you will spend an hour debugging.
- Multi-tenancy edge case. As of v1.37, backups include both active and inactive (cold) tenants, and inactive tenants are backed up directly from disk without being activated. In versions before v1.37, only active tenants are included — which means you have to activate any tenant you want backed up before the backup runs. The fix is to upgrade, but if you are stuck on an older version, document this in your runbook.
Example backup command against an S3 backend:
curl -X POST "https://weaviate.example.com/v1/backups/s3" \
  -H "Content-Type: application/json" \
  -d '{
    "id": "nightly-2026-06-02",
    "include": ["Collection1", "Collection2"],
    "config": {
      "CompressionLevel": "DefaultCompression"
    }
  }'
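The create call returns before the backup finishes, so a nightly job usually needs to poll until the backup reaches a terminal state. Below is a minimal polling sketch. It assumes the status endpoint is GET /v1/backups/{backend}/{id} and that the terminal states are SUCCESS and FAILED; check both against the docs for your Weaviate version. The helper names, the poll interval, and the unauthenticated request are illustrative assumptions.

```python
import json
import time
import urllib.request
from typing import Callable


def fetch_status(base_url: str, backend: str, backup_id: str) -> str:
    """Read the backup status from the (assumed) status endpoint."""
    url = f"{base_url}/v1/backups/{backend}/{backup_id}"
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)["status"]


def wait_for_backup(
    get_status: Callable[[], str],
    poll_seconds: float = 10.0,
    max_polls: int = 360,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the backup succeeds, fails, or the poll budget runs out."""
    for _ in range(max_polls):
        status = get_status()
        if status == "SUCCESS":
            return status
        if status == "FAILED":
            raise RuntimeError("backup reported FAILED")
        sleep(poll_seconds)  # still in progress, wait and try again
    raise TimeoutError(f"backup not finished after {max_polls} polls")
```

A job could call it as wait_for_backup(lambda: fetch_status("https://weaviate.example.com", "s3", "nightly-2026-06-02")). Passing the status getter in as a callable keeps the loop easy to test without a live cluster.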
Concrete restore numbers from my last incident: 90 million vectors, 3-node Weaviate cluster, nightly S3 incremental backups. Cluster-wide restore from the most recent incremental + the last weekly full: 47 minutes end-to-end, including the 22-minute index rebuild on the restored cluster. RPO was 24 hours (one nightly window) and RTO was under an hour.
Qdrant: Snapshots Per Node, Backups in Qdrant Cloud
Qdrant's snapshot model is a deliberate split between snapshots (tar archives of a single collection on a single node) and backups (Qdrant Cloud's physical disk-level copies). The split exists because the engineering tradeoff is different: snapshots are easy to copy around, backups are fast to restore.
- Collection snapshots: POST /collections/{name}/snapshots creates a tar archive in the configured snapshot_path. You can GET the snapshot to download it. Restore is PUT /collections/{name}/snapshots/recover, with the snapshot's location passed in the request body.
- Distributed snapshot gotcha: In a multi-node Qdrant cluster, snapshots are per node. You must create a snapshot on each node, and you must restore to a cluster with the same shard topology. The Qdrant team is explicit about this in the docs: "If you work with a distributed deployment, you have to create snapshots for each node separately." For a 3-shard, 3-replica cluster that means coordinating 9 snapshot operations per backup run.
- Collection aliases are NOT in snapshots: A subtle but important detail. If you use aliases to swap collections for blue-green deploys, the alias mapping is not part of the snapshot. You have to script the alias restore separately, or your service will fail to find its collections on startup.
- Qdrant Cloud backups: If you are on Qdrant Cloud (managed), the platform handles physical disk-level backups. You do not get to call the API directly, but the RPO and RTO are part of the SLA you signed up for.
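Since aliases are not part of a snapshot, one approach is to save the output of GET /aliases next to each snapshot and turn it back into an alias-update request after recovery. The sketch below only builds the request body. It assumes the saved response has the shape {"result": {"aliases": [...]}} and that the body is sent to POST /collections/aliases; the function name is hypothetical.

```python
def alias_restore_payload(aliases_response: dict) -> dict:
    """Convert a saved GET /aliases response into create_alias actions."""
    aliases = aliases_response["result"]["aliases"]
    return {
        "actions": [
            {
                "create_alias": {
                    "collection_name": entry["collection_name"],
                    "alias_name": entry["alias_name"],
                }
            }
            for entry in aliases
        ]
    }


if __name__ == "__main__":
    # Hypothetical saved response from before the incident.
    saved = {
        "result": {
            "aliases": [
                {"alias_name": "support-live", "collection_name": "support_v7"},
            ]
        }
    }
    print(alias_restore_payload(saved))
```

Run it only after the collections themselves have been recovered, since an alias has to point at a collection that exists.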
Practical recipe for a self-hosted Qdrant cluster: Run a cron on each node that triggers POST /collections/{name}/snapshots at midnight, uploads the resulting tar to S3, and asserts the S3 object size matches the on-disk size. Coordinate the cron with a small leader-election step (an etcd lease is overkill; a Redis SETNX with a 60-second expiry is enough) so you do not start 9 parallel snapshot operations across the cluster. On restore, do a dry-run first into a fresh cluster to verify the snapshot's collection names match the current schema. Qdrant will not error on a collection-name mismatch; it will silently create a new empty collection.
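The per-node job in that recipe can be sketched as a small function with the lock, snapshot, upload, and size lookup passed in as callables. Everything here is an illustrative assumption: the function name, the lock key format, and the shape of the callables. In a real job, acquire_lock might wrap a Redis SET with NX and EX 60, and create_snapshot might wrap the snapshot endpoint and return the name and size it reports.

```python
from typing import Callable, Optional, Tuple


def snapshot_node(
    collection: str,
    acquire_lock: Callable[[str], bool],
    create_snapshot: Callable[[str], Tuple[str, int]],
    upload: Callable[[str, str], str],
    remote_size: Callable[[str], int],
) -> Optional[str]:
    """Snapshot one collection on this node, upload it, and verify the size.

    Returns the snapshot name, or None if another node holds the lock
    (the caller is expected to retry on its next tick).
    """
    if not acquire_lock(f"snapshot-lock:{collection}"):
        return None

    name, local_bytes = create_snapshot(collection)
    object_key = upload(collection, name)

    # A size mismatch usually means a truncated or partial upload.
    uploaded_bytes = remote_size(object_key)
    if uploaded_bytes != local_bytes:
        raise RuntimeError(
            f"{object_key}: uploaded {uploaded_bytes} bytes, expected {local_bytes}"
        )
    return name
```

A size match is a cheap sanity check, not proof of integrity; comparing checksums would be stricter if your upload path exposes them.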
Milvus: The Most Operational Complexity, The Most Flexibility
Milvus separates storage from compute, which means you have three things to back up: etcd (metadata), MinIO or S3 (the actual vectors and index), and Pulsar or Kafka (the WAL for recent writes). The good news is that since Milvus 2.4 the platform ships a first-class milvus-backup tool that handles all three in one command. The bad news is that the tool is not a single binary you drop in: it needs network access to all three storage layers and a config file with the right connection strings.
- milvus-backup: Open source, maintained by the Milvus team, supports full and incremental backups, can restore to a different Milvus cluster, and can be pointed at any S3-compatible storage. The CLI is milvus-backup create and milvus-backup restore.
- Granularity: Database-level, collection-level, or partition-level backup. For multi-tenant Milvus this is the right granularity knob.
- Cross-version restore: A backup made on Milvus 2.3 will restore to 2.4, but you must check the release notes. The Milvus team publishes a compatibility matrix; assume no restore will work across a major version bump without re-embedding.
- Standby cluster pattern: For tight RPO, the standard pattern is a standby Milvus cluster that tails the primary's WAL (via Pulsar) and a periodic milvus-backup sync to handle the case where the WAL replays go out of sync. This is a multi-day setup the first time, but it gets you to single-digit-minute RPO.
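For scheduled jobs it can help to wrap the two CLI calls so the backup name is validated before anything runs. This is a sketch under assumptions: the -n (name) and -s (restore suffix) flags are as I recall them from the project README, and the rule that names contain only letters, digits, and underscores is a conservative guess. Confirm both with milvus-backup --help for your release. The wrapper names are hypothetical.

```python
import re
import subprocess
from typing import List

# Conservative assumption: restrict names to letters, digits and underscores.
_NAME_RE = re.compile(r"^[A-Za-z0-9_]+$")


def _check_name(name: str) -> str:
    if not _NAME_RE.match(name):
        raise ValueError(f"unsupported backup name: {name!r}")
    return name


def create_cmd(name: str, binary: str = "milvus-backup") -> List[str]:
    """Build the argv for creating a named backup."""
    return [binary, "create", "-n", _check_name(name)]


def restore_cmd(name: str, suffix: str = "_recover", binary: str = "milvus-backup") -> List[str]:
    """Build the argv for restoring a backup into suffixed collections."""
    return [binary, "restore", "-n", _check_name(name), "-s", suffix]


def run(cmd: List[str]) -> None:
    """Execute the command and raise if the tool exits non-zero."""
    subprocess.run(cmd, check=True)
```

Restoring with a suffix lands the data in new collections next to the live ones, which suits a dry-run; the tool still needs its own config file with the etcd and object-storage connection details.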
My take on Milvus backups: Milvus is the most operationally demanding of the five engines, but if you need fine-grained backup control (cross-region restore, partition-level rollback, multi-tenant isolation) it is also the most capable. For a small team without dedicated platform engineers, the operational tax is too high and I would steer you toward Qdrant or Weaviate.
Chroma: Lightweight, But Bring Your Own Backup Strategy
Chroma is the odd one out. It is a single-node embedded database in its default deployment, with no native snapshot or backup API. The "backup" is whatever you do to the chroma directory on disk:
- Filesystem snapshot: A consistent tar of the Chroma data directory while the process is briefly quiesced. This works but is a stop-the-world operation; for a Chroma instance serving traffic, that means a maintenance window.
- SQLite mode (Chroma 0.5+): If you run Chroma with the SQLite backend (the default for small deployments), you can use sqlite3 .backup for an online consistent backup. This is the right answer for Chroma instances under ~5 GB.
- ClickHouse mode: For larger Chroma deployments, ClickHouse handles the data and you can use ClickHouse's native backup tooling. Same operational story as any other ClickHouse instance.
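The SQLite option can also be driven from Python, whose standard library exposes the same online backup mechanism through Connection.backup. A minimal sketch, assuming you know the path to the SQLite file inside the Chroma data directory (the function name and file paths are placeholders). It copies only the SQLite database, so any index files stored alongside it would still need a file-level copy.

```python
import sqlite3


def backup_sqlite(src_path: str, dest_path: str) -> str:
    """Copy a live SQLite database with the online backup API.

    Returns the result of PRAGMA integrity_check on the copy ("ok" when clean).
    """
    src = sqlite3.connect(src_path)
    dest = sqlite3.connect(dest_path)
    try:
        # Copies all pages while other connections may keep using the source.
        src.backup(dest)
        return dest.execute("PRAGMA integrity_check").fetchone()[0]
    finally:
        dest.close()
        src.close()
```

A nightly job would call backup_sqlite with the live file and a dated destination, then ship the copy to object storage only if the return value is "ok".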
Chroma is fine for prototyping and small production workloads. If your RAG corpus is growing past 5 million vectors or you need multi-node, move to Weaviate or Qdrant — Chroma's backup story does not scale, and the operational team will outgrow it faster than you expect.
Concrete Runbook: Backup Strategy for a 50-Million-Vector RAG Cluster
This is the runbook I ship to customers running a mid-size RAG system. It assumes a self-hosted deployment on Kubernetes with S3 as the durable store.
- Hourly: Enable incremental backups (Weaviate) or per-node snapshot cron (Qdrant) or milvus-backup incremental (Milvus). Push artifacts to an S3 bucket with versioning enabled and a 30-day lifecycle policy.
- Daily: Trigger a full backup at 02:00 UTC when query load is lowest. Verify the backup size matches the expected range (a 10% size delta is normal; a 50% delta means something is wrong with the index compaction). Send a Slack notification with the backup size, duration, and verification status.
- Weekly: Run a restore dry-run into a fresh namespace in the same cluster. Time the restore end-to-end and store the result in a runbook. If the restore time regresses by more than 20%, file a ticket — usually means index fragmentation or a schema change that broke the index build path.
- Monthly: Test a cross-region restore. Pick a fresh AWS region (or GCP / Azure, if you are multi-cloud), bring up a fresh cluster, and restore the most recent backup. Time it. This is the only way to know your restore actually works in a real disaster, not just in CI.
- Quarterly: Test a full disaster recovery drill: delete the primary cluster and bring up from backup in a new region. Practice the actual incident response, not just the technical restore.
The 20% restore-time alert is a heuristic, not a hard rule. On our production cluster, restores take 40-55 minutes and have been stable within 10% for 8 months. A sudden jump almost always means someone added a new collection without telling the platform team, and the new collection has a much larger index than expected.
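The two thresholds in the runbook (backup size delta and restore-time regression) are easy to encode as checks in the verification step. A minimal sketch: the function names are illustrative, and the "warn" band between 10% and 50% is my assumption, since the runbook only names the two ends.

```python
def size_check(size_bytes: int, expected_bytes: int) -> str:
    """Classify a backup by how far its size is from the expected size."""
    delta = abs(size_bytes - expected_bytes) / expected_bytes
    if delta <= 0.10:
        return "ok"       # within the normal 10% band
    if delta < 0.50:
        return "warn"     # assumed middle band: worth a look, not an incident
    return "alert"        # 50% or more: likely an index compaction problem


def restore_regressed(baseline_minutes: float, latest_minutes: float,
                      threshold: float = 0.20) -> bool:
    """True when the latest restore is more than `threshold` slower than baseline."""
    return (latest_minutes - baseline_minutes) / baseline_minutes > threshold


if __name__ == "__main__":
    print(size_check(105, 100), size_check(130, 100), size_check(40, 100))
    print(restore_regressed(47, 60), restore_regressed(47, 50))
```

Feed the baseline from the stored weekly dry-run timings rather than a constant, so the check tracks the cluster as it grows.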
What NOT to Do
Three patterns I have seen break in production, in order of how often they bite:
- Backing up the running index files directly. Copying live index files while writes are happening can leave you with a torn, unusable copy, whatever the engine. Use the engine's snapshot API, which quiesces writes (or uses copy-on-write) for the duration of the snapshot.
- Restoring to a different major version. HNSW and IVF index file formats change between major versions. The engine will either refuse the restore or silently corrupt the index. Always restore to the same major version, then upgrade in place.
- Forgetting to back up the schema and tenant mappings. The vectors are useless if you cannot tell which collection and tenant they belong to. Most engines back up the collection name with the vectors, but tenant IDs and alias mappings are easy to miss.
One more: do not skip the restore drill. A backup you have never restored is a backup you do not have. The first time you find out your backup is broken is when you need it most, and that is the worst time to learn.
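The major-version rule is cheap to enforce as a guard at the top of a restore script. A sketch with hypothetical function names, assuming you record the engine version alongside each backup and that versions look like 1.23.13 or v2.4.0-rc1:

```python
import re
from typing import Tuple

_VERSION_RE = re.compile(r"^v?(\d+)\.(\d+)(?:\.(\d+))?")


def parse_version(text: str) -> Tuple[int, int, int]:
    """Parse 'v1.23.13' or '2.4.0-rc1' into (major, minor, patch)."""
    match = _VERSION_RE.match(text.strip())
    if not match:
        raise ValueError(f"unrecognised version: {text!r}")
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def restore_allowed(backup_version: str, target_version: str) -> bool:
    """Allow a restore only when both sides share the same major version."""
    return parse_version(backup_version)[0] == parse_version(target_version)[0]
```

This only encodes the same-major rule from the list above; engine-specific minimums, such as a required patch level, still need their own checks.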
FAQ
Should I back up my vector database or rebuild from source?
For most RAG workloads, rebuild from source is the right call: re-embedding a few million documents is cheap (GPU hours, not days), and the source corpus in S3 is your real source of truth. Native backups are the right call when RPO must be small, when re-embedding is too expensive (huge corpora, expensive embedding models), or when you have compliance requirements that forbid re-derivation. The decision should be made on a per-cluster basis, not as a blanket policy.
How do I size the S3 bucket for backups?
For Weaviate incremental backups on a 50-million-vector index, expect the daily backup to be 8-15% of the index size. With 30-day retention in S3 standard and 90-day in S3 IA, the storage cost is roughly 4-5x the index size in S3 over the full retention window. For a 100 GB index, that is 400-500 GB of S3 storage.
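A rough way to sanity-check that estimate is to model the retention window as one full copy plus one incremental per day. This is a simplification under stated assumptions: it ignores the IA tier, compression, and any deduplication between incrementals, and the function name is made up.

```python
def retained_gb(index_gb: float, daily_fraction: float, retention_days: int) -> float:
    """Storage held at steady state: one full copy plus daily incrementals."""
    return index_gb * (1 + daily_fraction * retention_days)


if __name__ == "__main__":
    # 100 GB index, 30-day window, incrementals at 8% and 15% of index size.
    low = retained_gb(100, 0.08, 30)
    high = retained_gb(100, 0.15, 30)
    print(round(low), round(high))
```

For a 100 GB index this gives roughly 340 to 550 GB, which brackets the 400-500 GB figure above.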
What is a reasonable RTO for a vector database?
For a small (under 10M vector) cluster, 30 minutes is achievable. For mid-size (10-100M vector), 1-2 hours is the realistic floor. For large (100M+ vector) clusters, plan for 4-8 hours unless you are running a warm standby. The index rebuild is the bottleneck — vector ingestion is parallelizable, but the HNSW graph build is single-threaded per shard.