CDC Chunking
Use CDC for large files that are frequently modified locally
What CDC Is
CDC stands for Content-Defined Chunking.
It is a chunking technique that decides chunk boundaries from the file content itself instead of fixed byte offsets.
Because of that, when a large file changes only locally, such as an append at the end or an edit to a small region in the middle, many unchanged parts are still cut into the same chunks as before. Those chunks keep the same hash and can be reused across backups, which improves deduplication.
In Prime Backup, restoring a chunked file is transparent to users. The original file is reconstructed automatically when the backup is read or exported.
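The effect of content-defined boundaries can be illustrated with a toy chunker. The sketch below is not FastCDC and not Prime Backup's code: the window hash (BLAKE2b over the last 16 bytes), the mask, and the size limits are all assumptions chosen to keep the example small. It compares how many chunks survive a 5-byte insertion near the start of a file under fixed-offset cutting and under content-defined cutting.

```python
import hashlib
import random

def fixed_chunks(data, size=64):
    """Cut at fixed byte offsets."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def content_defined_chunks(data, window=16, mask=0x3F, min_size=16, max_size=256):
    """Toy CDC: cut after byte i when a hash of the preceding window has its low bits clear."""
    chunks, start = [], 0
    for i in range(len(data)):
        length = i - start + 1
        if length < min_size:
            continue  # never cut before the minimum chunk size
        window_bytes = data[max(0, i - window + 1):i + 1]
        digest = hashlib.blake2b(window_bytes, digest_size=4).digest()
        fingerprint = int.from_bytes(digest, "big")
        if (fingerprint & mask) == 0 or length >= max_size:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing remainder
    return chunks

def reused(old_chunks, new_chunks):
    """Count chunks of the new version whose bytes already exist in the old version."""
    known = set(old_chunks)
    return sum(1 for chunk in new_chunks if chunk in known)

rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(4096))
edited = original[:10] + b"EDIT!" + original[10:]  # small local insertion

print("fixed-size reuse:", reused(fixed_chunks(original), fixed_chunks(edited)))
print("content-defined reuse:",
      reused(content_defined_chunks(original), content_defined_chunks(edited)))
```

With fixed offsets, the insertion shifts every later boundary, so almost nothing matches. With content-defined cutting, boundaries depend only on nearby bytes, so the cuts realign shortly after the edited region.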
When It Is Applied
CDC chunking is used only when all of the following conditions are true:
- backup.cdc_enabled is true
- the file size is greater than 0
- the file size is at least backup.cdc_file_size_threshold
- the file path relative to backup.source_root matches backup.cdc_patterns
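The four conditions above can be read as a single predicate. The following is a minimal sketch under stated assumptions: the CdcConfig class, should_use_cdc, and glob_to_regex are hypothetical names, and the glob translation is a simplified stand-in, so the pattern syntax Prime Backup actually accepts may differ.

```python
import re
from dataclasses import dataclass, field

@dataclass
class CdcConfig:
    # Field names mirror the backup.* options; the class itself is hypothetical.
    cdc_enabled: bool = False
    cdc_file_size_threshold: int = 100 * 1024 * 1024
    cdc_patterns: list = field(default_factory=lambda: ["**/*.db"])

def glob_to_regex(pattern):
    """Simplified glob translation (assumption): '**/' spans any number of
    directories, while '*' and '?' stay within one path segment."""
    parts, i = [], 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            parts.append(r"(?:.*/)?")
            i += 3
        elif pattern[i] == "*":
            parts.append(r"[^/]*")
            i += 1
        elif pattern[i] == "?":
            parts.append(r"[^/]")
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts))

def should_use_cdc(cfg, relative_path, file_size):
    """True only when every condition holds."""
    return (
        cfg.cdc_enabled
        and file_size > 0
        and file_size >= cfg.cdc_file_size_threshold
        and any(glob_to_regex(p).fullmatch(relative_path) for p in cfg.cdc_patterns)
    )

cfg = CdcConfig(cdc_enabled=True)
print(should_use_cdc(cfg, "world/data.db", 200 * 1024 * 1024))
print(should_use_cdc(cfg, "world/level.dat", 200 * 1024 * 1024))
```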
The default configuration, in which the threshold of 104857600 bytes equals 100 MiB, is:
{
    "cdc_enabled": false,
    "cdc_file_size_threshold": 104857600,
    "cdc_patterns": [
        "**/*.db"
    ]
}
How It Is Stored
Prime Backup still creates one blob record for the whole file, but the blob uses the chunked storage method instead of direct.
The current implementation works in the following order:
- Cut the file with FastCDC
- Calculate a BLAKE3 hash for each chunk
- Reuse chunks that already exist in storage
- Compress and write only the new chunks
- Bind the ordered chunk list back to the whole-file blob (by offset)
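The order of operations above can be sketched in a few lines. This is an illustration under assumptions, not the actual implementation: the input is a list of chunks that has already been cut, hashlib.blake2b stands in for BLAKE3, zlib stands in for the configured compression method, and an in-memory dict stands in for chunk storage. The names store_chunked and restore are hypothetical.

```python
import hashlib
import zlib

def store_chunked(chunks, chunk_store):
    """Store one file given as an ordered list of chunk bytes.

    chunk_store maps chunk hash -> compressed payload.
    Returns (bindings, new_chunk_count), where bindings is an ordered
    list of (offset_in_file, chunk_hash).
    """
    bindings, offset, new_chunks = [], 0, 0
    for chunk in chunks:
        # Stand-in hash; the source states BLAKE3 is used for chunks.
        chunk_hash = hashlib.blake2b(chunk, digest_size=32).hexdigest()
        if chunk_hash not in chunk_store:
            # Only chunks not seen before are compressed and written.
            chunk_store[chunk_hash] = zlib.compress(chunk)
            new_chunks += 1
        bindings.append((offset, chunk_hash))
        offset += len(chunk)
    return bindings, new_chunks

def restore(bindings, chunk_store):
    """Rebuild the whole file by concatenating chunks in offset order."""
    ordered = sorted(bindings, key=lambda row: row[0])
    return b"".join(zlib.decompress(chunk_store[h]) for _, h in ordered)

store = {}
v1 = [b"a" * 100, b"b" * 100, b"c" * 100]
v2 = [b"a" * 100, b"x" * 50, b"c" * 100]  # only the middle chunk changed
bindings_v1, new_v1 = store_chunked(v1, store)
bindings_v2, new_v2 = store_chunked(v2, store)
print(new_v1, new_v2)
```

The second version writes a single new chunk, while its binding list still describes the complete file.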
The current chunking parameters are fixed in code:
- average chunk size: 256 KiB
- minimum chunk size: 64 KiB
- maximum chunk size: 1 MiB
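These sizes bound how many chunks a file can produce. The sketch below is plain arithmetic rather than anything taken from the implementation. It assumes that only the final chunk may be shorter than the minimum, and it treats the average size as a target, so the typical figure is only an estimate. The function name is hypothetical.

```python
KIB = 1024
MIB = 1024 * KIB

AVG_CHUNK = 256 * KIB
MIN_CHUNK = 64 * KIB
MAX_CHUNK = 1 * MIB

def chunk_count_bounds(file_size):
    """Return (fewest, typical, most) chunk counts for a file of file_size bytes."""
    fewest = -(-file_size // MAX_CHUNK)   # ceiling division: every chunk is at most MAX_CHUNK
    most = -(-file_size // MIN_CHUNK)     # every chunk except the last is at least MIN_CHUNK
    typical = max(1, round(file_size / AVG_CHUNK))
    return fewest, typical, most

print(chunk_count_bounds(100 * MIB))  # a file exactly at the default threshold
```

For a 100 MiB file this gives between 100 and 1600 chunks, with roughly 400 expected.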
Metadata Optimization (Chunk Groups)
Conceptually, a chunked blob is just an ordered list of chunks. Storing a direct binding row for every blob-chunk pair would be expensive, so the implementation groups consecutive chunks into chunk groups and stores two bindings:
- blob -> chunk group (by blob offset)
- chunk group -> chunk (by group offset)
+--------------------------------------------------------------------------------+
|                                      blob                                      |
+--------------------------+--------------------------+--------------------------+
|      chunk group 1       |      chunk group 2       |      chunk group 3       |
+--------+--------+--------+--------+--------+--------+--------+--------+--------+
| chunk1 | chunk2 | chunk3 | chunk4 | chunk5 | chunk6 | chunk7 | chunk8 | chunk9 |
+--------+--------+--------+--------+--------+--------+--------+--------+--------+
This reduces metadata overhead without changing the logical model.
Chunk hashes and chunk group hashes always use blake3, while the whole-file blob hash still follows backup.hash_method.
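The two-level binding can be sketched as two small tables. Everything here is an assumption made for illustration: the fixed group size of 3 (matching the diagram), the way a group hash is derived from its member hashes, the BLAKE2b stand-in for BLAKE3, and the names build_groups and flatten. The point is only that the flat, ordered chunk list can always be recovered from the two bindings.

```python
import hashlib

def build_groups(chunks, group_size=3):
    """chunks: ordered list of (chunk_hash, length).

    Returns:
      blob_to_group:  ordered list of (blob_offset, group_hash)
      group_to_chunk: dict group_hash -> list of (group_offset, chunk_hash)
    """
    blob_to_group, group_to_chunk = [], {}
    blob_offset = 0
    for i in range(0, len(chunks), group_size):
        members = chunks[i:i + group_size]
        # Hypothetical group identity: a hash over the ordered member hashes.
        joined = ",".join(chunk_hash for chunk_hash, _ in members).encode()
        group_hash = hashlib.blake2b(joined, digest_size=16).hexdigest()
        rows, group_offset = [], 0
        for chunk_hash, length in members:
            rows.append((group_offset, chunk_hash))
            group_offset += length
        group_to_chunk.setdefault(group_hash, rows)  # identical groups share their rows
        blob_to_group.append((blob_offset, group_hash))
        blob_offset += group_offset
    return blob_to_group, group_to_chunk

def flatten(blob_to_group, group_to_chunk):
    """Recover the logical model: an ordered list of (blob_offset, chunk_hash)."""
    return [
        (blob_offset + group_offset, chunk_hash)
        for blob_offset, group_hash in blob_to_group
        for group_offset, chunk_hash in group_to_chunk[group_hash]
    ]

chunks = [(f"chunk{i}", 10) for i in range(1, 10)]
blob_to_group, group_to_chunk = build_groups(chunks)
print(len(blob_to_group), "blob -> group rows")
print(flatten(blob_to_group, group_to_chunk)[:3])
```

Under this sketch, a second blob that contains an identical group adds only one blob -> group row, because the group -> chunk rows already exist.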
Compression and Performance
CDC chunking does not disable compression.
For a chunked blob:
- the blob record itself uses plain as its own compression marker
- each chunk is compressed independently according to backup.compress_method and backup.compress_threshold
- the blob stored_size is the sum of unique stored chunk sizes
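A short sketch of the accounting described above. It assumes that the threshold is compared against each chunk's own length and that chunks below it are stored uncompressed. zlib is a stand-in for whatever backup.compress_method selects, and both function names are hypothetical.

```python
import zlib

def pack_chunk(data, compress_threshold):
    """Return (method, payload) for one chunk, compressed independently of the others."""
    if len(data) >= compress_threshold:
        return "zlib", zlib.compress(data)  # stand-in for the configured method
    return "plain", data

def blob_stored_size(chunk_hashes, stored_sizes):
    """Sum stored sizes over the unique chunks of one blob.

    chunk_hashes: ordered chunk hashes of the blob (may contain repeats)
    stored_sizes: dict chunk_hash -> size of the stored payload
    """
    return sum(stored_sizes[h] for h in set(chunk_hashes))

stored_sizes = {"a": 10, "b": 20}
print(blob_stored_size(["a", "b", "a"], stored_sizes))  # the repeated chunk "a" is counted once
print(pack_chunk(b"x" * 1000, compress_threshold=64)[0])
```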
Compared with direct blob storage, CDC is slower in the current implementation. Before new data is written, Prime Backup needs extra work to cut the file, calculate hashes, and read chunk contents again for chunk creation and verification.
Because of that, CDC is best reserved for files that are large, frequently modified locally, and worth backing up.
Good Candidates
CDC is a good fit when most backups only change part of a large file, for example:
- large database files with local updates
- large log files that grow by appending and need to be backed up
- files that are often modified by local insertion, deletion, or small-range updates
Poor Candidates
CDC is usually not a good fit when the whole file changes almost everywhere in each backup, for example:
- files that are rewritten completely every time
- exported artifacts whose entire content layout changes on each generation
- already compressed or encrypted files with poor chunk reuse after edits
Also note that the first backup containing a file still needs to write all chunks, so CDC mainly pays off on later backups with high chunk reuse.
If Prime Backup finds that one chunked file produced many brand-new chunks, it emits a warning in the logs. That usually means the file is not a good CDC target, unless this is the first backup containing that file.
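The idea behind such a warning can be sketched as a simple ratio check. The 0.8 threshold, the logger name, the message text, and the function name below are all hypothetical. The source only states that a warning is logged when many chunks are new, not how "many" is defined.

```python
import logging

logger = logging.getLogger("cdc_sketch")  # hypothetical logger name

def warn_on_low_reuse(path, total_chunks, new_chunks, warn_ratio=0.8):
    """Log a warning and return True when most chunks of a file were new.

    warn_ratio is an assumed cut-off, not a documented value.
    """
    if total_chunks == 0:
        return False
    new_ratio = new_chunks / total_chunks
    if new_ratio >= warn_ratio:
        logger.warning(
            "%s: %d of %d chunks were new (%.0f%%); it may be a poor CDC target",
            path, new_chunks, total_chunks, new_ratio * 100,
        )
        return True
    return False

warn_on_low_reuse("world/data.db", total_chunks=400, new_chunks=390)  # logs a warning
warn_on_low_reuse("world/data.db", total_chunks=400, new_chunks=12)   # stays quiet
```

A high ratio on the first backup of a file is expected, since nothing exists to reuse yet.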
Dependencies and Observation
CDC chunking requires the optional Python dependency pyfastcdc.
You can install it directly, or install the optional dependency bundle from requirements.optional.txt.
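Before enabling CDC, it can be useful to confirm that the optional dependency is importable in the environment that runs Prime Backup. The check below uses only the standard library. It assumes the import name matches the package name pyfastcdc, and the helper name is hypothetical.

```python
import importlib.util

def cdc_dependency_available(module_name="pyfastcdc"):
    """Return True when the module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None

if cdc_dependency_available():
    print("pyfastcdc found")
else:
    print("pyfastcdc missing; install it before enabling CDC")
```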
Prime Backup maintenance logic already understands chunked storage.
You can inspect the effect with !!pb database overview, which includes a dedicated chunk statistics section.