Btrfs/Deduplication

From Forza's ramblings

Reflinks

A notable feature of Btrfs is the ability to clone files atomically. Such a clone is usually called a reflink.

This allows the user to make an instant copy of a file. Like a hard link, the copy initially takes no extra data space, but unlike a hard link it is a separate file: when the original or the copy is modified, Copy-on-Write (CoW) ensures that the change affects only the file that was written to.

File cloning (reflink, copy-on-write) via cp:

cp --reflink <source file> <destination file>
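The CoW behaviour described above can be checked with a small experiment. This is only a sketch, assuming GNU coreutils cp; the scratch directory and file names are placeholders. It uses --reflink=auto, which clones where the filesystem supports it and falls back to an ordinary copy elsewhere, so it runs on any filesystem (only on Btrfs or another reflink-capable filesystem is the data actually shared).

```shell
#!/bin/sh
# Work in a scratch directory; all file names here are placeholders.
workdir=$(mktemp -d)
printf 'original data\n' > "$workdir/source.txt"

# --reflink=auto clones when the filesystem supports it (for example Btrfs)
# and falls back to a normal copy otherwise.
# --reflink=always would fail with an error instead of falling back.
cp --reflink=auto "$workdir/source.txt" "$workdir/clone.txt"

# Modify only the clone; the source must stay untouched.
printf 'changed in clone\n' >> "$workdir/clone.txt"

source_lines=$(wc -l < "$workdir/source.txt")
clone_lines=$(wc -l < "$workdir/clone.txt")
echo "source: $source_lines line(s), clone: $clone_lines line(s)"

rm -rf "$workdir"
```

The source keeps its single line while the clone gains a second one, which is the independence that CoW is meant to guarantee.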

Tip: Put an alias for cp in your .bash_profile or in /etc/profile.d/ so that cp always makes reflinks. See Blog/Bash Aliases.
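A minimal sketch of such an alias, assuming bash and GNU coreutils cp. The choice of --reflink=auto rather than plain --reflink is an assumption made here so that cp keeps working on filesystems without reflink support. As far as I know, GNU coreutils 9.0 and later already behave this way by default, so the alias mainly matters on older systems.

```shell
# In ~/.bash_profile, or in a file under /etc/profile.d/
# (for example reflink.sh, a placeholder name).
# Clone when possible, fall back to a normal copy otherwise.
alias cp='cp --reflink=auto'
```

Aliases only apply to interactive shells, so scripts that call cp are not affected by this.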

Deduplication

Deduplication means taking two or more files and replacing their identical parts with reflinked copies, so that the shared data is stored only once. If one of the files is later changed, CoW makes sure that the files remain independent of each other. The smallest part of a file that can be deduplicated is 4 KiB.

There are several tools that can do deduplication on Btrfs. They either operate on whole files or on individual filesystem blocks. Block-based deduplication can be more efficient because it can match identical parts of otherwise different files, but the downside is that the deduplication process can be slower.
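To illustrate the difference between the two approaches, here is a rough sketch that compares two files first as a whole and then in 4 KiB blocks. It is not how any of the tools below are implemented; the helper names (make_block, block_hashes) and the test files are made up for this example, and it assumes GNU coreutils (head -c, sha256sum, comm).

```shell
#!/bin/sh
workdir=$(mktemp -d)

# Hypothetical helper: write one 4 KiB block filled with a repeated letter.
make_block() { yes "$1" | head -c 4096; }

# Two 12 KiB files that share their first two blocks and differ in the third.
{ make_block A; make_block B; make_block C; } > "$workdir/a.bin"
{ make_block A; make_block B; make_block D; } > "$workdir/b.bin"

# Hypothetical helper: print one SHA-256 per 4 KiB block of file $1.
block_hashes() {
    size=$(wc -c < "$1")
    blocks=$(( (size + 4095) / 4096 ))
    i=0
    while [ "$i" -lt "$blocks" ]; do
        dd if="$1" bs=4096 skip="$i" count=1 2>/dev/null | sha256sum | cut -d' ' -f1
        i=$((i + 1))
    done
}

# File-based view: the files differ, so nothing could be deduplicated.
if cmp -s "$workdir/a.bin" "$workdir/b.bin"; then
    whole_file_match=yes
else
    whole_file_match=no
fi

# Block-based view: count the block checksums that appear in both files.
block_hashes "$workdir/a.bin" | sort -u > "$workdir/a.hashes"
block_hashes "$workdir/b.bin" | sort -u > "$workdir/b.hashes"
shared_blocks=$(comm -12 "$workdir/a.hashes" "$workdir/b.hashes" | wc -l)

echo "whole files identical: $whole_file_match"
echo "4 KiB blocks in common: $shared_blocks"

rm -rf "$workdir"
```

A file-based tool would skip this pair entirely, while a block-based tool could still share the two common blocks, at the cost of hashing every block.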

| Name | File-based | Block-based | Incremental | Notes |
|------|------------|-------------|-------------|-------|
| duperemove | Yes | No | Yes | Uses an SQLite database for checksums. Works on extent boundaries by default, but has an option to compare more carefully. |
| bees | No | Yes | Yes | Runs continuously as a daemon. Uses a database to track deduplication, useful for large storage like backup servers. |
| dduper | Yes | Yes | No | Uses the built-in Btrfs csum-tree, so it is extremely fast and lightweight. Requires a patched btrfs-progs for csum access. Be careful of hash collisions on large filesystems! |
| rmlint | Yes | No | Partial | A duplicate file finder with Btrfs support. |
| jdupes | Yes | No | No | A fork of fdupes which includes support for Btrfs deduplication when it identifies duplicate files. |
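As an example of the incremental, database-backed approach, a duperemove run might look like the sketch below. The target directory and hashfile location are placeholders, and the RUN variable is a safety switch invented for this example: by default the script only prints the command it would run.

```shell
#!/bin/sh
# Hypothetical paths; adjust them to your system.
target_dir="/mnt/data"
hashfile="/var/tmp/duperemove.hash"

# -d          submit duplicate extents to the kernel for deduplication
# -h          print sizes in human-readable units
# -r          recurse into subdirectories
# --hashfile  keep checksums in an SQLite file so that later runs are incremental
dedupe_cmd="duperemove -dhr --hashfile=$hashfile $target_dir"

# Show the command first; run it only if RUN=1 is set and duperemove is installed.
echo "$dedupe_cmd"
if [ "${RUN:-0}" = 1 ] && command -v duperemove >/dev/null 2>&1; then
    $dedupe_cmd
fi
```

Reusing the same hashfile on later runs lets duperemove skip files it has already checksummed.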

Bees

Bees is perhaps the most Btrfs-specific of these tools, as it is made to work only with Btrfs filesystems. It differs from the others in several ways:

  1. It uses a very lightweight database (hash table) to keep track of deduplication progress. You can stop and restart and it continues from where it left off.
  2. Memory usage is fixed: it never grows beyond the database size plus a small amount for the bees runtime, whereas most other dedup tools can grow to multi-gigabyte RAM usage.
  3. It runs continuously in the background, deduplicating any newly written data.
  4. It runs only on whole filesystems, not on a limited set of files.
  5. Bees can split extents to achieve better deduplication results.
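Since bees works only on Btrfs, and only on a whole filesystem, it can be worth confirming what a mount point actually is before setting bees up. The check below is a sketch that assumes GNU stat; mount_point is a placeholder, and nothing in it is part of bees itself.

```shell
#!/bin/sh
# Placeholder: the mount point of the filesystem you intend to deduplicate.
mount_point="${1:-/}"

# stat -f reports on the filesystem rather than the file;
# %T prints the filesystem type in human-readable form.
fs_type=$(stat -f -c %T "$mount_point")

if [ "$fs_type" = "btrfs" ]; then
    echo "$mount_point is on Btrfs"
else
    echo "$mount_point is on $fs_type, not Btrfs"
fi
```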

The manual is a little awkward for non-bees people. Have a look at Btrfs/Deduplication/Bees for a quick user-guide to get started.