A distinctive feature of Btrfs is the ability to clone files atomically. This is usually called a reflink.
This allows the user to make an instant copy of a file, even if it is several GiB in size. It is similar to a hard link, but with the big advantage that the two files remain separate. When the original file or the copy is modified, Copy-on-Write (CoW) ensures that the files remain unique from each other.
File cloning (reflink, copy-on-write) via cp:
# cp --reflink <source file> <destination file>
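The independence of a clone from its original can be checked with a small experiment. This is a minimal sketch that assumes GNU coreutils; the scratch directory and file names are made up for illustration. It uses `--reflink=auto`, which falls back to an ordinary copy on filesystems without reflink support, so the script also runs outside Btrfs.

```shell
#!/bin/sh
# Sketch: clone a file, modify the clone, and confirm the original is untouched.
# All paths below are hypothetical scratch files.
workdir=$(mktemp -d)

printf 'original data\n' > "$workdir/source.txt"

# On Btrfs this shares the data extents; elsewhere it is a plain copy.
cp --reflink=auto "$workdir/source.txt" "$workdir/clone.txt"

# Appending to the clone triggers CoW for the changed blocks only.
printf 'extra line\n' >> "$workdir/clone.txt"

# The original still contains only its first line.
cat "$workdir/source.txt"
```

Use `--reflink=always` instead when the copy should fail rather than silently fall back to a full copy.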
Reflinks can cross subvolumes on the same filesystem, but they cannot cross between different filesystems.
A limitation of the Linux VFS layer is that, for reflinks to work between subvolumes, both subvolumes have to be visible from the same mount point.
Below is a list of mount points. The mounts in /media/xxx are subvolumes of the same filesystem on /dev/sdb2. The top-level volume is mounted as /mnt/archive, with its subvolumes visible beneath it.
List of mount points
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda3       230G   61G  160G  28% /
/dev/sdb2        19T   11T  7.6T  59% /media/filehistory
/dev/sdb2        19T   11T  7.6T  59% /media/downloads
/dev/sdb2        19T   11T  7.6T  59% /media/userData
/dev/sdb2        19T   11T  7.6T  59% /media/vm
/dev/sdb2        19T   11T  7.6T  59% /mnt/archive
# cp --reflink=always /media/downloads/archlinux.iso /media/userData/Linux
This will fail because the subvolumes are not directly visible under a common parent. The directories under /media are from different mount points.
# cp --reflink=always /mnt/archive/volume/download/archlinux.iso /mnt/archive/volume/userData/Linux
This will work because both paths share the parent /mnt/archive in the same mount point.
Deduplication means taking two or more files and joining their equal parts as reflinked copies. If one of the files is changed, CoW makes sure that the files remain unique from each other. The smallest possible part of a file to deduplicate is 4 KiB.
There are several tools that can do deduplication on Btrfs. They either operate on whole files or on individual filesystem blocks. Block-based deduplication can be more efficient because it can match equal parts of otherwise different files, but the downside is that the deduplication process can be slower.
|duperemove||Yes||No||Yes||Uses an SQLite database for checksums. Works on extent boundaries by default, but has an option to compare more carefully.|
|bees||No||Yes||Yes||Runs continuously as a daemon. Uses a database to track deduplication, useful for large storage like backup servers.|
|dduper||Yes||Yes||No||Uses the built-in Btrfs csum-tree, so it is extremely fast and lightweight. Requires a btrfs-progs patch for csum access. Be careful of hash collisions on large filesystems!|
|rmlint||Yes||No||Partial||A duplicate file finder with btrfs support.|
|jdupes||Yes||No||No||A fork of fdupes which includes support for BTRFS deduplication when it identifies duplicate files.|
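To illustrate the whole-file approach, here is a rough sketch of how candidate duplicates might be found by hashing. It is not how any of the listed tools is implemented. The helper name `list_duplicates` and the scratch files are hypothetical, and GNU coreutils (`sha256sum`, `uniq -w`) are assumed. It only reports candidates; the real tools then ask the kernel to compare and share the data.

```shell
#!/bin/sh
# list_duplicates is a hypothetical helper: it prints every regular file
# under the given directory whose SHA-256 digest matches another file's.
list_duplicates() {
    find "$1" -type f -exec sha256sum {} + |
        sort |
        # Compare only the 64-character digest, print all members of each group.
        uniq -w 64 --all-repeated=separate
}

# Demonstration on made-up scratch files
scratch=$(mktemp -d)
printf 'same content\n' > "$scratch/a.txt"
printf 'same content\n' > "$scratch/b.txt"
printf 'different\n'    > "$scratch/c.txt"

# Lists a.txt and b.txt, but not c.txt
list_duplicates "$scratch"
```

A block-based tool does the same kind of matching per block or extent rather than per file, which is why it can find more duplicates but takes longer.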
Bees is perhaps the most Btrfs-specific tool, as it is made to work only with Btrfs filesystems. It differs from the other tools in several ways:
- It uses a very lightweight database (hash table) to keep track of deduplication progress. You can stop and restart and it continues from where it left off.
- Memory usage is fixed and never grows beyond the database size plus a small amount for the bees runtime, whereas most other dedup tools can grow to multi-gigabyte RAM usage.
- It runs continuously in the background, deduplicating any newly written data.
- It runs only on whole filesystems, not on a limited set of files.
- Bees can split extents to achieve better deduplication results.
The manual is a little awkward for non-bees people. Have a look at Btrfs/Deduplication/Bees for a quick user-guide to get started.
Duperemove works on individual file extents similar to Bees, but it operates on individual files instead of the whole filesystem. It scans each file, calculates block hashes and submits duplicates to the Linux kernel for deduplication.
One big advantage is that duperemove can store calculated hashes in a database. This makes it possible to re-run duperemove without rescanning all the old files, while still detecting new files and adding them to the database. This greatly reduces the time for each run.
Duperemove can also take input from the fdupes program, avoiding the need to do any hash checking at all.
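A typical invocation with a hash database might look like the sketch below. The hashfile location and target directory are assumptions chosen for illustration. The options `-d` (deduplicate), `-r` (recurse into directories) and `--hashfile` are described in the duperemove manual. The script skips the run if duperemove or the target directory is not present.

```shell
#!/bin/sh
# Hypothetical paths; adjust both to your own system.
hashfile="/var/tmp/duperemove-hashes.db"
target="/media/userData"

if command -v duperemove >/dev/null 2>&1 && [ -d "$target" ]; then
    # -d submits duplicates to the kernel, -r recurses into subdirectories.
    # Re-running with the same hashfile only hashes new or changed files.
    duperemove -dr --hashfile="$hashfile" "$target"
else
    echo "duperemove or $target not available, skipping"
fi
```

Without `-d`, duperemove only reports what it would deduplicate, which is a safe way to try it out first.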
For some examples of how to use Duperemove, head over to my user guide at Btrfs/Deduplication/Duperemove.
The official web page for Duperemove is at http://markfasheh.github.io/duperemove/