From Forza's ramblings

Deduplication with Duperemove

Extents are the blocks of data that files consist of in a Btrfs filesystem. Duperemove is a tool that finds duplicate extents in files and can submit them to the Linux kernel for deduplication.

It is possible to deduplicate identical blocks between several files as well as between different extents within a single file.

An advanced example that I often use myself:

chrt -i 0 duperemove -A -h -d -r -v -b128k --dedupe-options=noblock,same --lookup-extents=yes --io-threads=1

Here I'm using chrt -i 0 to run duperemove in the idle scheduling class, so that it only uses spare CPU time. --io-threads=1 minimises disk thrashing. The files or directories to process go at the end of the command.
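As a rough sketch, the command above could be wrapped in a small shell function so that the paths to process are passed as arguments. The function name idle_dedupe and the DRY_RUN switch are made up for this example and are not part of duperemove or chrt; the flags are the ones from the command above.

```shell
#!/bin/sh
# Sketch only: idle_dedupe and DRY_RUN are hypothetical names.
# Runs duperemove in the idle scheduling class with one I/O thread.

idle_dedupe() {
    # "$@" holds the files or directories to process.
    set -- chrt -i 0 duperemove -A -h -d -r -v -b128k \
        --dedupe-options=noblock,same --lookup-extents=yes \
        --io-threads=1 "$@"
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"    # print the command line instead of running it
    else
        "$@"
    fi
}
```

Setting DRY_RUN=1 before calling idle_dedupe with a path prints the command line rather than running it, which is a cheap way to check the flags before touching any data.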

Command Line

Basic usage is simply: duperemove <some files>. This does a read-only scan and prints the results. It does not do any actual deduplication.

# duperemove /media/vm/libvirt/images/Mint_*.qcow2
Gathering file list...
Adding files from database for hashing.
Using 4 threads for file hashing phase
[1/2]  (50.00%) csum: /media/vm/libvirt/images/Mint_Cinnamon.qcow2
[2/2] (100.00%) csum: /media/vm/libvirt/images/Mint_Mate.qcow2
Total files:  2
Total extent hashes: 5700
Loading only duplicated hashes from hashfile.
Found 97 identical extents.
Simple read and compare of file data found 13 instances of extents that might benefit from deduplication.

Showing 63 identical extents of length 65536 with id 29af5479
Start          Filename
294518784      "/media/vm/libvirt/images/Mint_Cinnamon.qcow2"
7408582656     "/media/vm/libvirt/images/Mint_Cinnamon.qcow2"
7409631232     "/media/vm/libvirt/images/Mint_Cinnamon.qcow2"
7411007488     "/media/vm/libvirt/images/Mint_Cinnamon.qcow2"
7415267328     "/media/vm/libvirt/images/Mint_Cinnamon.qcow2"

...cut for brevity

12852658176     "/media/vm/libvirt/images/Mint_Cinnamon.qcow2"
12853051392     "/media/vm/libvirt/images/Mint_Cinnamon.qcow2"

Showing 3 identical extents of length 524288 with id 1fc1958f
Start           Filename
12581863424     "/media/vm/libvirt/images/Mint_Cinnamon.qcow2"
10637017088     "/media/vm/libvirt/images/Mint_Mate.qcow2"
10931208192     "/media/vm/libvirt/images/Mint_Mate.qcow2"

Showing 2 identical extents of length 524288 with id 6e892bbd
Start           Filename
2952658944      "/media/vm/libvirt/images/Mint_Cinnamon.qcow2"
2956066816      "/media/vm/libvirt/images/Mint_Mate.qcow2"
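The scan output can be turned into a rough upper bound on reclaimable space. The sketch below assumes the "Showing N identical extents of length L with id X" header lines shown above, with lengths in bytes (so the scan has to run without -h). The function name estimate_savings is hypothetical, and other duperemove versions may format these lines differently. The figure is an upper bound, since some of the listed extents may already be shared.

```shell
#!/bin/sh
# Sketch only: estimate_savings is a hypothetical helper.
# Reads duperemove scan output on stdin and sums (N - 1) * L per group.

estimate_savings() {
    awk '
        /^Showing [0-9]+ identical extents of length [0-9]+/ {
            count = $2    # number of identical extents in this group
            len = $7      # extent length in bytes
            # One copy has to stay, so at most (count - 1) copies are freed.
            total += (count - 1) * len
            groups++
        }
        END {
            printf "%d groups, up to %.0f bytes reclaimable\n", groups, total
        }
    '
}
```

It would be used by piping a read-only scan into it, for example: duperemove <some files> | estimate_savings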

To do the actual deduplication of your files you need to pass the -d option. Here I'm deduping the root subvolume of a virtual machine.

# duperemove -d -h -r /mnt/rootvol/@
Using 4 threads for dedupe phase
[0x562ddba99860] (001/274) Try to dedupe extents with id 485b1fab
[0x562ddba99860] Dedupe 1 extents (id: 485b1fab) with target: (128.0K, 2.8K), "/mnt/rootvol/@/usr/lib/modules/5.8.0-59-generic/kernel/drivers/staging/wilc1000/wilc1000.ko"
[0x562ddba99800] (002/274) Try to dedupe extents with id 296010f4
[0x562ddba99800] Dedupe 1 extents (id: 296010f4) with target: (256.0K, 8.8K), "/mnt/rootvol/@/usr/lib/modules/5.8.0-59-generic/kernel/drivers/staging/media/ipu3/ipu3-imgu.ko"
[0x562ddba99360] (003/274) Try to dedupe extents with id 8c91e19c
[0x562ddba99360] Dedupe 1 extents (id: 8c91e19c) with target: (128.0K, 22.5K), "/mnt/rootvol/@/usr/lib/modules/5.8.0-59-generic/kernel/drivers/staging/qlge/qlge.ko"
[0x562ddba990c0] (004/274) Try to dedupe extents with id 527a046f
[0x562ddba990c0] Dedupe 1 extents (id: 527a046f) with target: (2.1M, 10.7K), "/mnt/rootvol/@/usr/lib/firmware/netronome/bpf/nic_AMDA0099-0001_2x10.nffw"
[0x562ddba990c0] (005/274) Try to dedupe extents with id 4a47103d
[0x562ddba990c0] Dedupe 1 extents (id: 4a47103d) with target: (128.0K, 30.1K), "/mnt/rootvol/@/var/lib/aspell/en-wo_accents-only.rws"
[0x562ddba990c0] (006/274) Try to dedupe extents with id 2a3dd103
[0x562ddba990c0] Dedupe 1 extents (id: 2a3dd103) with target: (256.0K, 30.4K), "/mnt/rootvol/@/usr/share/m2300w/0.51/psfiles/CHP410-1200-Photo.crd"
[0x562ddba990c0] (007/274) Try to dedupe extents with id 2970448c
[0x562ddba99800] (008/274) Try to dedupe extents with id 8a0d865b
[0x562ddba99360] (009/274) Try to dedupe extents with id 6ab91b2a

...snip for brevity...

[0x562ddba990c0] (271/274) Try to dedupe extents with id c56ef4a7
[0x562ddba990c0] Dedupe 1 extents (id: c56ef4a7) with target: (24.1M, 4.0M), "/mnt/rootvol/@/var/log/journal/939972095cf1459c8b22cc608eff85da/system.journal"
[0x562ddba99800] (272/274) Try to dedupe extents with id fa0f104f
[0x562ddba99800] Dedupe 1 extents (id: fa0f104f) with target: (0.0, 4.4M), "/mnt/rootvol/@/boot/initrd.img-5.8.0-55-generic"
[0x562ddba99860] (273/274) Try to dedupe extents with id 22e16fe0
[0x562ddba99860] Dedupe 1 extents (id: 22e16fe0) with target: (24.1M, 4.5M), "/mnt/rootvol/@/var/log/journal/939972095cf1459c8b22cc608eff85da/system.journal"
[0x562ddba990c0] (274/274) Try to dedupe extents with id 56b16eb7
[0x562ddba990c0] Dedupe 2 extents (id: 56b16eb7) with target: (0.0, 6.5M), "/mnt/rootvol/@/var/lib/aptitude/pkgstates.old"

Comparison of extent info shows a net change in shared extents of: 39.5M
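On a large run the dedupe log gets long, so it can help to summarise it per file. The sketch below assumes the 'Dedupe N extents (id: ...) with target: (...), "path"' line format from the output above; dedupe_summary is a hypothetical name, and paths that contain a double quote would not be parsed correctly.

```shell
#!/bin/sh
# Sketch only: dedupe_summary is a hypothetical helper.
# Reads "duperemove -d" output on stdin and prints, per target file,
# the sum of the N values from its "Dedupe N extents" lines.

dedupe_summary() {
    awk -F'"' '
        / Dedupe [0-9]+ extents / {
            split($1, words, " ")      # words[3] is the extent count
            extents[$2] += words[3]    # $2 is the quoted file path
        }
        END {
            for (path in extents)
                printf "%d\t%s\n", extents[path], path
        }
    ' | sort -rn
}
```

Files with the most extents are listed first, so piping the result through head shows where most of the dedupe requests went.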

Useful command line options

Option Description
-b size Select the size of blocks to dedupe. A smaller size increases the chance of finding duplicates, but also increases duperemove's memory usage. Smallest is 4k. Default is 128k.
-h Prints sizes in human-readable form, using KiB, MiB and GiB instead of bytes.
-r Runs duperemove recursively into subdirectories.
-A Opens files in read-only mode. This makes it possible to run duperemove on read-only snapshots. Requires sudo/root privileges.
--io-threads=N Use N threads for I/O. This is used by the file hashing and dedupe stages. The default is based on the number of CPUs. On spinning media it might be useful to limit io-threads to 4 or fewer.
--lookup-extents=yes Allows duperemove to skip checksumming some blocks by checking their extent state. Defaults to no.
--hashfile=hashfile Use a file for storage of hashes instead of memory. This option drastically reduces the memory footprint of duperemove and is recommended when your data set is more than a few files. Hashfiles are also reusable, which speeds up subsequent runs of duperemove.
--fdupes Run in fdupes mode. With this option you can pipe the output of fdupes to duperemove to dedupe any duplicate files found. When receiving a file list in this manner, duperemove will skip the hashing phase.
--dedupe-options=options A comma-separated list of dedupe options. Prepend no to an option to turn it off.
noblock Turns off block, which defaults to on. With block on, duperemove submits duplicate blocks directly to the dedupe engine. With noblock, duperemove first optimizes the duplicate block lists into larger extents before submitting them. The search algorithm used for this has a very high memory and CPU overhead, but may reduce the number of extent references created during dedupe.
same Allows deduplication of extents within the same file.
partial Allows deduplication of partial extents. This can increase the success of duperemove but uses more CPU and is therefore not used by default. It is under active development so the effects of this flag could change in the future.
nofiemap Defaults to on (fiemap). Duperemove uses the fiemap ioctl during the dedupe stage to optimize out already deduped extents as well as to provide an estimate of the space saved after dedupe operations are complete.

Unfortunately, in some circumstances the fiemap ioctl slows down as the number of references on a file extent goes up. If the dedupe phase is slowing down or "locking up", nofiemap may give you a significant amount of performance back.

Note: nofiemap does not turn off all usage of fiemap. To disable fiemap during the file scan stage as well, also use the --lookup-extents=no option.
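To get a feel for why a smaller -b size costs more memory, it helps to count how many blocks a scan has to track. The sketch below simply divides the data size by the block size, rounding up. It assumes roughly one hash per block, which is a simplification, and it makes no claim about how many bytes duperemove needs per hash. The function name blocks_for and the 1 TiB data size are made up for the example.

```shell
#!/bin/sh
# Sketch only: blocks_for is a hypothetical helper.
# Prints the number of blocks of block_bytes needed to cover data_bytes.

blocks_for() {
    data_bytes=$1
    block_bytes=$2
    # Integer division, rounded up.
    echo $(( (data_bytes + block_bytes - 1) / block_bytes ))
}

one_tib=$(( 1024 * 1024 * 1024 * 1024 ))
blocks_for "$one_tib" $(( 128 * 1024 ))    # default block size, 128k
blocks_for "$one_tib" $(( 4 * 1024 ))      # smallest block size, 4k
```

For 1 TiB of data this prints 8388608 for 128k blocks and 268435456 for 4k blocks, so going from the default to the smallest block size means about 32 times as many blocks to track.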

The full set of options can be found in the duperemove manual.
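As an illustration of the --hashfile and --fdupes options described above, here is a minimal sketch of two wrapper functions. The function names and any paths passed to them are hypothetical. The flags themselves are the ones listed in the table, and fdupes -r is assumed to be installed for the second function.

```shell
#!/bin/sh
# Sketch only: both function names are hypothetical.

# Dedupe the given paths, keeping hashes in a reusable hashfile.
# Usage: dedupe_with_hashfile <hashfile> <path>...
dedupe_with_hashfile() {
    hashfile=$1
    shift
    duperemove -d -h -r --hashfile="$hashfile" "$@"
}

# Let fdupes find duplicate files under a directory and pipe the
# list to duperemove, which then skips its own hashing phase.
# Usage: dedupe_from_fdupes <directory>
dedupe_from_fdupes() {
    fdupes -r "$1" | duperemove --fdupes
}
```

Calling dedupe_with_hashfile again later with the same hashfile reuses the stored hashes, which is what speeds up subsequent runs.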