Blog/Duperemove

From Forza's ramblings

2021-07-13: Duperemove with Btrfs

I've just updated my Deduplication guide to cover Duperemove. It is a tool that deduplicates blocks of data between files on Linux filesystems that support the kernel's FIDEDUPERANGE ioctl (previously the Btrfs-specific extent-same ioctl). Currently Btrfs and XFS support this.

Head over to the articles at https://wiki.tnonline.net/w/Btrfs/Deduplication and https://wiki.tnonline.net/w/Btrfs/Deduplication/Duperemove

The following example is what I often use myself, followed by the path(s) to deduplicate:

chrt -i 0 duperemove -A -h -d -r -v -b4k --dedupe-options=noblock,same --lookup-extents=yes --io-threads=1

Adjust the block size (the -b option, 4k in the example above) to suit your needs. A good starting point is 64k to 128k.
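As a rough sketch of how the block size and target can be varied, the wrapper below only prints the command line instead of running it. The function name build_cmd, the variables BLOCK_SIZE and TARGET, and the path /mnt/data are placeholders of mine and not part of duperemove; the flags are the ones from the example above.

```shell
#!/bin/sh
# Placeholder defaults; override them from the environment.
BLOCK_SIZE="${BLOCK_SIZE:-128k}"   # value passed to duperemove's -b option
TARGET="${TARGET:-/mnt/data}"      # hypothetical path to deduplicate

# build_cmd BLOCK_SIZE PATH
# Prints the duperemove command line without executing it.
build_cmd() {
    printf 'chrt -i 0 duperemove -A -h -d -r -v -b%s --dedupe-options=noblock,same --lookup-extents=yes --io-threads=1 %s\n' "$1" "$2"
}

build_cmd "$BLOCK_SIZE" "$TARGET"
```

Once the printed line looks right, run it directly.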

chrt is used to change the scheduling class of a process. It is similar to nice and renice, but changes the scheduling class instead of the priority within a class. I am using chrt --idle 0 so that duperemove only uses spare CPU cycles and doesn't slow down the system too much.
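To check that the idle class is actually in effect, something like the following could be used. It moves the current shell into SCHED_IDLE with chrt -p (child processes inherit the policy) and then reads the policy back. The helper policy_name is a hypothetical name, and its sed pattern assumes the usual "current scheduling policy: ..." wording of chrt -p output.

```shell
#!/bin/sh
# policy_name: extract the policy (e.g. SCHED_IDLE) from `chrt -p` output on stdin.
policy_name() {
    sed -n 's/.*scheduling policy: \([A-Z_]*\).*/\1/p'
}

if command -v chrt >/dev/null 2>&1; then
    # Put this shell into the idle class; the priority argument must be 0.
    chrt --idle -p 0 $$ || echo "could not change scheduling policy" >&2
    # Read the policy back for this shell.
    chrt -p $$ | policy_name
fi
```

Anything started from this shell afterwards, duperemove included, runs in the same class.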

chrt command line options
-o, --other
   Set scheduling policy to SCHED_OTHER (time-sharing scheduling). This is the default Linux scheduling policy.

-f, --fifo
   Set scheduling policy to SCHED_FIFO (first in-first out).

-r, --rr
   Set scheduling policy to SCHED_RR (round-robin scheduling). When no policy is defined, the SCHED_RR is used as the default.

-b, --batch
   Set scheduling policy to SCHED_BATCH (scheduling batch processes). The priority argument has to be set to zero.

-i, --idle
   Set scheduling policy to SCHED_IDLE (scheduling very low priority jobs). The priority argument has to be set to zero.

-d, --deadline
   Set scheduling policy to SCHED_DEADLINE (sporadic task model deadline scheduling).

   The priority argument has to be set to zero. See also --sched-runtime, --sched-deadline and --sched-period. The relation between the options required by the kernel is runtime <= deadline <= period. chrt copies period to deadline if --sched-deadline is not specified and deadline to runtime if --sched-runtime is not specified. It means that at least --sched-period has to be specified. See sched(7) for more details.
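The defaulting rule for SCHED_DEADLINE can be illustrated with a small helper. This is only a sketch of the logic described above, not something chrt provides: resolve_deadline is a hypothetical function name, and the values are example numbers in nanoseconds.

```shell
#!/bin/sh
# resolve_deadline PERIOD [DEADLINE] [RUNTIME]
# Mirrors the defaulting described above: a missing deadline takes the
# period, a missing runtime takes the deadline.
resolve_deadline() {
    period=$1
    deadline=${2:-$period}
    runtime=${3:-$deadline}
    # The kernel requires runtime <= deadline <= period.
    if [ "$runtime" -le "$deadline" ] && [ "$deadline" -le "$period" ]; then
        echo "runtime=$runtime deadline=$deadline period=$period"
    else
        echo "invalid: need runtime <= deadline <= period" >&2
        return 1
    fi
}

resolve_deadline 10000000                   # only the period given
resolve_deadline 10000000 5000000 1000000   # all three given
```

The resolved values correspond to --sched-runtime, --sched-deadline and --sched-period. Setting SCHED_DEADLINE normally requires root privileges.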