Btrfs/Deduplication/Bees
Deduplication with Bees

Bees stands apart from other deduplication tools in that it is designed specifically for Btrfs filesystems. It differs from them in several ways:
- It uses a very lightweight database (hash table) to keep track of deduplication progress. You can stop and restart and it continues from where it left off.
- Memory usage is fixed: it never grows beyond the database size plus a small amount for the bees runtime, whereas most other dedup tools can grow to multiple gigabytes of RAM.
- It runs continuously in the background, deduplicating any newly written data.
- It runs only on whole filesystems, not on a limited set of files.
- Bees can split extents to achieve better deduplication results.
Make sure you have a recent kernel. Some kernel versions have bugs that are triggered by Bees, so please check https://github.com/Zygo/bees/blob/master/docs/btrfs-kernel.md before continuing.
Configuration
First, determine the UUID of the filesystem you want to run Bees on. Use btrfs filesystem show to list all Btrfs filesystems with their UUIDs.
# btrfs filesystem show
Label: 'btrfs-root' uuid: 446d32cb-a6da-45f0-9246-1483ad3420e0
    Total devices 1 FS bytes used 35.04GiB
    devid 1 size 229.47GiB used 79.03GiB path /dev/sda3

Label: '6TB' uuid: fe0a1142-51ab-4181-b635-adbf9f4ea6e6
    Total devices 2 FS bytes used 3.48TiB
    devid 2 size 2.72TiB used 2.21TiB path /dev/sdc2
    devid 3 size 1.82TiB used 1.30TiB path /dev/sdb2

Label: 'btrfs-boot' uuid: 1128e72e-b00f-4c2a-a1e1-afa89f3c11cc
    Total devices 1 FS bytes used 70.55MiB
    devid 1 size 1.00GiB used 256.00MiB path /dev/sda2

Label: 'usb-backup' uuid: df68a30d-d26e-4b9c-9606-a130e66ce63d
    Total devices 1 FS bytes used 581.47GiB
    devid 1 size 927.51GiB used 591.02GiB path /dev/sdd1
We'll use the '6TB' filesystem, with UUID fe0a1142-51ab-4181-b635-adbf9f4ea6e6, in this guide.
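If you would rather not copy the UUID by hand, the lookup can be scripted. The following is a minimal sketch, assuming the label contains no spaces and that each filesystem starts on its own "Label:" line, as btrfs filesystem show prints it. The function name uuid_for_label is hypothetical and not part of btrfs-progs or Bees.

```shell
# Print the UUID of the Btrfs filesystem with the given label.
# Reads `btrfs filesystem show` output on stdin.
# uuid_for_label is a hypothetical helper for illustration only.
uuid_for_label() {
    awk -v want="'$1'" '
        # Header lines look like: Label: <quoted label>  uuid: <uuid>
        $1 == "Label:" && $2 == want {
            for (i = 3; i < NF; i++)
                if ($i == "uuid:") { print $(i + 1); exit }
        }
    '
}

# Usage (as root):
#   btrfs filesystem show | uuid_for_label 6TB
```

The helper only parses text, so it can be tried on saved output before running it against a live system.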
Now create the Bees configuration file /etc/bees/6TB.conf. You can give the file any name, as long as it has the .conf extension.
# /etc/bees/6TB.conf
# UUID of the filesystem
UUID=fe0a1142-51ab-4181-b635-adbf9f4ea6e6
# Specify the bees database size. It has to be a multiple of 128KiB
DB_SIZE=$((256*1024*1024)) # 256MiB in bytes
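The multiple-of-128KiB rule for DB_SIZE is easy to check before starting the daemon. This is a small sketch of such a check; db_size_ok is a hypothetical helper written for this guide, not something shipped with Bees.

```shell
# Succeed if the given size in bytes is a positive multiple of 128KiB.
# db_size_ok is a hypothetical helper for illustration only.
db_size_ok() {
    size=$1
    [ "$size" -gt 0 ] && [ $((size % (128 * 1024))) -eq 0 ]
}

DB_SIZE=$((256*1024*1024)) # 256MiB in bytes, as in the configuration above
if db_size_ok "$DB_SIZE"; then
    echo "DB_SIZE=$DB_SIZE is valid"
else
    echo "DB_SIZE=$DB_SIZE is not a multiple of 128KiB" >&2
fi
```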
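The database size determines how efficient Bees will be on your filesystem: for the same amount of unique data, a larger database lets Bees match smaller extents, as the table below shows. The database has to fit in RAM, so make sure you have enough RAM available to run Bees with the selected database size.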
Unique data size | Database size | Average dedupe extent size |
---|---|---|
1TiB | 4GiB | 4KiB |
1TiB | 1GiB | 16KiB |
1TiB | 256MiB | 64KiB |
1TiB | 128MiB | 128KiB <- recommended |
1TiB | 16MiB | 1024KiB |
64TiB | 1GiB | 1024KiB |
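The rows above follow a simple ratio: dividing the unique data size by the average extent size gives the number of entries, and each row works out to 16 bytes per entry (for example 1TiB / 4KiB = 2^28 entries, and 4GiB / 2^28 = 16 bytes). The sketch below uses that ratio, inferred from the table, to estimate a database size for other data sizes. The function name bees_db_size is hypothetical and not part of Bees.

```shell
# Estimate the database size in bytes for a target average dedupe extent size.
# Assumption: 16 bytes per entry, the ratio implied by the table above.
# bees_db_size is a hypothetical helper for illustration only.
bees_db_size() {
    unique_bytes=$1   # amount of unique data on the filesystem, in bytes
    extent_bytes=$2   # desired average dedupe extent size, in bytes
    echo $(( unique_bytes / extent_bytes * 16 ))
}

TiB=$((1024 * 1024 * 1024 * 1024))
KiB=1024

# 1TiB of unique data at the recommended 128KiB extent size
bees_db_size "$TiB" $((128 * KiB))   # prints 134217728 (128MiB)
```

Treat the result as a starting point and round it to a multiple of 128KiB before putting it in DB_SIZE.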
Running Bees
Once the configuration is set up, start the Bees daemon with beesd <uuid>:
# beesd fe0a1142-51ab-4181-b635-adbf9f4ea6e6
Running Bees inside a screen session or as a service is recommended. Bees comes with a systemd unit file: https://github.com/Zygo/bees/tree/master/scripts
Bees Stats
Bees stores statistics and state in the following files, where uuid stands for the filesystem UUID:
Path | Contents |
---|---|
/run/bees/uuid.status | Current running stats of bees. |
/run/bees/mnt/uuid/.beeshome/beescrawl.dat | Bees crawler stats |
/run/bees/mnt/uuid/.beeshome/beeshash.dat | Bees database |
/run/bees/mnt/uuid/.beeshome/beesstats.txt | Bees statistics, database usage. |
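To look at the running stats for one filesystem, substitute its UUID into the status path from the table. A minimal sketch, assuming the /run/bees/uuid.status layout listed above; bees_status_file is a hypothetical helper, not a Bees command.

```shell
# Print the status file path for a filesystem UUID, following the
# /run/bees/<uuid>.status layout from the table above.
# bees_status_file is a hypothetical helper for illustration only.
bees_status_file() {
    echo "/run/bees/$1.status"
}

UUID=fe0a1142-51ab-4181-b635-adbf9f4ea6e6
status=$(bees_status_file "$UUID")

# Show the current stats if the file exists and is readable
if [ -r "$status" ]; then
    cat "$status"
else
    echo "no status file at $status (is beesd running?)" >&2
fi
```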