Btrfs/Space Cache

From Forza's ramblings

Improving Copy-on-Write performance

Red squirrel (Sciurus vulgaris) searching for cached nuts.

Copy-on-Write (CoW) means that the filesystem never overwrites existing data in place; instead, it writes new data to unused locations. Once the data is written, Btrfs updates the metadata trees to point to the newly written data. If there is a crash or power outage, the filesystem therefore remains intact and no data is left "half-written": the filesystem contains either all the old data or all the new data, depending on when the crash occurred.

Because it needs to quickly find empty areas to write to, Btrfs keeps track of where on disk there is free space for new data. It does this by maintaining a special cache of all the free space on the filesystem. Without this cache, write performance would suffer greatly.

Space Cache and the Free Space Tree (space_cache=v2)

There are two versions of the Space Cache: the original v1, and the newer v2, which is called the free space tree.

On large filesystems (many terabytes) and with certain workloads, the performance of the v1 space cache may degrade drastically. This is why the new v2 implementation was created. It uses a new B-tree called the free space tree.

Starting with btrfs-progs 5.15, the free space tree is the default for all newly created filesystems.

The space cache is a filesystem-wide option; it is not possible to change it per subvolume.
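Before switching anything, it can help to see which variant a mounted filesystem is currently using. The sketch below is only an illustration: it assumes the kernel reports v1 as a bare space_cache option and the free space tree as space_cache=v2 in /proc/mounts, and the function names space_cache_version and mount_options are made up for this example.

```shell
# Print which space cache variant a comma-separated mount-option string selects.
# Assumption: v1 shows up as "space_cache" (or "space_cache=v1"), v2 as "space_cache=v2".
space_cache_version() (
    case ",$1," in
        *,space_cache=v2,*) echo v2 ;;
        *,space_cache=v1,*|*,space_cache,*) echo v1 ;;
        *) echo none ;;
    esac
)

# Print the mount options of a mounted Btrfs filesystem (4th field of /proc/mounts).
# $1 is the mount point, e.g. /mnt/btrfs.
mount_options() (
    awk -v mp="$1" '$2 == mp && $3 == "btrfs" { print $4 }' /proc/mounts
)
```

For example, space_cache_version "$(mount_options /mnt/btrfs)" would print v1, v2 or none for the filesystem mounted at /mnt/btrfs.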

How to switch to the Free Space Tree

It is possible to change from Space Cache (v1) to the Free Space Tree (v2).

In order to switch, you have to unmount the filesystem, remove the previous space cache, then mount it with the space_cache=v2 mount option. This will be a permanent change and Btrfs will use the new Free Space Tree on all future mounts once enabled.

Removing the old v1 space cache is done with btrfs check.

# umount /mnt/btrfs 
# btrfs check --clear-space-cache v1 /dev/device
# mount /dev/device /mnt/btrfs -o space_cache=v2

On a filesystem that spans multiple disks, it is enough to run btrfs check --clear-space-cache v1 /dev/device on one of the disks.

IMPORTANT! On very large filesystems, the first mount after changing the Space Cache can take a long time. It usually takes several minutes, but there are reports of an hour or more in extreme cases with massive filesystems.
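The three steps for switching from v1 to v2 can also be wrapped in a small helper. This is a minimal sketch, not a tested tool: the function name switch_to_v2 and the DRY_RUN variable are inventions for this example, and by default the helper only prints the commands instead of running them.

```shell
# Sketch: convert a Btrfs filesystem from space cache v1 to the free space tree.
# $1 = device, $2 = mount point (both placeholders for your own values).
# DRY_RUN defaults to 1 (print only); set DRY_RUN=0 and run as root to execute.
switch_to_v2() (
    dev=$1
    mnt=$2
    run() {
        if [ "${DRY_RUN:-1}" = 1 ]; then
            echo "+ $*"      # show the command without running it
        else
            "$@"             # actually run the command
        fi
    }
    # Each step only runs if the previous one succeeded.
    run umount "$mnt" &&
    run btrfs check --clear-space-cache v1 "$dev" &&
    run mount -o space_cache=v2 "$dev" "$mnt"
)
```

Calling switch_to_v2 /dev/device /mnt/btrfs first as a dry run lets you check the device and mount point before anything is unmounted.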

If you want to change your root filesystem, you have to boot from a Live USB stick so that the root filesystem is not in use and the above process can be applied to it. The Fedora Workstation Live DVD is a good choice since it has up-to-date Linux kernels and btrfs-progs.

Note: It is possible to change the kernel rootflags in the boot loader to rw,space_cache=v2, but it could leave you with an unbootable system if there is a problem or mistake.

Switching back from the Free Space Tree to the Space Cache can be done in the same way as before:

# umount /mnt/btrfs 
# btrfs check --clear-space-cache v2 /dev/device
# mount /dev/device /mnt/btrfs -o space_cache=v1
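After switching in either direction, you may want to confirm what is recorded on disk. One way is to look at the superblock with btrfs inspect-internal dump-super. The sketch below assumes that the free space tree shows up as a FREE_SPACE_TREE flag in that output; the helper name has_free_space_tree is made up for this example.

```shell
# Read `btrfs inspect-internal dump-super` output on stdin and report whether
# a FREE_SPACE_TREE flag is listed (assumed to appear under compat_ro_flags).
has_free_space_tree() {
    if grep -q 'FREE_SPACE_TREE'; then
        echo "free space tree: enabled"
    else
        echo "free space tree: not enabled"
    fi
}

# Usage (as root, the device path is a placeholder):
#   btrfs inspect-internal dump-super /dev/device | has_free_space_tree
```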

How to enable the Free Space Tree at mkfs time

With btrfs-progs 5.15 and newer, the Free Space Tree is the default. There is no need to specify any special options to take advantage of the new cache.

With older btrfs-progs, from 5.7 onward, the -R free-space-tree option enables the Free Space Tree for a new filesystem at mkfs time. With this option, the kernel will automatically use the Free Space Tree without needing any mount options. There is also no need to clear any old Space Cache since it never gets created.

# mkfs.btrfs -R free-space-tree -L my-btrfs /dev/vdb1
btrfs-progs v5.13.1 
See http://btrfs.wiki.kernel.org for more information.

Label:              my-btrfs
UUID:               80014f0a-dd1d-4f09-ab34-7adeb27585df
Node size:          16384
Sector size:        4096
Filesystem size:    512.00GiB
Block group profiles:
  Data:             single            8.00MiB
  Metadata:         DUP               1.00GiB
  System:           DUP               8.00MiB
SSD detected:       no
Zoned device:       no
Incompat features:  extref, skinny-metadata
Runtime features:   free-space-tree 
Checksum:           crc32c
Number of devices:  1
Devices:
   ID        SIZE  PATH
    1   512.00GiB  /dev/vdb1
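The version rules above can be expressed as a small helper that tells you whether -R free-space-tree is needed for a given btrfs-progs version. This is only a sketch: the function names version_ge and fst_mkfs_option are made up for this example, it relies on sort -V for version comparison (available in GNU coreutils), and you have to pass in the version number yourself.

```shell
# True if version $1 is greater than or equal to version $2.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Print the mkfs.btrfs option needed to get the Free Space Tree
# for a given btrfs-progs version (e.g. 5.13.1).
fst_mkfs_option() {
    if version_ge "$1" 5.15; then
        echo ""                       # already the default, nothing to add
    elif version_ge "$1" 5.7; then
        echo "-R free-space-tree"     # has to be requested explicitly
    else
        echo "btrfs-progs $1: no mkfs option known, convert after mkfs instead" >&2
        return 1
    fi
}
```

With btrfs-progs 5.13.1, as in the output above, fst_mkfs_option 5.13.1 prints -R free-space-tree; with 5.15 or newer it prints an empty line.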