Btrfs/Allocator Hints

From Forza's ramblings

Allocator Hints for Btrfs

Allocator hints were introduced in a series of patches for Btrfs, allowing users to configure the chunk allocator to prioritise specific devices for metadata or data allocation. The idea is to optimise mixed-device setups, such as SSDs and HDDs in a single filesystem, by using faster disks for the latency-sensitive metadata while storing the bulk data on slower devices.

Various versions of the patches have been submitted to the Linux Btrfs mailing list over the years, for example the preferred_metadata patches in 2020 and the allocation hint patches in 2022 by Goffredo Baroncelli. Since then, the user kakra has collected and maintained these patches, ensuring they work with newer kernels while adding fixes and enhancements. Kakra's work is available on GitHub: Allocator Hint Patch Pull Request.

Benefits of Allocator Hints

The allocator hints patch provides several advantages:

  1. Performance Improvements: By allocating metadata to SSD or NVMe devices, users can greatly improve filesystem responsiveness, as metadata operations are typically small and random.
  2. Efficient Storage Usage: Allocating data to HDDs ensures that their large capacity is utilised effectively.
  3. Compatibility: Filesystems remain compatible with non-patched kernels, though allocation preferences will not be respected without the patch.

This feature is ideal for setups that combine NVMe devices with SSDs, or SSDs with HDDs, offering a way to balance performance and cost-efficiency.

How to Enable Allocator Hints

To use allocator hints, you need a patched kernel. Follow these steps:

Applying the Patch to Kernel Sources

Compiling your own kernel is not very difficult. However, each Linux distribution has its own method for building kernels. The steps below are only the generic ones, so please look up how it is done for your specific distribution.

# patch -p1 < ../btrfs_allocator_hints-6.12_v1.patch
patching file include/uapi/linux/btrfs_tree.h
patching file fs/btrfs/sysfs.c
patching file fs/btrfs/sysfs.c
patching file fs/btrfs/volumes.c
patching file fs/btrfs/volumes.h
patching file fs/btrfs/volumes.c
patching file fs/btrfs/volumes.h
patching file fs/btrfs/volumes.c
patching file include/uapi/linux/btrfs_tree.h
  • Rebuild and install the kernel.
  • Reboot into the patched kernel.

Once running the patched kernel, use the allocator hints as described below.
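
Before setting any hints, it can help to confirm that the running kernel really carries the patch. The following is a minimal sketch, assuming that only a patched kernel exposes the per-device type file in sysfs. The function name has_allocator_hints and the optional root argument are hypothetical and exist only for illustration.

```shell
# Hypothetical helper: succeed if at least one mounted Btrfs filesystem
# exposes the per-device "type" file that the allocator hint patch adds.
# Assumption: unpatched kernels do not provide devinfo/<id>/type.
# The optional first argument overrides the sysfs root (useful for testing).
has_allocator_hints() {
    root="${1:-/sys/fs/btrfs}"
    for f in "$root"/*/devinfo/*/type; do
        # The glob stays unexpanded when nothing matches, so test the file.
        if [ -e "$f" ]; then
            return 0
        fi
    done
    return 1
}
```

A possible use is `has_allocator_hints && echo "allocator hints available"`. Note that a Btrfs filesystem must be mounted for any devinfo directory to exist.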

Configuring Allocator Hints

Every Btrfs filesystem has a unique UUID and each device in the filesystem is identified by a device ID number.

Setting an allocation hint is done by writing the allocation hint type for each device to its type file, /sys/fs/btrfs/<uuid>/devinfo/<id>/type.

There are 6 different hints (types 0-5) that can be set.

Allocator Hint Types

| Type | Description | Recommended Use |
|------|-------------|-----------------|
| 0 | Prefer writing data to this device. Btrfs will prioritise allocating data chunks from this device before considering others. | Recommended for HDDs. This is the default setting. |
| 1 | Prefer writing metadata to this device. Btrfs will prioritise allocating metadata chunks from this device before considering others. | Recommended for SSDs. |
| 2 | Write metadata only to this device. | Not recommended; can lead to early no-space situations. |
| 3 | Write data only to this device. | Not recommended; can lead to early no-space situations. |
| 4 | Avoid allocating new chunks to this device. Useful if planning to remove the device from the filesystem in the future. | Use for devices you plan to decommission or remove. |
| 5 | Prevent allocating new chunks to this device. Useful if you plan on removing multiple devices from the pool in parallel. | Use for devices you plan to decommission or remove. |

Note: Types 0 and 1 set a preference, meaning Btrfs will prioritise these devices but can still allocate chunks to others if needed. Types 2 and 3 are exclusive: a type 2 device accepts only metadata chunks and a type 3 device accepts only data chunks, which can lead to early no-space (ENOSPC) errors.
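
When scripting around these values, a small lookup keeps the numbers readable. This is only a sketch: the function name hint_name is hypothetical, and the labels paraphrase the table above rather than quoting identifiers from the patch.

```shell
# Hypothetical helper: translate a hint type number (0-5) into a short label.
# The labels are paraphrased from the hint type table, not taken from the patch.
hint_name() {
    case "$1" in
        0) echo "prefer data" ;;
        1) echo "prefer metadata" ;;
        2) echo "metadata only" ;;
        3) echo "data only" ;;
        4) echo "avoid new chunks" ;;
        5) echo "prevent new chunks" ;;
        *) echo "unknown hint type: $1" >&2; return 1 ;;
    esac
}
```

For example, `hint_name 1` prints the label for the metadata preference, and any value outside 0-5 returns a non-zero status.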

Identify your device IDs by running btrfs filesystem show:

# btrfs filesystem show /media/backup
Label: '3t-backup'  uuid: aa358efb-ce43-498c-9997-0d35ba13261f
        Total devices 3 FS bytes used 1.68TiB
        devid    1 size 2.72TiB used 1.97TiB path /dev/mapper/3t_backup
        devid    2 size 50.00GiB used 38.03GiB path /dev/mapper/vg_800g-3TB_meta1
        devid    3 size 50.00GiB used 38.03GiB path /dev/mapper/vg_800g-3TB_meta2

Note the ID of each device. In this example, ID 1 is an HDD and IDs 2 and 3 are SSDs.

To set data preference on the HDD and metadata preference on the SSDs, write 0 and 1 to the corresponding sysfs files:

echo 0 > /sys/fs/btrfs/aa358efb-ce43-498c-9997-0d35ba13261f/devinfo/1/type
echo 1 > /sys/fs/btrfs/aa358efb-ce43-498c-9997-0d35ba13261f/devinfo/2/type
echo 1 > /sys/fs/btrfs/aa358efb-ce43-498c-9997-0d35ba13261f/devinfo/3/type

Setting or changing the allocation hint only affects the allocation of new chunks. Existing chunks have to be balanced to take advantage of the new hints.
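
The echo commands above can be wrapped in a small function that rejects values outside 0-5 and checks that the type file exists before writing. This is a sketch under stated assumptions: set_hint and the SYSFS_BTRFS override are hypothetical names introduced for illustration, and the real path remains /sys/fs/btrfs.

```shell
# Hypothetical wrapper around "echo <type> > .../devinfo/<id>/type".
# Arguments: <uuid> <devid> <hint type 0-5>
# SYSFS_BTRFS is an assumed override for testing; it defaults to /sys/fs/btrfs.
set_hint() {
    uuid="$1" devid="$2" hint="$3"
    file="${SYSFS_BTRFS:-/sys/fs/btrfs}/$uuid/devinfo/$devid/type"
    case "$hint" in
        [0-5]) ;;
        *) echo "invalid hint type: $hint (expected 0-5)" >&2; return 1 ;;
    esac
    if [ ! -e "$file" ]; then
        echo "no such file: $file (unpatched kernel or wrong UUID/devid?)" >&2
        return 1
    fi
    # Writing requires root, just like the plain echo commands.
    echo "$hint" > "$file"
}
```

With the example filesystem, `set_hint aa358efb-ce43-498c-9997-0d35ba13261f 2 1` would be equivalent to the second echo command.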

Use grep to list the current configuration for all devices:

# grep . /sys/fs/btrfs/*/devinfo/*/type
/sys/fs/btrfs/aa358efb-ce43-498c-9997-0d35ba13261f/devinfo/1/type:0x00000000
/sys/fs/btrfs/aa358efb-ce43-498c-9997-0d35ba13261f/devinfo/2/type:0x00000001
/sys/fs/btrfs/aa358efb-ce43-498c-9997-0d35ba13261f/devinfo/3/type:0x00000001
/sys/fs/btrfs/c08bb98b-3b98-4dbb-a7c0-5540c2af781b/devinfo/1/type:0x00000000
/sys/fs/btrfs/c3c00bf0-73a6-4aca-91bb-b5e32e76a08c/devinfo/1/type:0x00000000
/sys/fs/btrfs/c3c00bf0-73a6-4aca-91bb-b5e32e76a08c/devinfo/2/type:0x00000001
/sys/fs/btrfs/c3c00bf0-73a6-4aca-91bb-b5e32e76a08c/devinfo/3/type:0x00000001
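
The raw grep output can be condensed into one short line per device. The sketch below assumes the path:value format shown above. The function name decode_hints is hypothetical.

```shell
# Hypothetical formatter: reads lines such as
#   /sys/fs/btrfs/<uuid>/devinfo/<id>/type:0x00000001
# on stdin and prints "<uuid> devid <id> type <n>" with the value in decimal.
decode_hints() {
    while IFS=: read -r path value; do
        # Skip anything that is not a devinfo type entry.
        case "$path" in
            */devinfo/*/type) ;;
            *) continue ;;
        esac
        devid="${path%/type}"; devid="${devid##*/}"
        uuid="${path%/devinfo/*}"; uuid="${uuid##*/}"
        # printf converts the hexadecimal value to decimal.
        printf '%s devid %s type %d\n' "$uuid" "$devid" "$value"
    done
}
```

It is meant to be used as a filter: `grep . /sys/fs/btrfs/*/devinfo/*/type | decode_hints`.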

Balancing After Changing Preferences

After setting the hints, run a balance to apply the changes. If you added an SSD or NVMe device with metadata preference, you need to run a balance on the metadata chunks so that they are moved to the new device.

btrfs balance start -musage=100 /path/to/btrfs

You can see the distribution of data and metadata using btrfs filesystem usage -T:

# btrfs fi usage -T /media/backup/ 
Overall:
    Device size:                   2.82TiB
    Device allocated:              2.04TiB
    Device unallocated:          801.63GiB
    Device missing:                  0.00B
    Device slack:                    0.00B
    Used:                          1.71TiB
    Free (estimated):              1.09TiB      (min: 718.16GiB)
    Free (statfs, df):             1.09TiB
    Data ratio:                       1.00
    Metadata ratio:                   2.00
    Global reserve:              512.00MiB      (used: 0.00B)
    Multiple profiles:                  no
                                 Data    Metadata System
Id Path                          single  RAID1    RAID1     Unallocated Total    Slack
-- ----------------------------- ------- -------- --------- ----------- -------- -----
 1 /dev/mapper/3t_backup         1.97TiB        -         -   777.70GiB  2.72TiB     -
 2 /dev/mapper/vg_800g-3TB_meta1       - 38.00GiB  32.00MiB    11.97GiB 50.00GiB     -
 3 /dev/mapper/vg_800g-3TB_meta2       - 38.00GiB  32.00MiB    11.97GiB 50.00GiB     -
-- ----------------------------- ------- -------- --------- ----------- -------- -----
   Total                         1.97TiB 38.00GiB  32.00MiB   801.63GiB  2.82TiB 0.00B
   Used                          1.66TiB 25.63GiB 384.00KiB

Important Considerations

One reason these patches have not been included in the kernel is that the free space calculations do not work properly with them. It is therefore important to monitor the allocation of data and metadata with btrfs device usage rather than relying on df.

Avoid using types 2 (metadata only) or 3 (data only) unless absolutely necessary, as they can lead to early no-space (ENOSPC) errors. Make sure that you monitor the allocation extra closely.
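
For regular monitoring, the per-device unallocated space can be pulled out of the btrfs filesystem usage -T table. This is a rough sketch that assumes the table layout from the earlier example, where device rows start with a numeric Id and the last three columns are Unallocated, Total and Slack. Other btrfs-progs versions may print a different layout. The function name device_unallocated is hypothetical.

```shell
# Hypothetical helper: read "btrfs filesystem usage -T" output on stdin and
# print "<id> <path> <unallocated>" for every device row.
# Assumption: the last three columns are Unallocated, Total and Slack.
device_unallocated() {
    awk '$1 ~ /^[0-9]+$/ && NF >= 5 { print $1, $2, $(NF-2) }'
}
```

It is meant to be used as a filter: `btrfs filesystem usage -T /media/backup | device_unallocated`. A metadata device whose unallocated space approaches zero is an early warning of a no-space situation.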

Example Use Case

Consider a mixed pool with one 512GB SSD and two 4TB HDDs:

  1. Set the SSD to prefer metadata (echo 1).
  2. Set the HDDs to prefer data (echo 0).
  3. Run a balance to optimise allocation.

This configuration ensures fast metadata access while maximising the storage capacity of the HDDs.
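
The three steps above can be sketched as a small planner that prints the commands instead of running them, so they can be reviewed first. The function name plan_hints and its argument order are hypothetical. The UUID, mount point and device IDs must come from your own btrfs filesystem show output.

```shell
# Hypothetical planner: print the commands for one filesystem without
# executing anything.
# Arguments: <uuid> <mountpoint> <metadata devids> <data devids>
plan_hints() {
    uuid="$1" mnt="$2" meta_ids="$3" data_ids="$4"
    base="/sys/fs/btrfs/$uuid/devinfo"
    # Step 1: devices that should prefer metadata (the SSD).
    for id in $meta_ids; do
        echo "echo 1 > $base/$id/type"
    done
    # Step 2: devices that should prefer data (the HDDs).
    for id in $data_ids; do
        echo "echo 0 > $base/$id/type"
    done
    # Step 3: rebalance metadata so existing chunks follow the new hints.
    echo "btrfs balance start -musage=100 $mnt"
}
```

For a pool where the SSD has device ID 1 and the HDDs have IDs 2 and 3, `plan_hints <uuid> /mnt/pool "1" "2 3"` prints one echo line per device followed by the balance command.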

IMPORTANT! Always have backups of your data.

Conclusion

Allocator hints provide a powerful way to optimise performance and storage in Btrfs. By leveraging small but fast devices for metadata and larger but slower devices for data, users can achieve a balance of speed and capacity. As with any advanced feature, careful planning and monitoring are essential to avoid pitfalls like no-space errors. With proper use, allocator hints can significantly enhance the performance and flexibility of your Btrfs setup.

There are other advanced use-cases with a mix of NVMe, SSD and HDD devices, combined with bcache or dm-cache. Some are mentioned on kakra's GitHub page. There is also an interesting discussion on implementation details and requirements on https://github.com/btrfs/btrfs-todo/issues/19.

In the earlier example, /dev/mapper/3t_backup is actually a dm-cache setup of an HDD with an SSD cache. You can read more about dm-cache on Blog/dm-cache: Linux Accelerated Storage and Linux/dm-cache.