Btrfs/Balance

From Forza's ramblings


Btrfs Balance


Btrfs-balance is a tool to manage and maintain a Btrfs filesystem.

Its main purposes are:

  • converting between profiles (RAID modes).
  • distributing data evenly across devices when devices have been added or removed.
  • maintaining unallocated disk space.

It is important to learn about why and when to use Balance in order to keep the filesystem healthy.

The Btrfs allocator

Btrfs manages disk space in a multi-stage allocation process. In the initial stage, Btrfs allocates large regions of space called chunks tailored for specific data types. Subsequently, Btrfs further divides these chunks into smaller page-sized blocks, resembling the traditional allocation within a filesystem. On AMD/Intel systems, the page size is 4096 bytes. Other systems may have larger page sizes.

Within the allocated chunks, Btrfs uses a unit called extents to store file data. Extents consist of one or several blocks and serve as a fundamental unit for managing file data. Maximum extent size in Btrfs is 128MiB.

Chunks are stored together as block groups. Depending on the profile used, one or several chunks are allocated together in a block group.

The types of block groups are:

Type         Description
DATA         Stores normal user file data.
METADATA     Stores internal metadata. Small files can also be stored inline.
SYSTEM       Stores the mapping between physical devices and the logical space representing the filesystem.
UNALLOCATED  Any unallocated space in the filesystem.

NOTE: Only the type of data that the chunk is allocated for can be stored in that block group.

With some usage patterns, the ratio between the various chunk types can become skewed, which in turn can lead to out-of-disk-space (ENOSPC) errors if left unchecked. This happens if Btrfs needs to allocate a new block group, but there is not enough unallocated disk space available.

btrfs balance is used to re-arrange and compact chunks, freeing up UNALLOCATED disk space. The unallocated space will then automatically be re-purposed as DATA, METADATA or SYSTEM chunks as needed during normal usage of the filesystem.

How to see actual disk usage (don't trust 'df')

In most cases the normal df tool can be used to see the available disk space of a filesystem. It usually gives a good estimate of available disk space.

# df -h /
 Filesystem  Size  Used  Avail  Use% Mounted on
 /dev/sdb1   32G   2.2G  29G    8%   /

Unfortunately, because of Btrfs's two-stage allocator, df may not always be accurate. For example, it cannot tell how much unallocated disk space is available.

To see details on Btrfs disk usage, you need to use btrfs filesystem usage. This shows how each chunk type is allocated and how much unallocated space is available.

# btrfs fi us /
Overall:
    Device size:                  32.00GiB
    Device allocated:              4.52GiB
    Device unallocated:           27.48GiB
    Device missing:                  0.00B
    Used:                          2.17GiB
    Free (estimated):             28.08GiB      (min: 14.34GiB)
    Data ratio:                       1.00
    Metadata ratio:                   2.00
    Global reserve:               16.03MiB      (used: 0.00B)
    Multiple profiles:                  no

Data,single: Size:2.01GiB, Used:1.41GiB (70.04%)
   /dev/sdb1       2.01GiB

Metadata,DUP: Size:1.25GiB, Used:392.84MiB (30.69%)
   /dev/sdb1       2.50GiB

System,DUP: Size:8.00MiB, Used:16.00KiB (0.20%)
   /dev/sdb1      16.00MiB

Unallocated:
   /dev/sdb1      27.48GiB

As you can see, there is 27GiB of unallocated space, while df shows 29GiB available. The df figure can be calculated as unused data size (2.01-1.41) + unused metadata size (1.25-0.4) + unallocated size (27.48), which is ~29GiB. It is important to understand that this figure does not take into account that further metadata chunks will most likely be needed as the filesystem fills up, lessening the available disk space for data even more. Different Btrfs profiles with multiple devices further complicate the calculation.
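The arithmetic above can be reproduced with a small shell helper. This is only a sketch: the function name estimate_free is made up for illustration, the values are copied by hand from the btrfs fi us output above (in GiB), and the result ignores future metadata allocation in the same way df does.

```shell
# estimate_free DATA_SIZE DATA_USED META_SIZE META_USED UNALLOCATED
# All arguments are in GiB. Prints the rough "available" figure in GiB.
# Hypothetical helper, not part of btrfs-progs.
estimate_free() {
    awk -v ds="$1" -v du="$2" -v ms="$3" -v mu="$4" -v un="$5" \
        'BEGIN { printf "%.2f\n", (ds - du) + (ms - mu) + un }'
}

# Values from the example output above; prints 28.93
estimate_free 2.01 1.41 1.25 0.40 27.48
```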

How much metadata is needed varies greatly depending on how you use the filesystem. Lots of small files, snapshots, compression, and file fragmentation use more metadata space than a few large files. This is why it is difficult to estimate exactly how much available space there is in the filesystem.

Normally, Btrfs manages the allocation of data and metadata chunks without the need for user intervention. However, with some usage patterns, the filesystem can end up with too little unallocated space so that Btrfs cannot allocate additional data or metadata chunks.

If Btrfs cannot allocate additional data chunks, you simply get an out-of-disk-space error when trying to write files. However, if Btrfs cannot allocate additional metadata chunks, the result is an ENOSPC error, and Btrfs will turn the filesystem read-only to protect itself from corruption and data loss.

To avoid this situation, you can run btrfs balance regularly to compact under-used data block groups and free up unallocated space.

TIP: Balancing data block groups creates more contiguous free space, which can improve write speeds. It can be viewed as defragmenting the available free disk space.

It is a good practice to monitor your disk usage using btrfs filesystem usage and run balance as needed.
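One way to monitor this is to read the "Device unallocated" line from btrfs filesystem usage in raw byte mode (-b). The following is a minimal sketch: the function names and the 5 GiB limit are illustrative assumptions, and the parsing assumes the "Overall:" layout shown earlier on this page.

```shell
# unallocated_bytes: read `btrfs filesystem usage -b` output on stdin and
# print the "Device unallocated" value in bytes.
unallocated_bytes() {
    awk -F: '/Device unallocated:/ { gsub(/[ \t]/, "", $2); print $2; exit }'
}

# needs_balance MIN_BYTES: succeed (exit 0) when the unallocated space read
# from stdin is below MIN_BYTES.
needs_balance() {
    min=$1
    bytes=$(unallocated_bytes)
    [ -n "$bytes" ] && [ "$bytes" -lt "$min" ]
}

# Example (needs root): warn when less than 5 GiB is unallocated on /.
# The 5 GiB limit is an arbitrary illustration, not a recommendation.
# btrfs filesystem usage -b / | needs_balance $((5 * 1024 * 1024 * 1024)) \
#     && echo "low unallocated space, consider a data balance"
```

Keeping the parser separate from the threshold check makes it possible to try the functions on saved output before pointing them at a live filesystem.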

WARNING!

WARNING: Do not balance metadata chunks regularly as this can increase the risk for ENOSPC errors. It is only recommended to run a metadata balance when converting between RAID profiles or when changing the number of devices in the filesystem.

It is good to have plenty of free space inside metadata chunks. The filesystem uses the metadata space in all its normal operations, and when available metadata space runs out, Btrfs will try to allocate new metadata chunks. However, if there is no Unallocated space available when Btrfs needs to allocate additional metadata chunks, the filesystem will turn read-only and will require manual intervention to recover.

Btrfs balance

Usage

# btrfs balance start --help
usage: btrfs balance start [options] <path>

    Balance chunks across the devices

    Balance and/or convert (change allocation profile of) chunks that
    passed all filters in a comma-separated list of filters for a
    particular chunk type.  If filter list is not given balance all
    chunks of that type.  In case none of the -d, -m or -s options is
    given balance all chunks in a filesystem. This is potentially
    long operation and the user is warned before this start, with
    a delay to stop it.

    -d[filters]    act on data chunks
    -m[filters]    act on metadata chunks
    -s[filters]    act on system chunks (only under -f)
    -f             force a reduction of metadata integrity
    --full-balance do not print warning and do not delay start
    --background|--bg
                   run the balance as a background process
    --enqueue      wait if there's another exclusive operation running,
                   otherwise continue
    -v|--verbose   deprecated, alias for global -v option

    Global options:
    -v|--verbose       increase output verbosity
    -q|--quiet         print only errors

Full man page of btrfs-balance is available at https://btrfs.readthedocs.io/en/latest/btrfs-balance.html

Running Balance

Running btrfs balance start without any filters rewrites every data and metadata chunk in the filesystem. Usually, this is not what we want. Instead, use balance filters to limit which chunks should be balanced.

Using -dusage=5, we limit balance to compacting data block groups that are less than 5% full. This is a good start, and we can increase it to 10-15% or more if needed. A small (less than 100GiB) filesystem may need a higher number. The goal here is to make sure there is enough Unallocated space on each device in the filesystem to avoid the ENOSPC situation.

# btrfs balance start -dusage=5 /
Done, had to relocate 1 out of 68 chunks

Before balance:

# btrfs fi us -T /
Overall:
    Device size:                 229.47GiB
    Device allocated:             74.06GiB
    Device unallocated:          155.41GiB
    Device missing:                  0.00B
    Used:                         57.10GiB
    Free (estimated):            162.65GiB      (min: 84.94GiB)
    Free (statfs, df):           162.65GiB
    Data ratio:                       1.00
    Metadata ratio:                   2.00
    Global reserve:              233.92MiB      (used: 0.00B)
    Multiple profiles:                  no
 
             Data     Metadata System
Id Path      single   DUP      DUP      Unallocated
-- --------- -------- -------- -------- -----------
 1 /dev/sda3 60.00GiB 14.00GiB 64.00MiB   159.41GiB
-- --------- -------- -------- -------- -----------
   Total     60.00GiB  7.00GiB 32.00MiB   159.41GiB
   Used      52.76GiB  2.17GiB 16.00KiB

After balance:

# btrfs fi us -T /
Overall:
    Device size:                 229.47GiB
    Device allocated:             73.06GiB
    Device unallocated:          156.41GiB
    Device missing:                  0.00B
    Used:                         57.01GiB
    Free (estimated):            162.72GiB      (min: 84.52GiB)
    Free (statfs, df):           162.72GiB
    Data ratio:                       1.00
    Metadata ratio:                   2.00
    Global reserve:              233.92MiB      (used: 0.00B)
    Multiple profiles:                  no
 
             Data     Metadata System
Id Path      single   DUP      DUP      Unallocated
-- --------- -------- -------- -------- -----------
 1 /dev/sda3 59.00GiB 14.00GiB 64.00MiB   160.41GiB
-- --------- -------- -------- -------- -----------
   Total     59.00GiB  7.00GiB 32.00MiB   160.41GiB
   Used      52.68GiB  2.16GiB 16.00KiB

We can see that we freed up 1GiB of Unallocated disk space by compacting the data chunks. There are now 59 instead of 60 data chunks to hold 52.68GiB.

NOTE: A data or metadata chunk is usually 1GiB. On very small filesystems, and in some special situations, they can be smaller.
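The advice to start with a low usage value and raise it only if needed can be sketched as a small loop. The function name balance_steps and the BTRFS override variable are assumptions made for this example (they are not part of btrfs-progs); the override only exists so the loop can be tried as a dry run first.

```shell
# balance_steps MOUNTPOINT USAGE...
# Run a data balance once per usage threshold, in the order given, stopping
# at the first failure. BTRFS can be overridden (e.g. BTRFS=echo) for a dry
# run; that variable is an illustration, not a btrfs-progs feature.
balance_steps() {
    mountpoint=$1
    shift
    for usage in "$@"; do
        ${BTRFS:-btrfs} balance start -dusage="$usage" "$mountpoint" || return 1
    done
}

# Dry run: print the commands instead of executing them.
BTRFS=echo balance_steps / 5 10 15
```

Running the low thresholds first keeps each pass cheap, since nearly empty block groups are relocated before the fuller ones are considered.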

Scheduling Balance

If you have a lot of writes and changes to your filesystem, it may be a good idea to schedule a balance job once a week or so. You can use cron (as in the example below) or systemd timers to do the same.

Example crontab that runs balance at 3 AM every Saturday (day of week 6):

# FILE: /etc/cron.d/btrfs-balance

# For details see `man 5 crontab`
# job definition:
# .---------------- minute (0 - 59)
# |  .------------- hour (0 - 23)
# |  |  .---------- day of month (1 - 31)
# |  |  |  .------- month (1 - 12) OR jan,feb,mar,apr,...
# |  |  |  |  .---- day of week (0 - 6) (Sunday=0 or 7) OR sun,mon,tue,wed,thu,fri,sat
# |  |  |  |  | 
# *  *  *  *  * user-name  command to be executed
  0  3  *  *  6 root       btrfs balance start -dusage=10,limit=1 /mnt/some/mountpoint >/dev/null 2>&1

Automatic Balance

Since Linux kernel 5.19 there is a sysfs knob to enable automatic block group reclaim. This is essentially the kernel automatically balancing individual block groups as they fall under a certain threshold.

By default, only completely empty block groups are reclaimed into free space automatically. With the sysfs knob bg_reclaim_threshold, it is now possible to set a threshold other than 0. The full sysfs path is /sys/fs/btrfs/<FSID>/allocation/<PROFILE>/bg_reclaim_threshold.

  • FSID = filesystem UUID. Use btrfs filesystem show to list current filesystems.
  • PROFILE = the block group type: data, metadata or system (lowercase in the sysfs path).

Example:

# btrfs filesystem show /mnt/virtiofs/
Label: 'virtio-backing-store'  uuid: c3c00bf0-73a6-4aca-91bb-b5e32e76a08c
        Total devices 1 FS bytes used 9.71GiB
        devid    1 size 50.00GiB used 25.02GiB path /dev/mapper/pool-Btrfs_virtiofs

To see current setting:

# cat /sys/fs/btrfs/c3c00bf0-73a6-4aca-91bb-b5e32e76a08c/allocation/data/bg_reclaim_threshold
0

Use echo to change the setting.

# echo 10 > /sys/fs/btrfs/c3c00bf0-73a6-4aca-91bb-b5e32e76a08c/allocation/data/bg_reclaim_threshold

Here, 10 means a threshold of 10%. The kernel will now consider block groups that fall below this amount of usage for automatic reclaiming.

The kernel will periodically check and issue balance commands as needed. Progress is shown in the kernel log.

# dmesg
[108009.145638] BTRFS info (device sdc2): reclaiming chunk 3966432706560 with 0% used 0% unusable
[108009.154890] BTRFS info (device sdc2): relocating block group 3966432706560 flags data
[108009.553770] BTRFS info (device sdc2): found 2 extents, stage: move data extents
[108012.723075] BTRFS info (device sdc2): found 2 extents, stage: update data pointers
[108014.180739] BTRFS info (device sdc2): reclaiming chunk 3963211481088 with 0% used 0% unusable

NOTE: These settings are not persistent and will reset back to 0 when the filesystem is unmounted or the system is rebooted.
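
Because the setting is lost on unmount, one option is to reapply it from a script that runs after the filesystem is mounted. The following is a minimal sketch of such a helper. The function names and the SYSFS_ROOT variable are assumptions made for this example; only the sysfs path layout is taken from the text above.

```shell
# Root of the Btrfs sysfs tree; overridable only to make the sketch testable.
SYSFS_ROOT=${SYSFS_ROOT:-/sys/fs/btrfs}

# reclaim_path FSID TYPE: print the sysfs file for a block group type
# (data, metadata or system).
reclaim_path() {
    printf '%s/%s/allocation/%s/bg_reclaim_threshold\n' "$SYSFS_ROOT" "$1" "$2"
}

# set_reclaim_threshold FSID TYPE PERCENT: write the threshold (needs root).
set_reclaim_threshold() {
    file=$(reclaim_path "$1" "$2")
    [ -w "$file" ] || { echo "cannot write $file" >&2; return 1; }
    echo "$3" > "$file"
}

# Example, using the UUID from the btrfs filesystem show output above:
# set_reclaim_threshold c3c00bf0-73a6-4aca-91bb-b5e32e76a08c data 10
```

Checking that the file is writable first gives a clear error on kernels that do not expose the knob, or when the script is not run as root.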