Btrfs/Features

From Forza's ramblings

Btrfs

Unofficial Btrfs logotype

Btrfs is a modern filesystem for Linux aimed at implementing advanced features while also focusing on fault tolerance, repair and easy administration. Btrfs can be used as a generic filesystem in most situations.

Originally developed in 2007, Btrfs has evolved steadily and continues to see heavy active development. Its on-disk format has been considered stable since 2013.

Btrfs combines many features traditionally found in md-raid and LVM, and introduces new concepts such as subvolumes and reflinks. This makes it difficult to compare Btrfs with traditional Linux filesystems like ext4.

One very significant feature is that Btrfs keeps checksums for all data, not only metadata. This means it can reliably detect (and, depending on the chosen profile, automatically repair) corruption that would go unnoticed in other filesystems or storage setups.

It is important to understand that Btrfs is quite different from a traditional Linux filesystem because it bridges traditionally distinct storage layers: multiple device management (md RAID), volume management (LVM), data integrity verification (dm-integrity) and self-healing. This adds a great deal of flexibility that is very difficult to achieve across these boundaries with separate tools.

Btrfs is one of the few Linux filesystems with native support for zoned storage. This is useful on host-managed SMR HDDs and on NVMe drives that support Zoned Namespaces (ZNS). Read more about zoned storage at https://zonedstorage.io/

Features

Copy-on-Write

Btrfs uses a technique called Copy-on-Write (CoW) for all writes to the filesystem. CoW means that a write always goes to a new block on the disk instead of overwriting an existing data block. Once the new block has been written, the metadata is updated to point to it. This ensures data integrity in case of a failed write: you have either the original data or the new data. If a write fails in a traditional filesystem, the contents of a data block may instead be incomplete or wrong.
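The write-new-then-switch idea can be imitated in user space. The sketch below is only an analogy and not how Btrfs is implemented: the hypothetical cow_update helper writes the new content to a fresh temporary file and then renames it over the target, so a reader sees either the old content or the new content, never a half-written mix.

```shell
#!/bin/sh
# cow_update FILE NEW_CONTENT
# User-space analogy of copy-on-write (hypothetical helper, not a Btrfs tool).
cow_update() {
    target=$1
    # 1. Write the new data to a fresh location; the old data stays untouched.
    tmp=$(mktemp "$target.XXXXXX") || return 1
    if ! printf '%s\n' "$2" > "$tmp"; then
        rm -f "$tmp"
        return 1
    fi
    # 2. Switch the "pointer" in one step. A rename within the same
    #    filesystem is atomic, so the target is never partially written.
    mv "$tmp" "$target"
}
```

For example, cow_update /tmp/settings.conf "mode=fast" replaces the file in one step. If the script is interrupted before the rename, the original file is still intact.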

As of Linux Kernel 5.0 Btrfs has the following features:

Data checksum and integrity

  • Checksums on all data and metadata (crc32c, xxhash64, sha256 or blake2)
  • Self-healing in some configurations due to the nature of copy-on-write
  • Tree-checker, post-read and pre-write metadata verification
  • Online data scrubbing for finding errors and automatically fixing them for files with redundant copies
  • Offline filesystem check
  • Transparent compression via zlib, LZO and ZSTD, configurable per file or volume
  • Data deduplication using userspace tools
  • Online defragmentation as well as autodefrag mount option
  • In-place conversion from ext3/4 to Btrfs (with rollback)
  • Swap files
  • Block discard (trim support)
  • File cloning (reflink, copy-on-write)
  • Quotas, subvolume-aware
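How checksums combine with redundant copies to allow self-healing can be shown with a toy model. The sketch below works on whole files rather than filesystem blocks, uses SHA-256 (one of the algorithms listed above), and the helper names digest and heal_pair are made up for illustration. It checks two copies against a known-good checksum and overwrites the bad copy with the good one, which is roughly what a scrub does on DUP or RAID1 profiles.

```shell
#!/bin/sh
# Toy model of checksum-based repair with two copies (hypothetical helpers).

# Print the SHA-256 digest of a file.
digest() {
    sha256sum "$1" | cut -d ' ' -f 1
}

# heal_pair EXPECTED_DIGEST COPY_A COPY_B
# Overwrites whichever copy fails verification with the copy that passes.
# Returns 1 if neither copy matches the expected digest.
heal_pair() {
    expected=$1 a=$2 b=$3
    sum_a=$(digest "$a")
    sum_b=$(digest "$b")
    if [ "$sum_a" = "$expected" ] && [ "$sum_b" = "$expected" ]; then
        echo "both copies ok"
    elif [ "$sum_a" = "$expected" ]; then
        cp "$a" "$b"
        echo "repaired copy B from copy A"
    elif [ "$sum_b" = "$expected" ]; then
        cp "$b" "$a"
        echo "repaired copy A from copy B"
    else
        echo "unrecoverable: no copy matches the checksum" >&2
        return 1
    fi
}
```

With only one copy (the SINGLE profile) the same check can still detect the corruption, but there is nothing to repair from.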

Volume management

Volume management in Btrfs is the ability to combine and manage several disks as one filesystem.

  • Data and metadata profiles: SINGLE, DUP, RAID 0, RAID 1, RAID 1c3, RAID 1c4 and RAID 10
  • Subvolumes (one or more separately mountable filesystem roots within each volume)
  • Online volume growth and shrinking
  • Online block device addition and removal
  • Online balancing (moving blocks to balance load and use space more efficiently)
  • Online conversion between data profiles (convert between different RAID levels or RAID<->SINGLE/DUP)
  • Snapshots, writable and read-only
  • Incremental backup
  • Send/receive (saving diffs between snapshots to a binary stream)
  • Seed devices. Create a (read-only) filesystem that acts as a template to seed other Btrfs filesystems. Using copy on write, all modifications are stored on different devices and the original is unchanged.
  • Zoned device support (SMR/ZBC/ZNS friendly allocation)
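As an illustration of online device addition and profile conversion, the sketch below prints the commands that would turn a single-disk filesystem into a two-disk RAID1. It only prints them, since they need root and real block devices. The helper name plan_raid1_conversion, the device /dev/sdb and the mountpoint /mnt are assumptions for the example.

```shell
#!/bin/sh
# plan_raid1_conversion NEW_DEVICE MOUNTPOINT
# Prints (does not run) the steps for converting a mounted single-disk
# filesystem to RAID1. Device and mountpoint are placeholders.
plan_raid1_conversion() {
    new_dev=$1 mnt=$2
    # Add the second device to the mounted filesystem.
    echo "btrfs device add $new_dev $mnt"
    # Rewrite existing data and metadata block groups with the RAID1 profile.
    echo "btrfs balance start -dconvert=raid1 -mconvert=raid1 $mnt"
    # Check the resulting profiles and allocation.
    echo "btrfs filesystem usage $mnt"
}

plan_raid1_conversion /dev/sdb /mnt
```

Review the printed commands before running them as root. The balance step can take a long time on a large filesystem, but the filesystem stays mounted and usable throughout.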

The following profiles are supported:

| Profile | Description | Disks | Space efficiency | Resiliency |
|---|---|---|---|---|
| SINGLE | For single disks or for spanned volumes (a.k.a. Just a Bunch Of Drives, JBOD). | 1 disk or more | 100% | None |
| MIXED* | Combines metadata and data chunks into one. Useful for very small devices. Can be used on multiple devices. | 1 disk or more | 100% | None |
| DUP* | DUP means duplicate. This ensures two copies exist on the same disk. Can be used on one or several drives like SINGLE mode but does not protect against disk failures. | 1 disk or more | 50% | Some (*) |
| RAID0 | Similar to SINGLE, but with data allocated in parallel stripes on all drives. Can increase performance in some workloads. | 2 disks or more | 100% | None |
| RAID1 | Like DUP, but stores each of the 2 copies on separate disks. | 2 disks or more | 50% | 1 disk failure |
| RAID1c3 | Stores 3 copies on separate disks. | 3 disks or more | 33.3% | 2 disk failures |
| RAID1c4 | Stores 4 copies on separate disks. | 4 disks or more | 25% | 3 disk failures |
| RAID10 | A combination of RAID1+RAID0 modes for increased performance and redundancy. | 4 disks or more | 50% | 1 disk failure |
| RAID5* | A striped mode with 1 disk as redundancy. Can increase performance in some workloads. | 3 disks or more | (N-1)/N | 1 disk failure |
| RAID6* | A striped mode with 2 disks as redundancy. Can increase performance in some workloads. | 4 disks or more | (N-2)/N | 2 disk failures |

* Mixed mode combines data and metadata in the same block groups. It can only be set when creating the filesystem with mkfs.btrfs and cannot be changed afterwards.
* DUP mode protects against data or metadata corruption, but not disk failures.
* RAID 5/6 modes are not yet stable or suitable for production use.
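The space-efficiency figures can be turned into a rough capacity estimate. The sketch below assumes all disks have the same size, ignores metadata overhead, and uses integer arithmetic. The function name usable_gib is made up for illustration. RAID5 gives up one disk's worth of space to parity and RAID6 gives up two.

```shell
#!/bin/sh
# usable_gib PROFILE DISKS SIZE_GIB
# Rough usable capacity for N equally sized disks (hypothetical helper).
usable_gib() {
    profile=$1 n=$2 size=$3
    total=$((n * size))
    case $profile in
        single|raid0)     echo "$total" ;;
        dup|raid1|raid10) echo $((total / 2)) ;;        # two copies
        raid1c3)          echo $((total / 3)) ;;        # three copies
        raid1c4)          echo $((total / 4)) ;;        # four copies
        raid5)            echo $(( (n - 1) * size )) ;; # one disk of parity
        raid6)            echo $(( (n - 2) * size )) ;; # two disks of parity
        *) echo "unknown profile: $profile" >&2; return 1 ;;
    esac
}

usable_gib raid1 2 1000     # prints 1000
usable_gib raid1c3 3 1000   # prints 1000
usable_gib raid6 6 1000     # prints 4000
```

With disks of different sizes the real figure depends on how block groups can be spread across devices, so treat the output as an upper bound.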

It is possible to use different profiles for metadata chunks and normal data chunks, for example the dup profile for metadata and the single profile for data chunks on a single disk.

Subvolumes

A subvolume is a part of the filesystem with its own independent file and directory hierarchy. Subvolumes can be mounted as normal filesystems and they can be renamed or moved like normal directories. Nesting subvolumes inside each other is also possible.

A subvolume in Btrfs can be accessed in two ways:

  • like any other directory that is accessible to the user
  • as a separately mounted filesystem

When a Btrfs filesystem is created with mkfs.btrfs, an initial subvolume is created, often referred to as the top-level[1] or root volume. It is common to create /home and other mountpoints as subvolumes rather than dividing the physical disk into partitions.
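A minimal sketch of such a layout follows. It prints the commands instead of running them, because they need root and a real Btrfs filesystem. The helper name plan_subvolume_layout, the device /dev/sda3, the temporary mountpoint /mnt for the top-level subvolume and the subvolume names are all assumptions for the example.

```shell
#!/bin/sh
# plan_subvolume_layout DEVICE TOP_MOUNT NAME...
# Prints (does not run) commands that create one subvolume per NAME under
# the mounted top-level subvolume and mount each of them at /NAME.
plan_subvolume_layout() {
    dev=$1 top=$2
    shift 2
    for name in "$@"; do
        # Create the subvolume inside the mounted top-level subvolume.
        echo "btrfs subvolume create $top/$name"
        # Mount it on its own, selected by name with the subvol= option.
        echo "mount -o subvol=$name $dev /$name"
    done
}

plan_subvolume_layout /dev/sda3 /mnt home srv
```

Both access methods from the list above are visible here: the new subvolume appears as a directory under /mnt, and it can also be mounted separately at its own mountpoint.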

A comparison between traditional disk partitions and Btrfs subvolumes:

  • Subvolumes can share file extents (file data) with each other.
  • Partitions are block-level separations and cannot share data.
  • All subvolumes share the available space of the whole filesystem.
  • Subvolumes can be snapshotted, renamed, deleted or made read-only.

Snapshots

A snapshot is a subvolume that is a clone (reflink) of another subvolume. Snapshots can be created as read-write or read-only. File modifications in a snapshot do not affect the files in the original subvolume.

  • Snapshots only store differences, so initially they take no additional disk space.
  • Snapshots can be used to store several revisions of the subvolume.
  • Snapshots do not have an incremental relationship. They do not depend on keeping the previous snapshots to remain valid.
  • Efficient incremental backups are possible using Btrfs send|receive. Snapshots can be sent to another Btrfs filesystem or to a different backup location over the network. When using incremental snapshots, only the differences between snapshots are sent, greatly reducing the space and time needed to make the backup.
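An incremental backup with send|receive could look like the sketch below. It prints the commands rather than executing them, since they need root and two Btrfs filesystems. The helper name plan_incremental_backup and all paths are hypothetical. It assumes the previous snapshot already exists on both the source and the backup side.

```shell
#!/bin/sh
# plan_incremental_backup SUBVOL SNAPDIR PREV_SNAP NEW_SNAP DEST
# Prints (does not run) the commands for one incremental backup step.
plan_incremental_backup() {
    subvol=$1 snapdir=$2 prev=$3 new=$4 dest=$5
    # Take a read-only snapshot; send only accepts read-only snapshots.
    echo "btrfs subvolume snapshot -r $subvol $snapdir/$new"
    # Send only the difference to the parent snapshot (-p) and
    # recreate the new snapshot on the receiving filesystem.
    echo "btrfs send -p $snapdir/$prev $snapdir/$new | btrfs receive $dest"
}

plan_incremental_backup /home /snapshots home-monday home-tuesday /backup
```

To send over the network, the receive side of the pipe can be run on another host, for example through ssh. Once the new snapshot exists on both sides, the older parent can be deleted without invalidating it.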

Cloning and Deduplication

A great feature of Btrfs is the ability to clone files atomically. This is usually called a reflink.

This allows the user to make an instant copy of a file. As with a hard link, no data is copied at first. Unlike a hard link, the two files are independent: when either the original file or the copy is modified, CoW ensures that the change does not affect the other.

File cloning (reflink, copy-on-write) via cp:

cp --reflink <source file> <destination file>
Tip! Put an alias for cp in your .bash_profile or /etc/profile.d/ so that it always does reflinks. See Blog/Bash Aliases.

Reflink copies of files and directories are a useful way to make instant copies before making changes, such as a MediaWiki or WordPress upgrade.
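The independence of a clone from its source can be checked with a small experiment. The sketch below assumes GNU cp and uses --reflink=auto, which clones where the filesystem supports it and silently falls back to a normal copy elsewhere, so it runs on any filesystem. The function name reflink_demo and the file names are made up.

```shell
#!/bin/sh
# reflink_demo DIR
# Clones a file inside DIR, modifies the clone, and shows that the source
# is unchanged (hypothetical helper, assumes GNU cp).
reflink_demo() {
    dir=$1
    printf 'original\n' > "$dir/source.txt"
    # Clone if possible, otherwise fall back to a regular copy.
    cp --reflink=auto "$dir/source.txt" "$dir/clone.txt" || return 1
    # Appending to the clone triggers CoW for the changed part only.
    printf 'changed\n' >> "$dir/clone.txt"
    echo "source: $(cat "$dir/source.txt")"
    echo "clone ends with: $(tail -n 1 "$dir/clone.txt")"
}
```

Calling reflink_demo "$(mktemp -d)" prints "source: original" followed by "clone ends with: changed". Note that plain cp --reflink means --reflink=always, which fails instead of falling back when the filesystem cannot clone.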

Deduplication means taking two or more files and joining their identical parts as reflinked copies. If one of the files is later changed, CoW ensures that the files remain independent of each other. Deduplication can save a lot of disk space. See the deduplication page for more in-depth usage.
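The first step of deduplication is finding candidates. The sketch below is a simplified stand-in for what real deduplication tools do: it only finds whole files with identical content by comparing SHA-256 digests, and it does not merge anything. The function name find_duplicates is made up for illustration.

```shell
#!/bin/sh
# find_duplicates DIR
# Lists files under DIR whose content is identical to at least one other
# file, as "digest  path" lines grouped by digest (hypothetical helper).
find_duplicates() {
    find "$1" -type f -exec sha256sum {} + | sort | awk '
        { hash = $1 }
        # Same digest as the previous line: print the first member of the
        # group once, then every further member.
        hash == prev { if (!printed) print prevline; print; printed = 1 }
        hash != prev { printed = 0 }
        { prev = hash; prevline = $0 }'
}
```

Files reported in the same group could then be replaced by reflink copies of one another. Dedicated tools go further and match identical extents inside otherwise different files.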

Data Allocation

Btrfs allocates all data in chunks, also referred to as block groups. There are three different types of chunks: SYSTEM, METADATA and DATA.

| Type | Description |
|---|---|
| DATA | Stores normal user file data |
| METADATA | Stores internal metadata. Small files can also be stored inline |
| SYSTEM | Stores mapping between physical devices and the logical space representing the filesystem |
| UNALLOCATED | Any unallocated space |

It is possible to use different profiles for DATA and METADATA in order to maximize space usage or resiliency against corruption. For example, it is common to use the SINGLE profile for DATA and the DUP profile for METADATA on single disk filesystems.

Each block group is allocated from the unallocated space as needed. DATA and METADATA block groups are normally allocated 1GiB at a time, multiplied by the number of copies the profile keeps. On a RAID1 filesystem, for example, each allocation takes 2x1GiB of raw space.

Because of the dynamic way Btrfs allocates block groups, it is somewhat difficult to calculate available disk space. You have to account for the fact that METADATA is dynamic and that you can have different PROFILES.

Example of a single disk filesystem using DUP and SINGLE profiles. You can see how METADATA DUP profile doubles the allocated space to 12GiB:

# btrfs filesystem usage /mnt
 Overall:
    Device size:                 233.47GiB
    Device allocated:            108.06GiB
    Device unallocated:          125.41GiB
    Device missing:                  0.00B
    Used:                         71.30GiB
    Free (estimated):            153.02GiB      (min: 90.32GiB)
    Data ratio:                       1.00
    Metadata ratio:                   2.00
    Global reserve:              195.05MiB      (used: 0.00B)
    Multiple profiles:                  no
 
 Data,single: Size:96.00GiB, Used:68.38GiB (71.23%)
   /dev/sda3      96.00GiB
 
 Metadata,DUP: Size:6.00GiB, Used:1.46GiB (24.29%)
   /dev/sda3      12.00GiB
 
 System,DUP: Size:32.00MiB, Used:16.00KiB (0.05%)
   /dev/sda3      64.00MiB
 
 Unallocated:
   /dev/sda3     125.41GiB
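
The figures in this output can be reproduced by hand. The sketch below assumes that "Free (estimated)" is the free space inside data block groups plus the unallocated space divided by the data ratio, and that the minimum divides the unallocated space by the largest ratio in use instead (2.00 here, from the DUP metadata). That reading matches the numbers above but is an assumption, and the function name estimate_usage is made up.

```shell
#!/bin/sh
# estimate_usage DATA_SIZE DATA_USED META_RAW SYS_RAW UNALLOC DATA_RATIO MAX_RATIO
# All sizes in GiB, as raw space on the device (hypothetical helper).
estimate_usage() {
    awk -v size="$1" -v used="$2" -v meta="$3" -v sys="$4" \
        -v unalloc="$5" -v ratio="$6" -v maxratio="$7" 'BEGIN {
        # Raw space handed out to block groups of all three types.
        allocated = size + meta + sys
        # Space still free inside the existing data block groups.
        free_in_chunks = (size - used) / ratio
        printf "allocated %.1f GiB\n", allocated
        printf "free (estimated) %.1f GiB\n", free_in_chunks + unalloc / ratio
        printf "free (min) %.1f GiB\n", free_in_chunks + unalloc / maxratio
    }'
}

# Values taken from the output above (64MiB of System,DUP = 0.0625GiB).
estimate_usage 96.00 68.38 12.00 0.0625 125.41 1 2
```

This prints 108.1, 153.0 and 90.3 GiB, in line with the reported 108.06GiB allocated, 153.02GiB estimated free and 90.32GiB minimum. Small differences come from rounding in the displayed values.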
  1. Btrfs glossary[2]