From Forza's ramblings

Btrfs

[Image: Unofficial Btrfs logotype]

Btrfs is a modern filesystem for Linux aimed at implementing advanced features while also focusing on fault tolerance, repair and easy administration. Btrfs can be used as a generic filesystem in most situations.

Originally developed in 2007, Btrfs has evolved steadily and is still under active development. Its on-disk format has been considered stable since 2013.

Btrfs combines many features traditionally found in md-raid and LVM, and introduces new concepts such as subvolumes and reflinks. This makes it difficult to compare Btrfs with traditional Linux filesystems like ext4.

One very important feature is that Btrfs keeps checksums for all data, not only metadata. This means it can reliably detect (and, depending on the chosen profile, automatically repair) corruption that would go unnoticed in other filesystems or storage setups.

It is important to understand that Btrfs is quite different from a traditional Linux filesystem because it bridges traditionally distinct storage layers: multiple device management (md RAID), volume management (LVM), data integrity verification (dm-integrity) and self-healing. This adds a great deal of flexibility that is very difficult to achieve across these boundaries with separate tools.

Features

Copy-on-Write

Btrfs uses a technique called Copy-on-Write (CoW) for all writes to the filesystem. CoW means that a write always happens in a new block on the disk instead of overwriting an existing data block. Once the new block is written to disk, the metadata is updated to point to it. This ensures data integrity in case of a failed write: you have either the original data or the new data. If a write fails in a traditional filesystem, the contents of a data block may instead be incomplete or wrong.
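
As a loose user-space analogy (not how Btrfs works internally), the write-to-a-new-location-then-switch-the-pointer idea can be sketched with a temporary file and an atomic rename. The helper name cow_update and the file names are hypothetical.

```shell
# Analogy only: write the new version to a fresh location, then switch the
# "pointer" (here: the file name) in a single atomic rename via mv.
cow_update() {
    target=$1
    new_content=$2
    tmp="$target.new.$$"          # new location; the old data stays untouched
    if ! printf '%s\n' "$new_content" > "$tmp"; then
        rm -f "$tmp"              # failed write: the original is still intact
        return 1
    fi
    mv -f "$tmp" "$target"        # atomic switch (same directory, same filesystem)
}

workdir=$(mktemp -d)
printf '%s\n' "original data" > "$workdir/file"
cow_update "$workdir/file" "new data"
cat "$workdir/file"
```

At every point in time a reader of the file sees either the complete old content or the complete new content, which is the property the paragraph above describes.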

As of Linux kernel 5.0, Btrfs has the following features:

Data checksum and integrity

  • Checksums on all data and metadata (crc32c, xxhash64, sha256 or blake2)
  • Self-healing in some configurations due to the nature of copy-on-write
  • Tree-checker, post-read and pre-write metadata verification
  • Online data scrubbing for finding errors and automatically fixing them for files with redundant copies
  • Offline filesystem check
  • Transparent compression via zlib, LZO and ZSTD, configurable per file or volume
  • Data deduplication using userspace tools
  • Online defragmentation as well as autodefrag mount option
  • In-place conversion from ext3/4 to Btrfs (with rollback)
  • Swap files
  • Block discard (trim support)
  • File cloning (reflink, copy-on-write)
  • Quotas, subvolume-aware
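
The checksum and self-healing items above can be illustrated in user space. This is a rough sketch, not Btrfs code: it uses the POSIX cksum CRC (a different CRC than the crc32c used by Btrfs), two ordinary files standing in for the two copies of a DUP or RAID1 block, and the hypothetical names copy_a, copy_b and read_verified.

```shell
workdir=$(mktemp -d)
printf 'important data\n' > "$workdir/copy_a"
cp "$workdir/copy_a" "$workdir/copy_b"       # second, redundant copy

# Checksum recorded at write time (Btrfs stores checksums in its metadata).
stored_sum=$(cksum < "$workdir/copy_a")

# Simulate silent corruption of the first copy.
printf 'important dat4\n' > "$workdir/copy_a"

# Read with verification: find a copy whose checksum matches, rewrite every
# copy that fails verification from it, then return the good data.
read_verified() {
    good=
    for copy in "$@"; do
        if [ "$(cksum < "$copy")" = "$stored_sum" ]; then
            good=$copy
            break
        fi
    done
    if [ -z "$good" ]; then
        echo "all copies are corrupt" >&2
        return 1
    fi
    for copy in "$@"; do
        if [ "$(cksum < "$copy")" != "$stored_sum" ]; then
            cp "$good" "$copy"               # repair the bad copy
        fi
    done
    cat "$good"
}

read_verified "$workdir/copy_a" "$workdir/copy_b"
```

With only one copy (as with the SINGLE profile) the corruption would still be detected, but there would be nothing to repair it from.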

Volume management

Volume management in Btrfs is the ability to combine and manage several disks as one filesystem.

  • Data and metadata profiles: SINGLE, DUP, RAID 0, RAID 1, RAID 1c3, RAID 1c4 and RAID 10
  • Subvolumes (one or more separately mountable filesystem roots within each volume)
  • Online volume growth and shrinking
  • Online block device addition and removal
  • Online balancing (moving blocks to balance load and make space usage more efficient)
  • Online conversion between data profiles (convert between different RAID levels or RAID<->SINGLE/DUP)
  • Snapshots, writable and read-only
  • Incremental backup
  • Send/receive (saving diffs between snapshots to a binary stream)
  • Seed devices. Create a (read-only) filesystem that acts as a template to seed other Btrfs filesystems. Using copy on write, all modifications are stored on different devices and the original is unchanged.
  • Zoned device support (SMR/ZBC/ZNS friendly allocation)

The following profiles are supported:

| Profile | Description | Disks | Space efficiency |
|---|---|---|---|
| SINGLE | For single disks or for spanned volumes (a.k.a. Just a Bunch Of Drives, JBOD). | 1 or more | 100% |
| DUP | DUP means duplicate. This ensures that two copies exist on the same disk. Can be used on one or several drives like SINGLE mode, but does not protect against disk failures. | 1 or more | 50% |
| RAID0 | Similar to SINGLE, but with data allocated in parallel stripes on all drives. Can increase performance in some workloads. | 2 or more | 100% |
| RAID1 | Like DUP, but stores 2 copies on separate disks. | 2 or more | 50% |
| RAID1c3 | Stores 3 copies on separate disks. | 3 or more | 33.3% |
| RAID1c4 | Stores 4 copies on separate disks. | 4 or more | 25% |
| RAID10 | A combination of RAID1 and RAID0 modes for increased performance in some workloads. | 4 or more | 50% |
| RAID5* | Uses one disk's worth of space for redundancy. | 3 or more | (N-1)/N |
| RAID6* | Uses two disks' worth of space for redundancy. | 4 or more | (N-2)/N |

* Note that the RAID 5/6 modes are not yet stable.
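
The space efficiency figures can be turned into a quick estimate. The sketch below assumes N equally sized disks and ignores metadata overhead; the function name usable_gib is hypothetical, and the result is only an approximation because Btrfs allocates space dynamically.

```shell
# usable_gib PROFILE DISKS DISK_SIZE_GIB
# Rough usable capacity in whole GiB for equally sized disks.
usable_gib() {
    profile=$1
    disks=$2
    size=$3
    raw=$((disks * size))
    case $profile in
        single|raid0)     echo "$raw" ;;
        dup|raid1|raid10) echo $((raw / 2)) ;;             # two copies of everything
        raid1c3)          echo $((raw / 3)) ;;             # three copies
        raid1c4)          echo $((raw / 4)) ;;             # four copies
        raid5)            echo $(((disks - 1) * size)) ;;  # one disk's worth of parity
        raid6)            echo $(((disks - 2) * size)) ;;  # two disks' worth of parity
        *) echo "unknown profile: $profile" >&2; return 1 ;;
    esac
}

usable_gib raid1 2 1000      # prints 1000
usable_gib raid1c3 3 1000    # prints 1000
usable_gib raid6 6 1000      # prints 4000
```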

It is possible to use different profiles for metadata chunks and normal data chunks. For example, a single disk can use the DUP profile for metadata and the SINGLE profile for data.

Subvolumes

A subvolume is a part of the filesystem with its own independent file and directory hierarchy. Subvolumes can be mounted as normal filesystems and they can be renamed or moved. Nesting subvolumes inside each other is also possible.

A subvolume in Btrfs can be accessed in two ways:

  • like any other directory that is accessible to the user
  • as a separately mounted filesystem

When a Btrfs filesystem is created with mkfs.btrfs, an initial subvolume is created, often referred to as the top-level[1] or root subvolume. It is common to create /home and other mountpoints as subvolumes rather than dividing the physical disk into partitions.

A comparison of traditional disk partitions and Btrfs subvolumes:

  • Subvolumes can share file extents (file data) with each other.
  • Partitions are block-level separations and cannot share data.
  • All subvolumes share the available space of the whole filesystem.
  • Subvolumes can be snapshotted, renamed, deleted or made read-only.
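
A minimal sketch of the two access methods, assuming a Btrfs filesystem on the hypothetical device /dev/sdX1 whose top-level subvolume is mounted at /mnt. Because the device and paths are placeholders, the sketch only prints the commands unless APPLY=1 is set; run is a helper defined here, not part of btrfs-progs.

```shell
# Print each command; execute it only when APPLY=1 (needs root and a real device).
run() {
    printf '+ %s\n' "$*"
    if [ "${APPLY:-0}" = 1 ]; then "$@"; fi
}

dev=/dev/sdX1    # hypothetical block device
top=/mnt         # hypothetical mount point of the top-level subvolume

run btrfs subvolume create "$top/home"     # visible as the directory /mnt/home
run btrfs subvolume list "$top"
run mount -o subvol=home "$dev" /home      # the same subvolume, mounted separately
```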

Snapshots

A snapshot is a subvolume that is a clone (reflink) of another subvolume. They can be created as read-write or read-only. File modifications in a snapshot do not affect the files in the original subvolume.

  • Read-only snapshots can be used to store incremental revisions of the filesystem.
  • Efficient incremental backups are possible using Btrfs send|receive. Snapshots can be sent to another Btrfs filesystem or to a different backup location over the network. When using incremental snapshots, only the differences between two snapshots are sent, greatly reducing the space and time needed to make the backup.
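
A hedged sketch of an incremental backup cycle with read-only snapshots. All paths are hypothetical, the destination is assumed to be a second Btrfs filesystem mounted at /backup, and the commands are only printed unless APPLY=1 is set (run is a helper defined here, not a btrfs command).

```shell
# Print each command; execute it only when APPLY=1 (needs root and real paths).
run() {
    printf '+ %s\n' "$*"
    if [ "${APPLY:-0}" = 1 ]; then "$@"; fi
}

src=/mnt/home            # hypothetical subvolume to back up
snaps=/mnt/snapshots     # hypothetical snapshot directory on the same filesystem
dest=/backup             # hypothetical mount point of another Btrfs filesystem
stream=/var/tmp          # hypothetical location for the stream files

# Initial full transfer of a read-only snapshot.
run btrfs subvolume snapshot -r "$src" "$snaps/home-1"
run btrfs send -f "$stream/home-1.stream" "$snaps/home-1"
run btrfs receive -f "$stream/home-1.stream" "$dest"

# Later: transfer only the differences relative to the parent snapshot home-1.
run btrfs subvolume snapshot -r "$src" "$snaps/home-2"
run btrfs send -p "$snaps/home-1" -f "$stream/home-2.stream" "$snaps/home-2"
run btrfs receive -f "$stream/home-2.stream" "$dest"
```

The stream is written to a file here to keep the steps separate. In practice it is often piped straight from btrfs send into btrfs receive, optionally through ssh for a remote destination.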

Cloning and Deduplication

A rather unique feature of Btrfs is the ability to clone files in an atomic way. This is usually called a reflink.

This allows the user to make an instant copy of a file. Like a hard link, the copy initially shares its data with the original, but when either file is modified, CoW ensures that the change does not affect the other.

File cloning (reflink, copy-on-write) via cp:

cp --reflink <source file> <destination file>
Tip! Put an alias for cp in your .bash_profile or /etc/profile.d/ so that it always does reflinks. See Blog/Bash Aliases.
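
A small self-contained demonstration, assuming GNU coreutils cp. It uses --reflink=auto, which falls back to an ordinary copy on filesystems without reflink support, so it also runs outside Btrfs; a bare --reflink means --reflink=always and fails where cloning is not possible. The file names are hypothetical.

```shell
workdir=$(mktemp -d)
printf 'shared content\n' > "$workdir/original"

# Clone the file; on Btrfs this shares the data extents instead of copying them.
cp --reflink=auto "$workdir/original" "$workdir/clone"

# Modify the clone: the change is written elsewhere, the original is unaffected.
printf 'changed in clone\n' >> "$workdir/clone"

cat "$workdir/original"
cat "$workdir/clone"
```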

Deduplication means taking two or more files and joining their identical parts as reflinked copies. If one of the files is changed, CoW makes sure that the files remain independent of each other. Deduplication can save a lot of disk space. See the deduplication page for more in-depth usage.

Data Allocation

Btrfs allocates all data in block groups. There are three types: SYSTEM, METADATA and DATA.

| Type | Description |
|---|---|
| DATA | Stores normal user file data |
| METADATA | Stores internal metadata. Small files can also be stored inline |
| SYSTEM | Stores the mapping between physical devices and the logical space representing the filesystem |
| UNALLOCATED | Any unallocated space |

It is possible to use different profiles for DATA and METADATA in order to maximize space usage or resiliency against corruption. For example, it is common to use the SINGLE profile for DATA and the DUP profile for METADATA on single-disk filesystems.

Each block group is allocated from the unallocated space as needed. DATA and METADATA block groups are normally allocated 1GiB at a time, multiplied by the redundancy of the profile in use. For example, a 1GiB block group with the DUP profile takes 2GiB of raw disk space.

Because of the dynamic way Btrfs allocates block groups, it is somewhat difficult to calculate available disk space. You have to account for the fact that METADATA is dynamic and that you can have different PROFILES.

Example of a single-disk filesystem using the DUP and SINGLE profiles. You can see how the DUP profile doubles the 6GiB of METADATA to 12GiB of allocated space:

# btrfs filesystem usage /mnt
    Device size:                 233.47GiB
    Device allocated:            108.06GiB
    Device unallocated:          125.41GiB
    Device missing:                  0.00B
    Used:                         71.30GiB
    Free (estimated):            153.02GiB      (min: 90.32GiB)
    Data ratio:                       1.00
    Metadata ratio:                   2.00
    Global reserve:              195.05MiB      (used: 0.00B)
    Multiple profiles:                  no
 Data,single: Size:96.00GiB, Used:68.38GiB (71.23%)
   /dev/sda3      96.00GiB
 Metadata,DUP: Size:6.00GiB, Used:1.46GiB (24.29%)
   /dev/sda3      12.00GiB
 System,DUP: Size:32.00MiB, Used:16.00KiB (0.05%)
   /dev/sda3      64.00MiB
 Unallocated:
   /dev/sda3     125.41GiB
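
As a cross-check of the numbers above, the per-device lines of each block group can be summed with awk. This sketch feeds the relevant lines of the sample output through a here-document and uses the hypothetical function name sum_allocated. On a live system the same function could read the output of btrfs filesystem usage, but that output is meant for humans and its layout may change.

```shell
# Sum the raw space each block group type occupies on the device and print
# the total, which should match the "Device allocated" line.
sum_allocated() {
    awk '
        /^ *(Data|Metadata|System),/ { in_group = 1; next }   # block group header
        /^ *Unallocated:/            { in_group = 0; next }   # not allocated space
        in_group && /^ *\/dev\// {
            value = $2
            if (value ~ /MiB$/)      { sub(/MiB$/, "", value); total += value / 1024 }
            else if (value ~ /GiB$/) { sub(/GiB$/, "", value); total += value }
        }
        END { printf "%.2fGiB\n", total }
    '
}

sum_allocated <<'EOF'
 Data,single: Size:96.00GiB, Used:68.38GiB (71.23%)
   /dev/sda3      96.00GiB
 Metadata,DUP: Size:6.00GiB, Used:1.46GiB (24.29%)
   /dev/sda3      12.00GiB
 System,DUP: Size:32.00MiB, Used:16.00KiB (0.05%)
   /dev/sda3      64.00MiB
EOF
```

This prints 108.06GiB: 96GiB of data, 12GiB for the two copies of 6GiB of metadata and 64MiB for the two copies of the system block group. Adding the 125.41GiB of unallocated space gives the device size of 233.47GiB.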

  1. Btrfs glossary