Btrfs/Replacing a disk


Replacing a disk in a btrfs filesystem

The read/write head crashed on the platter in a HDD

btrfs replace replaces an existing disk by doing a byte-for-byte copy of all data from the old disk to the new one. It is similar to a dd clone, but works at the filesystem level.

Warning! Do not use dd, LVM snapshots or other disk cloning tools to copy Btrfs filesystems. There are some important caveats that need to be considered. On older kernels cloning can directly lead to corruption!

btrfs replace is the preferred method of replacing a disk in a btrfs filesystem, especially when there is a damaged or missing device. While btrfs device add + btrfs device remove also works, it is a much slower method and can cause issues if there are read/write errors. btrfs replace is not only faster, it handles failures and errors better.

btrfs replace is run on a mounted filesystem. If you have a missing disk in a filesystem with a redundant RAID profile, you can mount the filesystem using the degraded mount option.

# mount -o degraded /dev/sdb1 /mnt/btrfs
Note: Systemd may automatically unmount a filesystem listed in fstab if you manually mount it degraded! Avoid this by commenting out the fstab line and running systemctl daemon-reload before mounting the filesystem in degraded mode.
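
Before commenting anything out, it can help to see which fstab lines are actually active for the mount point. The following is only a sketch: fstab_entries_for is a hypothetical helper name, and it assumes the mount point is written in fstab exactly as you pass it (no trailing slash difference).

```shell
# Hypothetical helper: print the active (uncommented) fstab lines
# whose second field is the given mount point.
# fstab content is read from stdin so nothing is modified.
fstab_entries_for() {
    # $1 = mount point, compared literally against field 2
    awk -v mp="$1" '$1 !~ /^#/ && $2 == mp'
}
```

Run it as fstab_entries_for /mnt/btrfs < /etc/fstab. Any line it prints is a candidate to comment out before running systemctl daemon-reload.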

Prepare the new disk

Before you start replacing the old disk you should prepare the new disk.

Even though Btrfs supports raw disks, it is recommended that you have a partition table on your disk to avoid confusion with other filesystems and tools. You can use fdisk or cfdisk from the util-linux package, or GNU parted, to create a partition table and a partition to hold your new btrfs filesystem. Use a GUID Partition Table (GPT) instead of the old DOS MBR style partition table. GPT supports disks larger than 2TiB and keeps a backup copy of the partition table.

If you have an NVMe or SSD disk, it is good practice to empty it using blkdiscard. Discard tells the drive's firmware that the disk is empty, which improves its performance and wear. Do this before you create any partition tables, as it will erase everything on the disk.

Here is a basic example of how to use GNU parted to create a GPT partition table and one partition that fills the whole device /dev/nvme0n4:

1) First we issue blkdiscard to clear the new disk.

# blkdiscard /dev/nvme0n4 -v
/dev/nvme0n4: Discarded 10737418240 bytes from the offset 0

2) Then we create a new partition table using parted.

# parted /dev/nvme0n4
GNU Parted 3.4
Using /dev/nvme0n4
Welcome to GNU Parted! Type 'help' to view a list of commands.
(parted) mklabel gpt
(parted) mkpart primary btrfs 4MiB 100%
(parted) print
Model: ORCL-VBOX-NVME-VER12 (nvme)
Disk /dev/nvme0n4: 10.7GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt
Disk Flags:

Number  Start   End     Size    File system  Name     Flags
 1      4194kB  10.7GB  10.7GB  btrfs        primary

(parted) quit
Information: You may need to update /etc/fstab.
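
The same two steps can also be done without the interactive prompt. This is a minimal sketch, not a tested procedure: prepare_disk_commands is a hypothetical helper that only prints the commands so you can review them first. It relies on parted's --script option for non-interactive use.

```shell
# Hypothetical helper: print (do not run) the commands that prepare a new disk.
# Review the output carefully, then run the lines as root.
prepare_disk_commands() {
    dev=$1
    # Discard all blocks first (SSD/NVMe only). This erases the whole device.
    printf 'blkdiscard -v %s\n' "$dev"
    # --script makes parted non-interactive; same mklabel and mkpart as above.
    printf 'parted --script %s mklabel gpt mkpart primary btrfs 4MiB 100%%\n' "$dev"
    # Show the resulting partition table.
    printf 'parted --script %s print\n' "$dev"
}
```

For example, prepare_disk_commands /dev/nvme0n4 prints three lines. Printing instead of executing is a deliberate choice here, since a typo in the device name would wipe the wrong disk.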

Replacing with equal sized or a larger disk

The most common case is when you want to replace a disk with a new disk of equal or larger size.

First you need to check what devid the old disk has:

# btrfs filesystem show
Label: 'my-vault'  uuid: df68a30d-d26e-4b9c-9606-a130e66ce63d
        Total devices 1 FS bytes used 658.88GiB
        devid    1 size 931.51GiB used 667.02GiB path /dev/sdc1
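
If you want to look up the devid in a script, you can filter the output shown above. This is a small sketch: devid_for_path is a hypothetical helper, and it assumes the devid lines look like the ones above, with the device path as the last field.

```shell
# Hypothetical helper: read "btrfs filesystem show" output on stdin and
# print the devid of the device whose path matches $1.
devid_for_path() {
    # devid lines look like: devid  1 size ... used ... path /dev/sdc1
    awk -v p="$1" '$1 == "devid" && $NF == p { print $2 }'
}
```

Usage: btrfs filesystem show /mnt/my-vault | devid_for_path /dev/sdc1. It prints nothing if the path is not part of the filesystem.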

Now you can start the replacing process.

# btrfs replace start <id> <new-disk> <mount-point>
# btrfs replace start 1    /dev/sdd1  /mnt/my-vault/ 

This will copy all data from the old disk /dev/sdc1 to the new disk /dev/sdd1.

Replace continues in the background. When it is complete you can use the old disk for other purposes or remove it from the system. See the status monitoring chapter for checking the progress.

If the new disk is larger than the old, you need to resize the filesystem to take advantage of the new size:

# btrfs filesystem resize 1:max /mnt/my-vault

The special max keyword ensures that Btrfs uses all available space on the disk.

Replacing a disk in a RAID array

The process of replacing a disk in a multi-disk filesystem works the same way.

Consider this 4-disk RAID-1 filesystem:

# btrfs filesystem show /mnt/raid1/
Label: 'vault'  uuid: da2028f1-377d-4100-849b-29e39b869137
        Total devices 4 FS bytes used 6.19GiB
        devid    1 size 8.00GiB used 3.25GiB path /dev/nvme0n1
        devid    2 size 8.00GiB used 3.25GiB path /dev/nvme0n2
        devid    3 size 8.00GiB used 3.26GiB path /dev/nvme0n3
        devid    4 size 8.00GiB used 3.26GiB path /dev/nvme0n4

To replace /dev/nvme0n1 use the same command as before:

# btrfs replace start <id> <new-disk> <mount-point>
# btrfs replace start 1 /dev/nvme0n5 /mnt/raid1/

Once the replacement has completed (see status monitoring) you can use btrfs filesystem show to see the new layout.

# btrfs filesystem show /mnt/raid1/
Label: 'vault'  uuid: da2028f1-377d-4100-849b-29e39b869137
        Total devices 4 FS bytes used 6.19GiB
        devid    1 size 8.00GiB used 3.25GiB path /dev/nvme0n5
        devid    2 size 8.00GiB used 3.25GiB path /dev/nvme0n2
        devid    3 size 8.00GiB used 3.26GiB path /dev/nvme0n3
        devid    4 size 8.00GiB used 3.26GiB path /dev/nvme0n4

If the new disk is larger than the old, you need to resize the filesystem to take advantage of the new size:

# btrfs filesystem resize 1:max /mnt/raid1/

The special max keyword ensures that Btrfs uses all available space on the disk.

Replacing with a smaller disk

btrfs replace can only replace a disk with one of equal or larger size. If your new disk is smaller than the disk you intend to replace, you need to shrink the filesystem before you can attempt a replacement.

You need to determine what devid and size the old disk has:

# btrfs filesystem show
Label: 'my-vault'  uuid: df68a30d-d26e-4b9c-9606-a130e66ce63d
        Total devices 1 FS bytes used 658.88GiB
        devid    1 size 931.51GiB used 667.02GiB path /dev/sdc1

For example, if the new disk is only 800GiB, you need to shrink the filesystem on /dev/sdc1 to less than that:

# btrfs filesystem resize <id>:<size> <mount-point>
# btrfs filesystem resize  1:799GiB   /mnt/my-vault
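
The target size can also be worked out from the exact size of the new partition. This is a sketch under assumptions: shrink_target_bytes is a hypothetical helper, and the 1GiB safety margin is an arbitrary default chosen to match the 799GiB example above. The new partition's size in bytes can be read with blockdev --getsize64 from util-linux.

```shell
# Hypothetical helper: given the new device size in bytes, print a shrink
# target that leaves a safety margin (assumed default: 1 GiB).
shrink_target_bytes() {
    new_bytes=$1
    margin=${2:-1073741824}
    if [ "$new_bytes" -le "$margin" ]; then
        echo "new device is not larger than the margin" >&2
        return 1
    fi
    echo $(( new_bytes - margin ))
}
```

For an 800GiB partition (858993459200 bytes) this prints 857919717376, which is exactly 799GiB. A size without a unit suffix is taken as bytes by btrfs filesystem resize, so the number can be used as 1:857919717376. The target must still be larger than the "used" value reported for the old device.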

Now you can start the replacing process.

# btrfs replace start <id> <new-disk> <mount-point>
# btrfs replace start  1   /dev/sdd1   /mnt/my-vault/

This will copy all data from the old disk /dev/sdc1 to the new disk /dev/sdd1. Once the replace has completed (see status monitoring) you can physically remove the old disk from your system. You should then make sure the filesystem uses all the space on the new disk.

# btrfs filesystem resize <id>:<size> <mount-point>
# btrfs filesystem resize 1:max /mnt/my-vault

The special max keyword ensures btrfs uses all available space on the disk.

Replacing a failed disk

Replacing a failed disk in a RAID array can be done in two ways depending on your situation. It is advised that you ask for help before attempting to repair your failing filesystem. https://wiki.tnonline.net/w/Category:Btrfs#Help

Disk is online but is having errors

If a disk is having read errors, you can use the same process as described in the sections above. btrfs replace has a special option to avoid reading from the failing disk when possible. Reading from disks with bad blocks can be very slow, so this option will help a lot.

# btrfs replace start --help
 -r     only read from <srcdev> if no other zero-defect mirror exists (enable this if your drive has lots of read errors, the access would be very slow)

Example:

# btrfs replace start -r <id> <new-disk> <mount-point>
# btrfs replace start -r 5 /dev/nvme0n5 /mnt/raid10/

Disk is dead or removed from the system

If the filesystem is not mounted, you need to mount your disk with -o degraded, as the kernel won't mount a filesystem if some disks are missing.

# mount /dev/nvme0n1 /mnt/my-vault/
mount: /mnt/my-vault: wrong fs type, bad option, bad superblock on /dev/nvme0n1, missing codepage or helper program, or other error.

We can see in kernel logs using dmesg why the mount command failed.

# dmesg
[   39.537920] BTRFS error (device nvme0n1): devid 4 uuid a14e2826-db7b-41cc-a4b9-6f0d599a0c24 is missing
[   39.537925] BTRFS error (device nvme0n1): failed to read the system array: -2
[   39.538465] BTRFS error (device nvme0n1): open_ctree failed
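
Mounting with the degraded option works, and btrfs filesystem show reports that a device is missing:

# mount /dev/nvme0n2 /mnt/my-vault/ -o degraded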
# btrfs filesystem show /mnt/my-vault
Label: 'vault'  uuid: 7714d5de-5407-4fbe-b356-82bd086f6ded
        Total devices 4 FS bytes used 5.62GiB
        devid    1 size 8.00GiB used 3.20GiB path /dev/nvme0n1
        devid    2 size 8.00GiB used 3.20GiB path /dev/nvme0n2
        devid    3 size 8.00GiB used 3.20GiB path /dev/nvme0n3
        *** Some devices missing

First you need to find out the device ID of the missing disk.

# btrfs device usage /mnt/my-vault/
/dev/nvme0n1, ID: 1
   Device size:             8.00GiB
   Device slack:              0.00B
   Data,RAID10/4:           3.00GiB
   Metadata,RAID10/4:     192.00MiB
   System,RAID10/4:         8.00MiB
   Unallocated:             4.80GiB

/dev/nvme0n2, ID: 2
   Device size:             8.00GiB
   Device slack:              0.00B
   Data,RAID10/4:           3.00GiB
   Metadata,RAID10/4:     192.00MiB
   System,RAID10/4:         8.00MiB
   Unallocated:             4.80GiB

/dev/nvme0n3, ID: 3
   Device size:             8.00GiB
   Device slack:              0.00B
   Data,RAID10/4:           3.00GiB
   Metadata,RAID10/4:     192.00MiB
   System,RAID10/4:         8.00MiB
   Unallocated:             4.80GiB

missing, ID: 4
   Device size:               0.00B
   Device slack:              0.00B
   Data,RAID10/4:           3.00GiB
   Metadata,RAID10/4:     192.00MiB
   System,RAID10/4:         8.00MiB
   Unallocated:             4.80GiB
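
In a script, the ID of the missing device can be picked out of this output. This is only a sketch: missing_devids is a hypothetical helper and it assumes the header line has the "missing, ID: N" form shown above, which may differ between btrfs-progs versions.

```shell
# Hypothetical helper: read "btrfs device usage" output on stdin and
# print the ID of every device reported as missing.
missing_devids() {
    # header lines look like: "missing, ID: 4"
    awk '$1 == "missing," && $2 == "ID:" { print $3 }'
}
```

Usage: btrfs device usage /mnt/my-vault/ | missing_devids. It prints one ID per line, or nothing when no device is missing.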

Now you can replace the missing device with a new disk.

# btrfs replace start 4 /dev/sde1 /mnt/my-vault

Replace continues in the background. See the status monitoring chapter on how to monitor the progress. Once it has finished, btrfs filesystem show lists the new device in place of the missing one:

# btrfs filesystem show  /mnt/my-vault/
Label: 'vault'  uuid: 7714d5de-5407-4fbe-b356-82bd086f6ded
        Total devices 4 FS bytes used 5.62GiB
        devid    1 size 8.00GiB used 4.48GiB path /dev/nvme0n1
        devid    2 size 8.00GiB used 5.45GiB path /dev/nvme0n2
        devid    3 size 8.00GiB used 4.45GiB path /dev/nvme0n3
        devid    4 size 8.00GiB used 3.20GiB path /dev/sde1

Restoring redundancy after a replaced disk

IMPORTANT: Because btrfs cannot write any data to a missing device, data written while the filesystem is mounted degraded ends up in single profile chunks. To restore full redundancy you should run btrfs balance to convert those chunks to the correct RAID profile.

Use btrfs filesystem usage -T to see how chunks are allocated.

# btrfs fi usage -T /mnt/my-vault/ 
Overall:
    Device size:                  32.00GiB
    Device allocated:             17.56GiB
    Device unallocated:           14.44GiB
    Device missing:                  0.00B
    Used:                         11.25GiB
    Free (estimated):             12.26GiB      (min: 10.46GiB)
    Free (statfs, df):            15.40GiB
    Data ratio:                       1.60
    Metadata ratio:                   1.33
    Global reserve:               17.92MiB      (used: 0.00B)
    Multiple profiles:                 yes      (data, metadata, system)

                Data     Data    Metadata  Metadata  System    System
Id Path         single   RAID10  single    RAID10    single    RAID10   Unallocated
-- ------------ -------- ------- --------- --------- -------- -------- -----------
 1 /dev/nvme0n1  1.00GiB 3.00GiB 256.00MiB 192.00MiB 32.00MiB  8.00MiB     3.52GiB
 2 /dev/nvme0n2  2.00GiB 3.00GiB 256.00MiB 192.00MiB        -  8.00MiB     2.55GiB
 3 /dev/nvme0n3  1.00GiB 3.00GiB 256.00MiB 192.00MiB        -  8.00MiB     3.55GiB
 4 /dev/sde1           - 3.00GiB         - 192.00MiB        -  8.00MiB     6.80GiB
-- ------------ -------- ------- --------- --------- -------- -------- -----------
   Total         4.00GiB 6.00GiB 768.00MiB 384.00MiB 32.00MiB 16.00MiB    16.44GiB
   Used         64.00KiB 5.41GiB  16.00KiB 222.62MiB    0.00B 16.00KiB

Use the convert and soft keywords to convert the single chunks to the correct profile:

# btrfs balance start -dconvert=raid10,soft -mconvert=raid10,soft /mnt/my-vault/
Done, had to relocate 4 out of 8 chunks

You have now restored full redundancy:

# btrfs fi usage /mnt/my-vault/ -T
Overall:
    Device size:                  32.00GiB
    Device allocated:             28.00GiB
    Device unallocated:            4.00GiB
    Device missing:                  0.00B
    Used:                         11.25GiB
    Free (estimated):              9.45GiB      (min: 9.45GiB)
    Free (statfs, df):             9.45GiB
    Data ratio:                       2.00
    Metadata ratio:                   2.00
    Global reserve:               17.92MiB      (used: 0.00B)
    Multiple profiles:                  no

                Data     Metadata  System
Id Path         RAID10   RAID10    RAID10   Unallocated
-- ------------ -------- --------- -------- -----------
 1 /dev/nvme0n1  6.43GiB 544.00MiB 40.00MiB     1.00GiB
 2 /dev/nvme0n2  6.43GiB 544.00MiB 40.00MiB     1.00GiB
 3 /dev/nvme0n3  6.43GiB 544.00MiB 40.00MiB     1.00GiB
 4 /dev/sde1     6.43GiB 544.00MiB 40.00MiB     3.00GiB
-- ------------ -------- --------- -------- -----------
   Total        12.86GiB   1.06GiB 80.00MiB     6.00GiB
   Used          5.41GiB 222.64MiB 16.00KiB
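
The "Multiple profiles" line in the outputs above is a quick indicator of whether a balance is still needed, and it can be checked from a script. This is a sketch: has_multiple_profiles is a hypothetical helper. It assumes your btrfs-progs version prints that line; if the line is absent the helper reports "no", which may be misleading on older versions.

```shell
# Hypothetical helper: read "btrfs filesystem usage" output on stdin.
# Exit status 0 if it reports "Multiple profiles: yes", 1 otherwise
# (including when the line is not present at all).
has_multiple_profiles() {
    awk '/^[[:space:]]*Multiple profiles:/ { found = ($3 == "yes") }
         END { exit (found ? 0 : 1) }'
}
```

Usage: btrfs fi usage /mnt/my-vault/ | has_multiple_profiles && echo "balance convert still needed".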

Status monitoring

A disk replacement can take several hours. Luckily it is possible to monitor the status using btrfs replace status <mount-point>.

# btrfs replace status /mnt/my-vault
Started on 24.Jul 11:02:41, finished on 24.Jul 11:41:51, 0 write errs, 0 uncorr. read errs
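
If you want a script to wait for the replace and then check the error counters, the status line can be parsed. This is a sketch under assumptions: the three function names are hypothetical, and the parsing relies on the "finished on" and "N write errs, N uncorr. read errs" wording shown above. The -1 option makes btrfs replace status print once instead of following the operation.

```shell
# Hypothetical helpers for parsing "btrfs replace status" output.

# Exit status 0 if the status text on stdin says the replace has finished.
replace_finished() {
    grep -q 'finished on'
}

# Print "<write errs> <uncorrectable read errs>" from the status text on stdin.
replace_error_counts() {
    sed -n 's/.*, \([0-9][0-9]*\) write errs, \([0-9][0-9]*\) uncorr\. read errs.*/\1 \2/p'
}

# Poll once a minute until the replace on mount point $1 has finished.
# Note: this loops forever if the replace was canceled or never started.
wait_for_replace() {
    until btrfs replace status -1 "$1" | replace_finished; do
        sleep 60
    done
}
```

Usage: wait_for_replace /mnt/my-vault, then btrfs replace status -1 /mnt/my-vault | replace_error_counts. For the output above the counts are "0 0".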

You can also see status messages in the kernel log:

# dmesg -H
[Aug 7 14:52] BTRFS info (device nvme0n1): dev_replace from /dev/nvme0n1 (devid 1) to /dev/nvme0n5 started
[ +14.731116] BTRFS info (device nvme0n1): dev_replace from /dev/nvme0n1 (devid 1) to /dev/nvme0n5 finished

Reference

The btrfs-replace reference manual can be found at https://btrfs.readthedocs.io/en/latest/btrfs-replace.html