Blog/dm-cache: Linux Accelerated Storage


DM Cache: Linux Tiered Storage

[Photo: Ta Prohm temple, Cambodia, was founded by the Khmer King Jayavarman VII as a Mahayana Buddhist monastery and center of learning dedicated to his mother.]

There are several technologies for combining large but slow HDDs with small but fast flash-based storage to improve overall performance. Some common options include:

  • Bcache: Bcache is a Linux kernel block layer cache that can use an SSD as a cache device for a slower HDD. It's known for its ease of use and seamless integration.
  • LVM Cache: The Logical Volume Manager (LVM) provides a cache feature that allows you to use an SSD as a cache device for HDDs managed by LVM.
  • dm-cache: Linux dm-cache, or Device Mapper Cache, is another option in this category. It is similar to Bcache but is a lower-level solution built on the Device Mapper framework.

If you were building a storage stack from scratch, using empty devices, it is probably a good idea to use LVM or Bcache. Bcache is the easiest to set up using the bcache-tools package, while LVM provides the most flexibility. If you are familiar with LVM or already use it, then LVM cache is a good choice.

Why dm-cache?

Like Bcache and LVM cache, the goal of dm-cache is to improve the random read/write performance of a slow HDD by using a small but fast SSD or NVMe device.

The main advantage of dm-cache is that it can be set up on devices that already have a filesystem with data on them. Both LVM and Bcache require unformatted, empty devices (there are ways around this, but they can be risky).

Compared to LVM and Bcache, dm-cache is the most low-level option. It is managed using the dmsetup tool. Being a low-level solution, it can be quite difficult to set up and manage.

dm-cache

dm-cache requires three devices:

  • Origin device: This is the large slow device that should be accelerated.
  • Cache device: This is the small but fast device that will store the cached data.
  • Metadata device: This is a small device that holds information on what origin data is cached on the cache device.
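In the example used later in this article, the cache and metadata devices are LVM logical volumes on an SSD, while the origin is a plain HDD partition. As a minimal sketch (the volume group, names and sizes are taken from that example and are not a recommendation), the two fast devices could be created with lvcreate:

# lvcreate -L 16M -n lv_cache_3t_backup_meta vg_800g
# lvcreate -L 50G -n lv_cache_3t_backup_cache vg_800g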

The Linux Kernel manual describes how to combine devices and create a mapped cache device.

dm-cache constructor table:

cache <metadata dev> <cache dev> <origin dev> <block size> <#feature args> [<feature arg>]* <policy> <#policy args> [policy args]*

dmsetup constructor options and arguments:

  • metadata dev: fast device holding the persistent metadata
  • cache dev: fast device holding cached data blocks
  • origin dev: slow device holding the original data blocks
  • block size: cache unit size in sectors
  • #feature args: number of feature arguments passed
  • feature args: writethrough or passthrough (the default is writeback)
  • policy: the replacement policy to use
  • #policy args: an even number of arguments corresponding to key/value pairs passed to the policy
  • policy args: key/value pairs passed to the policy
Optional feature arguments:

  • writethrough: write-through caching that prohibits cache block content from being different from origin block content.
  • writeback: write-back caching is the default behaviour. Writes are cached and later written back to the origin device for performance reasons.
  • passthrough: a degraded mode useful for various cache coherency situations (e.g., rolling back snapshots of underlying storage). Reads and writes always go to the origin. If a write goes to a cached origin block, the cache block is invalidated. To enable passthrough mode the cache must be clean.
  • metadata2: use version 2 of the metadata. This stores the dirty bits in a separate btree, which improves the speed of shutting down the cache.
  • no_discard_passdown: disable passing down discards from the cache to the origin's data device.
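To illustrate how feature arguments are passed, a constructor table that combines writethrough with version 2 metadata would specify two feature arguments before the policy (the device paths are placeholders, not taken from a real setup):

0 <origin size in sectors> cache <metadata dev> <cache dev> <origin dev> 256 2 writethrough metadata2 default 0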

Setup

The dmsetup tool is used to set up the mapping for a dm-cache. It is part of the LVM software stack; the source is available at https://gitlab.com/lvmteam/lvm2 and the manual at https://www.man7.org/linux/man-pages/man8/dmsetup.8.html

dmsetup create device_name --table 'constructor table'

The dm-cache constructor table is a list of options for assembling the cached device.

Let's use the following real-world example to explain the required options.

Syntax: dmsetup create ${name} --table "${startoffset} ${originsize} cache ${metadev} ${cachedev} ${origindev} ${opts}"

# /sbin/dmsetup create 3t_backup --table "0 5852532736 cache /dev/vg_800g/lv_cache_3t_backup_meta /dev/vg_800g/lv_cache_3t_backup_cache /dev/disk/by-partuuid/c810c7b3-064d-4de1-bd5a-1897d1c9b3ad 256 1 writethrough default 0"
  • name = 3t_backup: This is the device name of the new accelerated device and will be available as a virtual block device as /dev/mapper/3t_backup.
  • startoffset = 0: Starting offset in sectors of the origin device. Usually 0, which means the beginning.
  • originsize = 5852532736: This is the size of the origin device in sectors. Device-mapper tables count in 512-byte sectors: 5852532736 sectors x 512 bytes ≈ 3 TB (about 2.7 TiB). The value can be read with blockdev --getsz, as shown in the example after this list.
  • cache: This means we are creating a dm-cache device.
  • metadev = /dev/vg_800g/lv_cache_3t_backup_meta: This is a small, 16MB LVM volume, used for metadata.
  • cachedev = /dev/vg_800g/lv_cache_3t_backup_cache: This is a 50GB LVM volume on an SSD.
  • origindev = /dev/disk/by-partuuid/c810c7b3-064d-4de1-bd5a-1897d1c9b3ad: This is a 3TB HDD.
  • 256: Cache block size in sectors. 256 x 512 = 128KiB.
  • 1: Number of feature arguments is 1 (write cache mode)
  • writethrough: The writethrough write cache mode. All writes are written to the origin device (as well as the cache) before they are acknowledged to the application.
  • default: Cache migration policy. Default policy is used.
  • 0: Number of policy arguments used. 0 means no further arguments are specified.
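The sector count of the origin device does not need to be calculated by hand; blockdev reports it directly (always in 512-byte sectors), which for the 3TB disk in this example returns the value used above:

# /sbin/blockdev --getsz /dev/disk/by-partuuid/c810c7b3-064d-4de1-bd5a-1897d1c9b3ad
5852532736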

It is good to use the PARTUUID instead of a name like /dev/sdf2, because /dev/sdX assignments are not guaranteed to remain fixed across reboots, which could lead to data loss. If whole disks are used instead of partitions, /dev/disk/by-id/ can be used instead.

The partuuid mappings can be found using ls or blkid:

# ls -l /dev/disk/by-partuuid/
lrwxrwxrwx 1 root root 10 Oct 20 15:58 85507ff7-d038-41c3-99b5-8a22bbbfcdce -> ../../sdc1
lrwxrwxrwx 1 root root 10 Oct 20 15:58 ac0ae9b1-8e32-4e33-b641-998bc0298d14 -> ../../sdi1
lrwxrwxrwx 1 root root 10 Oct 20 15:58 bda5c411-5cb2-48ed-b7bf-41e63f8219f2 -> ../../sdg1
lrwxrwxrwx 1 root root 10 Oct 20 15:58 c810c7b3-064d-4de1-bd5a-1897d1c9b3ad -> ../../sdf2
lrwxrwxrwx 1 root root 10 Oct 20 15:58 d017d50e-550f-4627-9222-8d2c8c60a936 -> ../../sdf1
lrwxrwxrwx 1 root root 10 Oct 20 15:58 da237ac0-959d-4ce3-91c5-8f01e50ea65d -> ../../sde2
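blkid can also print just the PARTUUID of a specific partition, for example for /dev/sdf2 used above:

# blkid -s PARTUUID -o value /dev/sdf2
c810c7b3-064d-4de1-bd5a-1897d1c9b3ad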

Once the cache has been created, it can be used like any normal block device. Since there was already a filesystem on the origin device, it can now be mounted as usual with mount /dev/mapper/3t_backup /media/backup.

# df /media/backup
Filesystem             Size  Used Avail Use% Mounted on
/dev/mapper/3t_backup  2,9T  2,3T  566G  81% /media/backup
IMPORTANT! The dm-cache has to be re-assembled on every boot as Linux has no built-in way of doing this.

Shell script for creating a dm-cache

To make life easier, a small shell script can be used to create the dm-cache with all the appropriate options.

#!/bin/bash

######
# dm-cache setup script
######

# dm-cache name.
name="3t_backup"

# Origin device /dev/sdf2 has partuuid=c810c7b3-064d-4de1-bd5a-1897d1c9b3ad
origindev="/dev/disk/by-partuuid/c810c7b3-064d-4de1-bd5a-1897d1c9b3ad"

# The size of the origin device in sectors.
originsize="$(/sbin/blockdev --getsz "${origindev}")"

# The fast cache device. 
cachedev="/dev/vg_800g/lv_cache_3t_backup_cache"

# dm-cache metadata device.
metadev="/dev/vg_800g/lv_cache_3t_backup_meta"

# dm-cache options.
opts="256 1 writethrough default 0"

echo "creating dm-cache: ${name}"
/sbin/dmsetup create "${name}" --table "0 ${originsize} cache ${metadev} ${cachedev} ${origindev} ${opts}"
exit $?

This script could be added to the normal boot process to automatically create the cache on boot.
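How this is wired up depends on the init system. As a minimal sketch for OpenRC (assuming the script above is saved as /usr/local/sbin/dmcache-3t_backup.sh), a start script in /etc/local.d/ can assemble the cache at boot. Note that the local service runs near the end of the boot sequence, so the filesystem should be mounted from the same script (or a dedicated init script) rather than from fstab:

#!/bin/sh
# /etc/local.d/dmcache.start (hypothetical path)
# Assemble the dm-cache and mount the filesystem on top of it.
/usr/local/sbin/dmcache-3t_backup.sh && mount /dev/mapper/3t_backup /media/backup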

A note on write cache modes

In the example above, writethrough mode is used. In this mode only reads are accelerated, and the origin device always stays consistent. If the cache device fails, it should still be possible to mount the origin device directly.

With writeback mode, writes are cached on the cachedev first and written to the origin device in the background. Cached data that has not yet been written back to the origin device is referred to as dirty data. Severe data loss or corruption can occur if the cache device fails while it still holds dirty data! Make sure you have backups of important data.
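If writeback is wanted anyway, the only change to the earlier example is the feature arguments: passing zero feature arguments selects the writeback default. A sketch, reusing the opts variable from the script above:

# Writeback (the default): no feature arguments are passed.
opts="256 0 default 0"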

Statistics and information

Some useful tools to gather information about device-mapper devices:

  • dmsetup ls lists known device-mapper devices, including dm-cache and LVM devices.
  • dmsetup info <device> shows health information of a specific device.
# dmsetup info /dev/mapper/3t_backup
Name:              3t_backup
State:             ACTIVE
Read Ahead:        512
Tables present:    LIVE
Open count:        1
Event number:      2609
Major, minor:      254, 20
Number of targets: 1
  • dmsetup status <device> shows statistics and information of device-mapper devices.
# dmsetup status /dev/mapper/3t_backup
0 5852532736 cache 8 1222/4096 256 409598/409600 129884476 100609029 6359209 4746639 29320 29320 0 2 writethrough no_discard_passdown 2 migration_threshold 2048 smq 0 rw -
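The fields follow the status format of the kernel's cache target. As a quick sketch (assuming the field positions in the output above, where field 7 holds the used/total cache blocks), the cache utilisation can be extracted with awk:

# dmsetup status 3t_backup | awk '{ split($7, c, "/"); printf "cache usage: %d %%\n", 100 * c[1] / c[2] }'
cache usage: 99 %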

I wrote a small script to convert the status line into a more readable format, converting sectors and utilisation into bytes and percentages. Although it works in my setup, it does not handle situations where dmsetup status outputs a different number of fields.

The dmstats.sh script, together with an OpenRC init script, can be downloaded from https://mirrors.tnonline.net/misc/scripts/dmcache/.

# ./dmstats.sh 3t_backup
DEVICE
~~~~~~~~
Device-mapper device:         3t_backup
Origin size:                  2996496760832 bytes
Discards:                     no_discard_passdown

CACHE
~~~~~~~~
Cache Size:                   53687091200 bytes
Cache Usage:                  53686435840 bytes
Cache Usage:                  99 %
Cache Read Hit:               129884487
Cache Read Miss:              100609029
Cache Write Hit:              6359266
Cache Write Miss:             4746710
Cache Dirty:                  0 bytes
Cache Block Size:             131072 bytes
Cache Promotions:             29334
Cache Demotions:              29337
Cache Migration Threshold:    1048576 bytes
Cache RW mode:                rw
Cache Type:                   writethrough
Cache Policy:                 smq
Cache Status:                 OK

METADATA
~~~~~~~~
Metadata Size:                16777216 bytes
Metadata Usage:               5005312 bytes
Metadata Usage:               29 %

Further Reading

This has been an introduction to dm-cache, but I encourage everyone to read up more on how to manage dm-cache.

Topics that are not covered in this guide are:

  • Decommissioning a dm-cache.
  • Repairing a degraded dm-cache.
  • Calculating optimal size of the metadata device.
  • Advanced use of policies and migration thresholds.
  • Choosing optimal cache block size for various work loads.

This introduction is based on a number of resources I researched while setting up dm-cache on my own system.

I would be grateful for any feedback, comments or ideas for improvements on this article.

WARNING 1!

Using dmsetup to manage cache devices should be left to experienced users. It is a low-level tool for managing the Linux Device Mapper, and it is easy to make mistakes that lead to complete data loss!

WARNING 2!

Never use the origin device outside of the dm-cache, not even in writethrough mode. If the origin device is modified, its cached data would no longer match, which will lead to data loss or severe filesystem corruption!

If possible, use higher-level options such as LVM or Bcache, as they take care of many of these data-loss risks.