DM Cache: Linux Tiered Storage
There are several technologies for combining large but slow HDDs with small but fast flash-based storage to improve overall performance. Some common options include:
- Bcache: Bcache is a Linux kernel block layer cache that can use an SSD as a cache device for a slower HDD. It's known for its ease of use and seamless integration.
- LVM Cache: The Logical Volume Manager (LVM) provides a cache feature that allows you to use an SSD as a cache device for HDDs managed by LVM.
- dm-cache: Linux dm-cache, or Device Mapper cache, is another option in this category. It is similar to Bcache but is a lower-level solution built directly on the Device Mapper framework.
If you were building a storage stack from scratch, using empty devices, it is probably a good idea to use LVM or Bcache. Bcache is the easiest to set up using the bcache-tools package, while LVM provides the most flexibility. If you are familiar with LVM, or already use it, then LVM cache is a good choice.
Why dm-cache?
Like Bcache and LVM cache, the goal of dm-cache is to improve the random read/write performance of a slow HDD by using a small but fast SSD or NVMe device.
The main advantage of dm-cache is that it can be set up on devices that already have a filesystem with data on them. Both LVM cache and Bcache require unformatted, empty devices (there are ways around this, but they can be risky).
Compared to LVM and Bcache, dm-cache is the most low-level option. It is managed using the dmsetup tool. Being a low-level solution, the setup and management can be quite difficult to grasp.
dm-cache
dm-cache requires three devices:
- Origin device: This is the large slow device that should be accelerated.
- Cache device: This is the small but fast device that will store the cached data.
- Metadata device: This is a small device that holds information on what origin data is cached on the cache device.
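As a rough sketch of how these devices could be prepared (assuming the fast SSD is already part of an LVM volume group named vg_800g, the same name used in the example further down, and using the sizes from that example), the cache and metadata devices can be created as two small logical volumes:
# lvcreate -L 16M -n lv_cache_3t_backup_meta vg_800g
# lvcreate -L 50G -n lv_cache_3t_backup_cache vg_800g
Plain partitions on the SSD work just as well, since dm-cache only needs block devices.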
The Linux kernel documentation describes how to combine the devices and create a mapped cache device using the following constructor table:
cache <metadata dev> <cache dev> <origin dev> <block size> <#feature args> [<feature arg>]* <policy> <#policy args> [policy args]*
| Argument | Description |
|---|---|
| metadata dev | Fast device holding the persistent metadata. |
| cache dev | Fast device holding cached data blocks. |
| origin dev | Slow device holding original data blocks. |
| block size | Cache unit size in sectors. |
| #feature args | Number of feature arguments passed. |
| feature args | writethrough or passthrough (the default is writeback). |
| policy | The replacement policy to use. |
| #policy args | An even number of arguments corresponding to key/value pairs passed to the policy. |
| policy args | Key/value pairs passed to the policy. |
| Feature argument | Description |
|---|---|
| writethrough | Write-through caching that prohibits cache block content from being different from origin block content. |
| writeback | Write-back caching is the default behaviour. Writes are cached and later written back to the origin device for performance reasons. |
| passthrough | A degraded mode useful for various cache coherency situations (e.g. rolling back snapshots of underlying storage). Reads and writes always go to the origin. If a write goes to a cached origin block, then the cache block is invalidated. To enable passthrough mode the cache must be clean. |
| metadata2 | Use version 2 of the metadata. This stores the dirty bits in a separate btree, which improves the speed of shutting down the cache. |
| no_discard_passdown | Disable passing down discards from the cache to the origin's data device. |
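As a hypothetical illustration of the syntax (not the table used later in this article), a constructor table that enables two feature arguments, writethrough and metadata2, with the default policy and no policy arguments could look like this:
cache <metadata dev> <cache dev> <origin dev> 256 2 writethrough metadata2 default 0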
Setup
The dmsetup tool is used to set up the mapping for a dm-cache. It is part of the LVM2 software stack, available at https://gitlab.com/lvmteam/lvm2, with its manual at https://www.man7.org/linux/man-pages/man8/dmsetup.8.html.
dmsetup create device_name --table 'constructor table'
The dm-cache constructor table is a list of options for assembling the cached device.
Let's use the following real-world example to explain the required options.
Syntax: dmsetup create ${name} --table "${startoffset} ${originsize} cache ${metadev} ${cachedev} ${origindev} ${opts}"
# /sbin/dmsetup create 3t_backup --table 0 5852532736 cache /dev/vg_800g/lv_cache_3t_backup_meta /dev/vg_800g/lv_cache_3t_backup_cache /dev/disk/by-partuuid/c810c7b3-064d-4de1-bd5a-1897d1c9b3ad 256 1 writethrough default 0
- name = 3t_backup: This is the device name of the new accelerated device, which will be available as a virtual block device at /dev/mapper/3t_backup.
- startoffset = 0: Starting offset in sectors of the origin device. Usually 0, which means the beginning.
- originsize = 5852532736: The size of the origin device in sectors. The sector size is usually 512 bytes, but can be 4096 bytes (4Kn drives). 5852532736 sectors x 512 bytes = 2996496760832 bytes, roughly 2.7 TiB (about 3 TB).
- cache: This means we are creating a dm-cache device.
- metadev = /dev/vg_800g/lv_cache_3t_backup_meta: A small, 16 MB LVM volume used for metadata.
- cachedev = /dev/vg_800g/lv_cache_3t_backup_cache: A 50 GB LVM volume on an SSD.
- origindev = /dev/disk/by-partuuid/c810c7b3-064d-4de1-bd5a-1897d1c9b3ad: A 3 TB HDD.
- 256: Cache block size in sectors. 256 x 512 bytes = 128 KiB.
- 1: Number of feature arguments; here a single argument (the write cache mode).
- writethrough: The writethrough write cache mode. All writes are written to the origin device before returning OK to the application.
- default: Cache migration policy. The default policy is used.
- 0: Number of policy arguments. 0 means no further arguments are specified.
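After the mapping has been created, the table can be read back with dmsetup table to verify it. Note that dmsetup prints the underlying devices as major:minor numbers, so the output below is only illustrative:
# dmsetup table 3t_backup
0 5852532736 cache 254:4 254:3 8:82 256 1 writethrough default 0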
It is good to use the partuuid path instead of /dev/sdf2, because /dev/sdX mappings are not guaranteed to remain fixed across reboots, which could lead to data loss. If whole disks are used instead of partitions, /dev/disk/by-id/ can be used instead.
The partuuid mappings can be found using ls or blkid:
# ls -l /dev/disk/by-partuuid/
lrwxrwxrwx 1 root root 10 Oct 20 15:58 85507ff7-d038-41c3-99b5-8a22bbbfcdce -> ../../sdc1
lrwxrwxrwx 1 root root 10 Oct 20 15:58 ac0ae9b1-8e32-4e33-b641-998bc0298d14 -> ../../sdi1
lrwxrwxrwx 1 root root 10 Oct 20 15:58 bda5c411-5cb2-48ed-b7bf-41e63f8219f2 -> ../../sdg1
lrwxrwxrwx 1 root root 10 Oct 20 15:58 c810c7b3-064d-4de1-bd5a-1897d1c9b3ad -> ../../sdf2
lrwxrwxrwx 1 root root 10 Oct 20 15:58 d017d50e-550f-4627-9222-8d2c8c60a936 -> ../../sdf1
lrwxrwxrwx 1 root root 10 Oct 20 15:58 da237ac0-959d-4ce3-91c5-8f01e50ea65d -> ../../sde2
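Alternatively, blkid shows the PARTUUID together with the filesystem information. The UUID and filesystem type below are placeholders; only the PARTUUID matters here:
# blkid /dev/sdf2
/dev/sdf2: UUID="xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" TYPE="ext4" PARTUUID="c810c7b3-064d-4de1-bd5a-1897d1c9b3ad"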
Once the cache has been created, it can be used like any normal block device. Since there was already a filesystem on the origin device, it can now be mounted as usual with mount /dev/mapper/3t_backup /media/backup.
# df /media/backup
Filesystem             Size  Used Avail Use% Mounted on
/dev/mapper/3t_backup  2,9T  2,3T  566G  81% /media/backup
Shell script for creating a dm-cache
To make life easier, a small shell script can be used to create the dm-cache with all the appropriate options.
#!/bin/bash
######
# dm-cache setup script
######
# dm-cache name.
name="3t_backup"
# Origin device /dev/sdf2 has partuuid=c810c7b3-064d-4de1-bd5a-1897d1c9b3ad
origindev="/dev/disk/by-partuuid/c810c7b3-064d-4de1-bd5a-1897d1c9b3ad"
# The size of the origin device in sectors.
originsize="$(/sbin/blockdev --getsz "${origindev}")"
# The fast cache device.
cachedev="/dev/vg_800g/lv_cache_3t_backup_cache"
# dm-cache metadata device.
metadev="/dev/vg_800g/lv_cache_3t_backup_meta"
# dm-cache options.
opts="256 1 writethrough default 0"
echo creating dm-cache: ${name}
/sbin/dmsetup create "${name}" --table "0 ${originsize} cache ${metadev} ${cachedev} ${origindev} ${opts}"
exit $?
This script could be added to the normal boot process to automatically create the cache on boot.
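As one hedged example for OpenRC, assuming the script above has been saved as /usr/local/sbin/dmcache-3t_backup.sh (a hypothetical path), it can be started at boot from a file in /etc/local.d/:
# cat /etc/local.d/dmcache.start
#!/bin/sh
# Assemble the dm-cache and mount the cached filesystem.
# The local service runs late in the boot sequence, so keep the
# filesystem as noauto in /etc/fstab and mount it here instead.
/usr/local/sbin/dmcache-3t_backup.sh && mount /dev/mapper/3t_backup /media/backup
The .start file must be executable and the local service must be enabled for this to run at boot.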
A note on write cache modes
In the example above, writethrough mode is used. In this mode only reads are accelerated, and the origin device is always kept consistent. If the cache device fails, it should still be possible to mount the origin device directly.
With writeback mode, writes are cached on the cache device first and written to the origin device in the background. Cached data that has not yet been written back to the origin device is referred to as dirty data. Severe data loss or corruption can happen if the cache device fails while it still holds dirty data! Make sure you have backups of important data.
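If writeback mode is wanted despite this, only the opts line in the script above needs to change; a sketch (metadata2 could be added as a second feature argument to speed up shutdowns):
opts="256 1 writeback default 0"
Remember that with writeback, dirty data lives only on the SSD until it has been written back to the origin.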
Statistics and information
Some useful tools for gathering information about device-mapper devices:
- dmsetup ls lists known device-mapper devices, including dm-cache and LVM devices.
- dmsetup info <device> shows health information about a specific device.
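For example, dmsetup ls on the system above could print something like the following. The major:minor numbers are illustrative, except for 3t_backup which matches the info output below, and the exact formatting differs between dmsetup versions:
# dmsetup ls
3t_backup                           (254:20)
vg_800g-lv_cache_3t_backup_cache    (254:3)
vg_800g-lv_cache_3t_backup_meta     (254:4)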
# dmsetup info /dev/mapper/3t_backup
Name:              3t_backup
State:             ACTIVE
Read Ahead:        512
Tables present:    LIVE
Open count:        1
Event number:      2609
Major, minor:      254, 20
Number of targets: 1
The dmsetup status <device> command shows statistics and status information for device-mapper devices.
# dmsetup status /dev/mapper/3t_backup
0 5852532736 cache 8 1222/4096 256 409598/409600 129884476 100609029 6359209 4746639 29320 29320 0 2 writethrough no_discard_passdown 2 migration_threshold 2048 smq 0 rw -
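For a quick look at just the cache utilisation and the dirty block count, a one-liner such as the following can be used. It is only a sketch that assumes the field layout shown above; the field positions shift if the number of feature or policy arguments changes:
# dmsetup status 3t_backup | awk '{ split($7, u, "/"); printf "cache: %s/%s blocks (%.1f%%), dirty blocks: %s\n", u[1], u[2], 100*u[1]/u[2], $14 }'
cache: 409598/409600 blocks (100.0%), dirty blocks: 0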
I wrote a small script to convert the status line into a more readable format, converting sectors and utilisation to bytes and percentages. Although it works in my setup, it does not handle situations with a different number of output fields from dmsetup status.
The dmstats.sh script, together with an OpenRC init script, can be downloaded from https://mirrors.tnonline.net/misc/scripts/dmcache/.
# ./dmstats.sh 3t_backup
DEVICE
~~~~~~~~
Device-mapper device: 3t_backup
Origin size: 2996496760832 bytes
Discards: no_discard_passdown

CACHE
~~~~~~~~
Cache Size: 53687091200 bytes
Cache Usage: 53686435840 bytes
Cache Usage: 99 %
Cache Read Hit: 129884487
Cache Read Miss: 100609029
Cache Write Hit: 6359266
Cache Write Miss: 4746710
Cache Dirty: 0 bytes
Cache Block Size: 131072 bytes
Cache Promotions: 29334
Cache Demotions: 29337
Cache Migration Threshold: 1048576 bytes
Cache RW mode: rw
Cache Type: writethrough
Cache Policy: smq
Cache Status: OK

METADATA
~~~~~~~~
Metadata Size: 16777216 bytes
Metadata Usage: 5005312 bytes
Metadata Usage: 29 %
Further Reading
This has been an introduction to dm-cache, but I encourage everyone to read up more on how to manage dm-cache.
Topics that are not covered in this guide are:
- Decommissioning a dm-cache.
- Repairing a degraded dm-cache.
- Calculating optimal size of the metadata device.
- Advanced use of policies and migration thresholds.
- Choosing the optimal cache block size for various workloads.
Some of the resources I used to set up dm-cache on my system and write this introduction are:
- dmsetup man page: https://www.man7.org/linux/man-pages/man8/dmsetup.8.html
- Kernel Admin-guide: https://www.kernel.org/doc/html/latest/admin-guide/device-mapper/cache.html
- dmsetup source code: https://gitlab.com/lvmteam/lvm2/-/blob/main/libdm/dm-tools/dmsetup.c
- Linux Kernel source code: https://github.com/torvalds/linux/tree/master/drivers/md
- Wikipedia: https://en.m.wikipedia.org/wiki/Dm-cache
- SSD caching using dm-cache tutorial: https://blog.kylemanna.com/linux/ssd-caching-using-dmcache-tutorial/
I would be grateful for any feedback, comments or ideas for improvements on this article.
WARNING 1!
Using dmsetup to manage cache devices should be left to experienced users. It is a low-level tool for managing the Linux Device Mapper, and it is easy to make mistakes that lead to complete data loss!
WARNING 2!
Never use the origin device outside of the dm-cache, even with writethrough mode. If the origin device is modified directly, the cached data will no longer match it, which will lead to data loss or severe filesystem corruption!
If possible, use higher-level options such as LVM cache or Bcache, as they take care of many of these data-loss risks.