The case for (no) atime on Linux
Updating the access time metadata for files and directories is a much more costly operation on Btrfs compared to other filesystems. This is because the copy-on-write (COW) design always writes changes into new extents (data blocks) instead of updating data or metadata in-place.
The best way to handle this problem is to simply disable access time updates with the `noatime` mount option. Almost no software requires atime updates, and those programs that do can usually use mtime or ctime as a reference instead. However, if you do have software that really relies on access times, a solution could be to put that data on a separate subvolume and mount it with the `atime` or `relatime` mount option.
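As a concrete illustration, /etc/fstab entries along the following lines would mount the bulk of the filesystem with noatime while giving one subvolume relatime. The UUID, mount points and subvolume names below are made-up placeholders, not a recommendation for any particular layout:

```
# Sketch only: UUID, mount points and subvolume names are placeholders.
# Top-level data: no access time updates at all.
UUID=0000-placeholder  /data          btrfs  noatime,subvol=@data           0 0
# Only the subvolume whose consumers actually read atime gets relatime.
UUID=0000-placeholder  /data/maildir  btrfs  relatime,subvol=@data/maildir  0 0
```

The atime-related flags are applied per mount point by the VFS, so different subvolume mounts of the same filesystem can use different atime settings.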
On Linux, file and directory metadata tracks several different timestamps:
| Attribute | Description |
|---|---|
| atime (Access Time) | Indicates the last time a file was accessed or read. |
| mtime (Modification Time) | Indicates when the content of a file was last modified. |
| ctime (Change Time) | Represents the last time a file's metadata (such as permissions or ownership) was changed. |
| crtime (Creation Time) | Reflects the creation time of a file (inode birth time). Not all Linux filesystems support it, but Btrfs, ext4 and XFS do. |
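All four timestamps can be inspected with GNU coreutils' stat; the path below is just an arbitrary example, and the Birth field (crtime) is left blank on filesystems that do not record it:

```
# Print atime, mtime, ctime and birth time for a file (path is an example).
stat /etc/hostname

# Print only the access time, e.g. to compare before and after reading the file.
stat --format='atime: %x' /etc/hostname
```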
I recently came across a GitHub ticket about access time updates on Btrfs. The thread is quite long and detailed, and I think it serves as a good reminder to make sure one understands how timestamp updates and the various Btrfs features work together.
As we already know, Btrfs is a copy-on-write (COW) filesystem: it cannot update data in-place, but always has to write changes into a new extent (data block). This also applies to metadata updates. A metadata page is normally 16KiB (the nodesize mkfs option), and several metadata pages need to be updated in the Btrfs tree for a single timestamp update. Now consider the situation where several snapshots exist (snapshots are lazy reflinks of whole Btrfs trees) and you run something like `find`: any metadata pages shared with the snapshots will be unshared because the metadata is updated. This can lead to a huge burst of writes, slowing down the filesystem and growing overall metadata usage, in extreme cases even causing ENOSPC errors that can be difficult to recover from.
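A minimal way to observe this effect on a scratch filesystem could look like the sketch below; the mount point, subvolume and snapshot names are placeholders and have nothing to do with the ticket:

```
# Sketch only: /mnt/scratch is a hypothetical Btrfs mount using the default
# relatime, with the data in the subvolume /mnt/scratch/data.
# Take a couple of read-only snapshots so metadata pages are shared between trees.
btrfs subvolume snapshot -r /mnt/scratch/data /mnt/scratch/snap1
btrfs subvolume snapshot -r /mnt/scratch/data /mnt/scratch/snap2

# Record metadata usage before the read-only workload.
btrfs filesystem df /mnt/scratch

# A pure read workload: stat every file, which triggers atime updates in the
# source subvolume and therefore COW of the shared metadata pages.
find /mnt/scratch/data -type f -ls > /dev/null

# Compare metadata usage afterwards; remounted with noatime, it stays flat.
btrfs filesystem df /mnt/scratch
```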
In the GitHub ticket, user Zygo posted an example of a RAID1 filesystem holding 112 snapshots of a filesystem with close to 2 million files.
Filesystem before the test:

```
                               Data      Metadata System
Id Path                        RAID1     RAID1    RAID1     Unallocated
-- -------------------------- --------- -------- --------- -----------
 1 /dev/mapper/devel0617-tvdb 543.00GiB 70.00GiB   8.00MiB   126.59GiB
 2 /dev/mapper/devel0617-tvdc 543.00GiB 70.00GiB   8.00MiB   110.71GiB
-- -------------------------- --------- -------- --------- -----------
   Total                      543.00GiB 70.00GiB   8.00MiB   237.29GiB
   Used                       518.19GiB  8.65GiB 112.00KiB
```
Filesystem after running `find -type f -ls` for an hour:

```
                               Data      Metadata System
Id Path                        RAID1     RAID1    RAID1     Unallocated
-- -------------------------- --------- -------- --------- -----------
 1 /dev/mapper/devel0617-tvdb 543.00GiB 92.00GiB   8.00MiB   104.59GiB
 2 /dev/mapper/devel0617-tvdc 543.00GiB 92.00GiB   8.00MiB    88.71GiB
-- -------------------------- --------- -------- --------- -----------
   Total                      543.00GiB 92.00GiB   8.00MiB   193.29GiB
   Used                       518.19GiB 54.26GiB 112.00KiB
```
Metadata usage has increased from 8.65GiB to 54.26GiB, even though the find process has only gone through about 10% of the filesystem at this point.
Zygo continues to explain:
> To test latency for threads that want to write to this filesystem, every 5 seconds I run `time sh -c 'mkdir test; rmdir test'` on the filesystem while the above find command is running. Writes to the filesystem are blocked for 4 seconds every 10 seconds, with a 12-14 second stall every 30 seconds for transaction commit. Every ten minutes or so, there's a longer write stall lasting 2-6 minutes.
>
> The SSDs are being hit with 70 MiB/s of writes continuously, except occasionally when the kernel gets CPU bottlenecked on btrfs workqueues, and IO bandwidth drops to near zero for a while.
>
> With the exception of the mkdir and rmdir, 100% of the workload on this filesystem is reads, i.e. all of the writes come from atime updates.
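The latency probe Zygo describes can be reproduced with a small loop along these lines; the mount point is a placeholder:

```
# Sketch of the latency probe quoted above; /mnt/fs stands in for the
# Btrfs mount point under test.
cd /mnt/fs
while true; do
    # Time a tiny metadata write; long runtimes indicate write stalls,
    # e.g. waiting on a transaction commit.
    time sh -c 'mkdir test; rmdir test'
    sleep 5
done
```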
It is clear that having access time updates enabled on Btrfs is very costly. Unfortunately, the Linux VFS default of `relatime` is unlikely to change, and changing the on-disk format to make atime updates work differently is, perhaps, even less likely. This is largely because the `noatime` mount option has almost no negative effects for the vast majority of users, and those who do need access times can opt to mount specific subvolumes with them enabled.
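To check which atime behaviour a mount point currently has, or to switch an existing mount without editing fstab, something like the following works (the path is a placeholder):

```
# Show the effective mount options (noatime/relatime/...) for a mount point.
findmnt -no OPTIONS /data

# Switch an existing mount over to noatime without a reboot.
mount -o remount,noatime /data
```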
Mount options affecting time stamp updates
The Linux VFS mount options relating to time stamps are as follows:
| Option | Description | Default |
|---|---|---|
| atime | Negates the noatime option and enables atime updates according to the Linux default, which is relatime. | Yes |
| noatime | Disables atime updates for both files and directories. | No |
| strictatime | Updates atime on every access, overriding the relatime heuristic. | No |
| nostrictatime | Do not use strict atime updates; fall back to the kernel default. | No |
| relatime | Only update atime if the previous atime is older than the mtime or ctime, or if it is more than one day old. | Yes |
| lazytime | Keeps atime, mtime, and ctime updates in memory. The on-disk timestamps are updated only when the inode needs to be written out for some other change, the application calls fsync(2), syncfs(2) or sync(2), an undeleted inode is evicted from memory, or more than 24 hours have passed since the in-memory timestamps were last written to disk. | No |
| nolazytime | Do not use lazy timestamp updates. | No |
| diratime | Similar to atime, but applies specifically to directories. | Yes |
| nodiratime | Disables atime updates for directories only. | No |
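As a small illustration of the relatime rule in the table, on a filesystem mounted with the default relatime a second consecutive read does not bump atime again. This is only a sketch; the path is arbitrary and timestamp behaviour can vary between filesystems:

```
# Illustration of relatime behaviour on a relatime-mounted filesystem.
echo hello > /tmp/demo                        # write updates mtime (and ctime)
cat /tmp/demo > /dev/null                     # atime older than mtime -> atime updated
stat --format='after 1st read: %x' /tmp/demo
cat /tmp/demo > /dev/null                     # atime now newer than mtime -> no update
stat --format='after 2nd read: %x' /tmp/demo
```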
When using `lazytime`, Btrfs does not guarantee that updated timestamps not yet written to disk are preserved after a crash. A situation can occur where file data is updated but its ctime or mtime is not, which may be a serious issue for software that relies on mtime. Don't use `lazytime` unless you understand the consequences.