Blog/The case for (no) atime on Linux

From Forza's ramblings

The case for (no) atime on Linux


Updating the access time metadata for files and directories is a much more costly operation on Btrfs compared to other filesystems. This is because the copy-on-write (COW) design always writes changes into new extents (data blocks) instead of updating data or metadata in-place.

The best way to handle this problem is to simply disable access time updates using the noatime mount option. Almost no software requires atime updates, and the software that does can usually use mtime or ctime as a reference instead. However, if you do have software that really uses access times, one solution is to create a separate subvolume for that data and mount it using the atime or relatime mount option.

On Linux, every file and directory carries several timestamps in its metadata:

Attribute                  Description
-------------------------  -----------
atime (Access Time)        The last time the file was accessed or read.
mtime (Modification Time)  The last time the content of the file was modified.
ctime (Change Time)        The last time the file's metadata (such as permissions or ownership) or content was changed.
crtime (Creation Time)     The time the file was created (inode birth time). Not every Linux filesystem supports it, but Btrfs, ext4 and XFS do.
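As a quick illustration, the first three timestamps can be read from userspace. The sketch below uses Python's os.stat on a temporary file; the helper name read_timestamps is made up for this example. Python's os.stat does not expose the birth time on Linux, so crtime is left out here.

```python
import os
import tempfile
import time


def read_timestamps(path):
    """Return atime, mtime and ctime of path as seconds since the epoch."""
    st = os.stat(path)
    return {"atime": st.st_atime, "mtime": st.st_mtime, "ctime": st.st_ctime}


# Create a throwaway file and print its timestamps in local time.
with tempfile.NamedTemporaryFile() as f:
    f.write(b"hello")
    f.flush()
    for name, value in read_timestamps(f.name).items():
        stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(value))
        print(name, stamp)
```

Whether atime changes when the file is read afterwards depends on the mount options of the filesystem that holds the temporary directory.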

I recently came across a ticket about access time updates on Btrfs. The thread is quite long and detailed, and I think it serves as a good reminder to make sure that one understands how timestamp updates and various Btrfs features work together.

As we already know, Btrfs is a copy-on-write (COW) filesystem. This means it cannot update data in place, but always has to create a new extent (data block) to store the changes. This also applies to metadata updates. A metadata page is normally 16KiB (the nodesize mkfs option), and several metadata pages need to be updated in the Btrfs tree for a single timestamp update. Now consider a filesystem with several snapshots (which are lazy reflinks of whole Btrfs trees) on which you run something like find: any metadata pages shared between the snapshots are unshared because the metadata is updated. This can lead to a huge burst of writes, slowing down the filesystem and growing overall metadata usage. In extreme cases, it can even cause ENOSPC errors that are difficult to fix.
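The unsharing effect can be shown with a toy model. This is only a sketch of path copying in a generic copy-on-write tree, not Btrfs's actual B-tree code: the names Node, build, touch and count_unique, as well as the depth and fanout, are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from itertools import product
from typing import Tuple


@dataclass(frozen=True, eq=False)  # eq=False: nodes are compared by identity
class Node:
    children: Tuple["Node", ...] = ()
    atime: int = 0  # only meaningful on leaves in this toy model


def build(depth, fanout):
    """Build a full tree with the given depth and fanout."""
    if depth == 0:
        return Node()
    return Node(children=tuple(build(depth - 1, fanout) for _ in range(fanout)))


def touch(node, path, now):
    """Set atime on the leaf at path, copying every node on the way down."""
    if not path:
        return Node(children=node.children, atime=now)
    i = path[0]
    new_child = touch(node.children[i], path[1:], now)
    kids = node.children[:i] + (new_child,) + node.children[i + 1:]
    return Node(children=kids, atime=node.atime)


def count_unique(*roots):
    """Count distinct nodes reachable from the given roots."""
    seen = set()
    stack = list(roots)
    while stack:
        node = stack.pop()
        if id(node) not in seen:
            seen.add(id(node))
            stack.extend(node.children)
    return len(seen)


snapshot = build(depth=3, fanout=4)    # 1 + 4 + 16 + 64 = 85 nodes
live = snapshot                        # a fresh snapshot shares every node
print(count_unique(snapshot, live))    # 85

live = touch(live, (0, 0, 0), now=1)   # a single atime update
print(count_unique(snapshot, live))    # 89: the root-to-leaf path was copied

for path in product(range(4), repeat=3):
    live = touch(live, path, now=2)    # "find" touches every leaf
print(count_unique(snapshot, live))    # 170: nothing is shared any more
```

In this model a single atime update copies a whole root-to-leaf path, and touching every leaf doubles the number of nodes that must be kept, even though no file content changed.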

On the GitHub ticket, user Zygo put up an example of a RAID1 filesystem that has 112 snapshots of a filesystem with close to 2 million files.

Filesystem before test
                              Data      Metadata System
Id Path                       RAID1     RAID1    RAID1     Unallocated
-- -------------------------- --------- -------- --------- -----------
 1 /dev/mapper/devel0617-tvdb 543.00GiB 70.00GiB   8.00MiB   126.59GiB
 2 /dev/mapper/devel0617-tvdc 543.00GiB 70.00GiB   8.00MiB   110.71GiB
-- -------------------------- --------- -------- --------- -----------
   Total                      543.00GiB 70.00GiB   8.00MiB   237.29GiB
   Used                       518.19GiB  8.65GiB 112.00KiB
Filesystem after running find -type f -ls for an hour
                              Data      Metadata System
Id Path                       RAID1     RAID1    RAID1     Unallocated
-- -------------------------- --------- -------- --------- -----------
 1 /dev/mapper/devel0617-tvdb 543.00GiB 92.00GiB   8.00MiB   104.59GiB
 2 /dev/mapper/devel0617-tvdc 543.00GiB 92.00GiB   8.00MiB    88.71GiB
-- -------------------------- --------- -------- --------- -----------
   Total                      543.00GiB 92.00GiB   8.00MiB   193.29GiB
   Used                       518.19GiB 54.26GiB 112.00KiB

We can see that the metadata usage has increased from 8.65GiB to 54.26GiB, roughly a sixfold increase, and the find process has only gone through about 10% of the filesystem at this point.

Zygo continues to explain:

To test latency for threads that want to write to this filesystem, every 5 seconds I run time sh -c 'mkdir test; rmdir test' on the filesystem while the above find command is running. Writes to the filesystem are blocked for 4 seconds every 10 seconds, with a 12-14 second stall every 30 seconds for transaction commit. Every ten minutes or so, there's a longer write stall lasting 2-6 minutes.

The SSDs are being hit with 70 MiB/s of writes continuously, except occasionally when the kernel gets CPU bottlenecked on btrfs workqueues, and IO bandwidth drops to near zero for a while.

With the exception of the mkdir and rmdir, 100% of the workload on this filesystem is reads, i.e. all of the writes come from atime updates.
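A latency probe similar to the one Zygo describes could be sketched as follows. This is not Zygo's script (the quote uses time sh -c 'mkdir test; rmdir test'); the function names, the sample count and the short interval in the demo run are assumptions made for this example.

```python
import os
import tempfile
import time


def probe_write_latency(directory, name="test"):
    """Time one mkdir + rmdir pair inside directory, in seconds."""
    target = os.path.join(directory, name)
    start = time.monotonic()
    os.mkdir(target)
    os.rmdir(target)
    return time.monotonic() - start


def run_probe(directory, samples=3, interval=5.0):
    """Collect several latency samples, sleeping interval seconds in between."""
    results = []
    for n in range(samples):
        results.append(probe_write_latency(directory))
        if n + 1 < samples:
            time.sleep(interval)
    return results


# Demo run against a temporary directory with a short interval.
with tempfile.TemporaryDirectory() as d:
    for latency in run_probe(d, samples=3, interval=0.1):
        print(f"{latency:.6f} s")
```

To reproduce the test, point run_probe at a directory on the Btrfs filesystem under load and keep the default 5 second interval.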

It is clear that having access time updates on Btrfs is very costly. Unfortunately, the Linux VFS default of relatime is unlikely to change, and changing the on-disk format to make atime updates work differently is, perhaps, even less likely. This is because using the noatime mount option has almost no negative effects for the vast majority of users, and those who do need access times can opt to mount specific subvolumes with them enabled.
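To check which policy each mounted subvolume actually uses, the mount table can be inspected. The sketch below classifies lines in the /proc/mounts format; the function names and the sample lines (devices, paths, subvolume IDs) are hypothetical. It assumes that a mount showing neither noatime nor relatime is using strictatime.

```python
def atime_policy(mount_options):
    """Classify a comma-separated mount option string."""
    opts = set(mount_options.split(","))
    if "noatime" in opts:
        return "noatime"
    if "relatime" in opts:
        return "relatime"
    # Assumption: neither flag listed means strict atime updates.
    return "strictatime"


def policies_from_mounts(text, fstype="btrfs"):
    """Map mount point to atime policy for every mount of the given type."""
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[2] == fstype:
            result[fields[1]] = atime_policy(fields[3])
    return result


# Hypothetical mount table: root with noatime, one subvolume with relatime.
sample = """\
/dev/sda2 / btrfs rw,noatime,ssd,subvolid=256,subvol=/root 0 0
/dev/sda2 /srv/mail btrfs rw,relatime,ssd,subvolid=300,subvol=/mail 0 0
/dev/sda1 /boot ext4 rw,relatime 0 0
"""
print(policies_from_mounts(sample))
```

On a real system, pass the contents of /proc/mounts instead of the sample string.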

Mount options affecting time stamp updates

The Linux VFS mount options relating to time stamps are as follows:

Option         Default  Description
-------------  -------  -----------
atime          Yes      Negates the noatime option and enables updates according to the Linux default, which is relatime.
noatime        No       Disables the updating of atime and diratime.
strictatime    No       Enables atime updates on every access.
nostrictatime  No       Do not use strict atime.
relatime       Yes      Only update atime if the previous atime is older than the mtime or ctime, or if it is more than one day old.
lazytime       No       Delays time stamp updates to atime, mtime, and ctime. The on-disk timestamps are updated only when:
                          • the inode needs to be updated for some change unrelated to file timestamps.
                          • the application employs fsync.
                          • an undeleted inode is evicted from memory.
                          • more than 24 hours have passed since the inode was written to disk.
nolazytime     No       Do not use lazy time.
diratime       Yes      Similar to atime but specifically applies to directories.
nodiratime     No       Disables atime updates for directories only.

Note: With lazytime, Btrfs does not guarantee that updated timestamps not yet written to disk are preserved after a crash. A situation can happen where file data is updated but its ctime or mtime is not. This may be a serious issue if software relies on mtime. Don't use lazytime unless you understand the consequences.
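The relatime rule from the table can be written down as a small decision function. This is a sketch of the rule as described above, not the kernel source; the function name is made up, timestamps are plain seconds, and treating an atime equal to mtime or ctime as stale is an assumption about the exact comparison.

```python
DAY = 24 * 60 * 60  # seconds


def relatime_needs_update(atime, mtime, ctime, now):
    """Return True if a read at time now should update atime under relatime."""
    if mtime >= atime:          # file content changed since the last recorded access
        return True
    if ctime >= atime:          # metadata changed since the last recorded access
        return True
    return now - atime >= DAY   # recorded atime is at least one day old


# Modified after the last recorded read: atime is updated.
print(relatime_needs_update(atime=100, mtime=200, ctime=200, now=300))        # True
# Read again shortly after a recorded read, nothing changed: no update.
print(relatime_needs_update(atime=300, mtime=200, ctime=200, now=400))        # False
# Nothing changed, but the recorded atime is a day old: atime is updated.
print(relatime_needs_update(atime=300, mtime=200, ctime=200, now=300 + DAY))  # True
```

The third case is why a daily find or backup run over an otherwise idle tree still triggers an atime write for every file under relatime.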