From Forza's ramblings

Introduction

Imagine you have multiple applications running on your computer, and you want to ensure that one application doesn't consume all the CPU power or memory, leaving the others struggling for resources. This is where cgroups come into play. Cgroups help you set limits and control the usage of resources for different processes.

Linux Control Groups, v2

Linux Control Groups, commonly referred to as cgroups, is a feature in the Linux kernel that provides a way to organize and manage system resources for processes. It allows you to allocate and limit resources like CPU, memory, and I/O bandwidth among different processes or groups of processes.

There are two versions of Control Groups: the original v1 and the newer 'unified' v2.

With cgroup v1, the support for multiple hierarchies posed challenges, limiting flexibility and leading to complex configurations. Controllers like the freezer were confined to a single hierarchy, and once bound, controllers couldn't be moved. This inflexibility resulted in managing numerous hierarchies separately, hindering cooperation between controllers. In contrast, cgroup v2 adopts a unified hierarchy approach, addressing these issues with a more practical and streamlined configuration management.

NOTE. Unless stated otherwise, this wiki page focuses solely on the use of Control Groups v2.

The usual way of prioritising CPU between processes is to use the traditional nice tool to set a process's priority between -20 and 19, where -20 is highest priority and 19 the lowest.

With cgroups, it is possible to create a hierarchy and assign limits to each level and cgroup. These limits cover not only CPU priority (weight), but also I/O, memory and other resources. Nested cgroups are bound by their parent's limits, which makes cgroups a powerful tool for controlling system resources.

Linux Control Groups have no pre-defined names. Distributions may choose their own names, though you may also use your own naming scheme and hierarchical structure.

A cgroup can be created or removed using mkdir and rmdir.

# mkdir /sys/fs/cgroup/my-group
# mkdir /sys/fs/cgroup/my-group/group-a
# mkdir /sys/fs/cgroup/my-group/group-b
# rmdir /sys/fs/cgroup/my-group/group-a

A cgroup can only be removed when it is empty, meaning it contains no pids and has no child cgroups. Otherwise rmdir fails.

# rmdir /sys/fs/cgroup/my-group/
rmdir: failed to remove 'my-group': Device or resource busy
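When rmdir reports "Device or resource busy", it helps to check which of the two conditions is not met. The following is a minimal sketch; cgroup_is_empty is a hypothetical helper name, not part of any existing tool. It reads cgroup.procs instead of testing the file size, because files on the cgroup filesystem report a size of 0.

```shell
#!/bin/sh
# Hypothetical helper: succeeds (exit 0) if the cgroup directory given as $1
# has no member processes and no child cgroups.
cgroup_is_empty() {
    dir=$1
    # Every line in cgroup.procs is the PID of a process in this cgroup.
    if [ -n "$(cat "$dir/cgroup.procs" 2>/dev/null)" ]; then
        return 1
    fi
    # Every subdirectory is a child cgroup.
    for child in "$dir"/*/; do
        if [ -d "$child" ]; then
            return 1
        fi
    done
    return 0
}
```

As root, it could be used as cgroup_is_empty /sys/fs/cgroup/my-group && rmdir /sys/fs/cgroup/my-group.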

Here's an example showing how CPU time is shared based on each group's cpu.weight. Weights are only enforced when there is contention. If only one process asks for 100% CPU, it gets it until other processes start to compete. The diagram below shows how the CPU time (bandwidth) would be shared if every cgroup tried to use as much CPU as possible.

ROOT (usually '/sys/fs/cgroup')
├── user                   # cpu.weight: 100, effective cpu time share (40%)
│   ├── user-1000          # cpu.weight: 100, effective cpu time share (20%)
│   └── user-1001          # cpu.weight: 100, effective cpu time share (20%)
├── cgroup-1               # cpu.weight: 50, effective cpu time share (20%)
│   ├── cgroup-A           # cpu.weight: 200, effective cpu time share (6.67%)
│   └── cgroup-B           # cpu.weight: 400, effective cpu time share (13.33%)
└── cgroup-2               # cpu.weight: 100, effective cpu time share (40%)
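The percentages in the diagram follow from one rule: under full contention, a cgroup gets its parent's share multiplied by its own weight and divided by the sum of the weights of all siblings at that level. Here is a small sketch of that arithmetic. The helper name share is made up for this example, and the calculation is idealised (it assumes every cgroup is busy all the time).

```shell
#!/bin/sh
# share PARENT_PERCENT WEIGHT SIBLING_WEIGHT_SUM
# Hypothetical helper: prints the effective share in percent.
share() {
    awk -v p="$1" -v w="$2" -v s="$3" 'BEGIN { printf "%.2f\n", p * w / s }'
}

# Top level: the weights 100 + 50 + 100 sum to 250.
share 100 100 250   # user      -> 40.00
share 100 50 250    # cgroup-1  -> 20.00

# Inside cgroup-1 (20% of the total): the weights 200 + 400 sum to 600.
share 20 200 600    # cgroup-A  -> 6.67
share 20 400 600    # cgroup-B  -> 13.33
```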

cgroup controllers

A controller is responsible for managing a type of resource. The available controllers can be listed via the cgroup.controllers file.

# cat /sys/fs/cgroup/cgroup.controllers
cpuset cpu io memory hugetlb pids rdma misc

| Controller | Description |
|---|---|
| cpu | The "cpu" controller regulates distribution of CPU cycles. |
| cpuset | The "cpuset" controller provides a mechanism for constraining the CPU and memory node placement of tasks. Especially useful on NUMA systems. |
| memory | The "memory" controller regulates distribution of memory. It also tracks file cache, kernel memory and TCP sockets that a process may use. |
| io | The "io" controller regulates the distribution of I/O resources. This controller implements both weight-based and absolute bandwidth or IOPS limit distribution. Note that this works with the CFQ and BFQ I/O schedulers, but not with deadline or noop. |
| hugetlb | The "hugetlb" controller allows limiting the HugeTLB usage per control group. |
| pids | The process number controller is used to allow a cgroup to stop any new tasks from being fork()'d or clone()'d after a specified limit is reached. |
| rdma | The "rdma" controller regulates the distribution and accounting of RDMA resources. |
| misc | The miscellaneous controller provides resource limiting and tracking for resources which cannot be abstracted like the other cgroup resources. |

Which controllers are available depends on the kernel used. A full description of each controller and how to configure it is available in the Linux kernel documentation.
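Since the set of controllers differs between kernels, a script may want to check for a controller before relying on it. A minimal sketch, assuming a hypothetical helper called has_controller and the common mount point /sys/fs/cgroup:

```shell
#!/bin/sh
# has_controller NAME [FILE]
# Hypothetical helper: succeeds if NAME is listed in a cgroup.controllers file.
# FILE defaults to the root cgroup's list at the usual mount point.
has_controller() {
    file=${2:-/sys/fs/cgroup/cgroup.controllers}
    [ -r "$file" ] || return 1
    # Put each name on its own line and require an exact match,
    # so that "cpu" does not match "cpuset".
    tr ' ' '\n' < "$file" | grep -qx -- "$1"
}

if has_controller memory; then
    echo "memory controller is available"
fi
```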

cgroup interface files

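A cgroup and its controllers are configured through the interface files in the cgroup's directory. Files prefixed with cgroup. belong to the cgroup core, while the other prefixes belong to the respective controller.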

| Interface File | Description |
|---|---|
| cgroup.controllers | A list of controllers available to the cgroup. |
| cgroup.events | Events recorded for the cgroup. |
| cgroup.freeze | Freezes or unfreezes the tasks in the cgroup. |
| cgroup.kill | Writing "1" to the file kills all processes in the cgroup and in all descendant cgroups. |
| cgroup.max.depth | Maximum depth of the cgroup hierarchy below this cgroup. |
| cgroup.max.descendants | Maximum number of descendants a cgroup can have. |
| cgroup.pressure | Enables or disables pressure (PSI) accounting for the cgroup. |
| cgroup.procs | List of process IDs in the cgroup. |
| cgroup.stat | Statistics for the cgroup. |
| cgroup.subtree_control | A list of controllers enabled for child cgroups. |
| cgroup.threads | List of thread IDs in the cgroup. |
| cgroup.type | Type of cgroup. |
| cpu.idle | Enables the idle CPU scheduler for the cgroup. |
| cpu.max | Maximum bandwidth limit for the CPU. |
| cpu.max.burst | Maximum burst bandwidth for the CPU. |
| cpu.pressure | CPU pressure information. |
| cpuset.cpus | CPUs assigned to the cpuset. |
| cpuset.cpus.effective | Effective CPUs in the cpuset. |
| cpuset.cpus.partition | Partition state of the cpuset. |
| cpuset.mems | Memory nodes assigned to the cpuset. |
| cpuset.mems.effective | Effective memory nodes in the cpuset. |
| cpu.stat | CPU statistics. |
| cpu.stat.local | Throttled time for individual child cgroups. |
| cpu.weight | CPU bandwidth weight. |
| cpu.weight.nice | CPU bandwidth weight expressed as a nice value. |
| hugetlb.1GB.current | Current usage of 1GB huge pages. |
| hugetlb.1GB.events | Events for 1GB huge pages. |
| hugetlb.1GB.events.local | Local events for 1GB huge pages. |
| hugetlb.1GB.max | Maximum limit for 1GB huge pages. |
| hugetlb.1GB.numa_stat | NUMA statistics for 1GB huge pages. |
| hugetlb.1GB.rsvd.current | Current reserved 1GB huge pages. |
| hugetlb.1GB.rsvd.max | Maximum reserved 1GB huge pages. |
| hugetlb.2MB.current | Current usage of 2MB huge pages. |
| hugetlb.2MB.events | Events for 2MB huge pages. |
| hugetlb.2MB.events.local | Local events for 2MB huge pages. |
| hugetlb.2MB.max | Maximum limit for 2MB huge pages. |
| hugetlb.2MB.numa_stat | NUMA statistics for 2MB huge pages. |
| hugetlb.2MB.rsvd.current | Current reserved 2MB huge pages. |
| hugetlb.2MB.rsvd.max | Maximum reserved 2MB huge pages. |
| io.latency | I/O latency target. |
| io.max | Maximum bandwidth and IOPS limits for I/O. |
| io.pressure | I/O pressure information. |
| io.prio.class | I/O priority class. |
| io.stat | I/O statistics. |
| io.weight | I/O bandwidth weight. |
| io.bfq.weight | Weight for the BFQ I/O scheduler. |
| memory.current | Current memory usage. |
| memory.events | Memory events. |
| memory.events.local | Local memory events. |
| memory.high | Memory usage throttle limit. Usage above it is throttled and reclaimed. |
| memory.low | Best-effort memory protection. |
| memory.max | Maximum (hard) limit for memory usage. |
| memory.min | Hard memory protection. |
| memory.numa_stat | NUMA statistics for memory. |
| memory.oom.group | OOM control for memory. |
| memory.peak | Peak memory usage. |
| memory.pressure | Memory pressure information. |
| memory.reclaim | Triggers memory reclaim in the cgroup. |
| memory.stat | Memory statistics. |
| memory.swap.current | Current swap usage. |
| memory.swap.events | Swap events. |
| memory.swap.high | High swap usage threshold. |
| memory.swap.max | Maximum limit for swap usage. |
| memory.swap.peak | Peak swap usage. |
| memory.zswap.current | Current zswap usage. |
| memory.zswap.max | Maximum limit for zswap usage. |
| misc.current | Current usage for miscellaneous resources. |
| misc.events | Events for miscellaneous resources. |
| misc.max | Maximum limit for miscellaneous resources. |
| pids.current | Current number of processes. |
| pids.events | Process events. |
| pids.max | Maximum limit for the number of processes. |
| pids.peak | Peak number of processes. |
| rdma.current | Current usage for RDMA resources. |
| rdma.max | Maximum limit for RDMA resources. |

mount options

Linux Control Groups are accessed through the cgroup2 filesystem.

# mount -t cgroup2 none <MOUNT_POINT>

Many Linux distributions automatically mount it at /sys/fs/cgroup or /sys/fs/cgroup/unified.
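To find out where (and whether) cgroup2 is mounted on a given system, the mount table can be searched for the cgroup2 filesystem type. A small sketch; cgroup2_mounts is a hypothetical helper name, and the optional file argument exists only to make the function easy to try out on sample data.

```shell
#!/bin/sh
# cgroup2_mounts [FILE]
# Hypothetical helper: prints the mount point of every cgroup2 filesystem.
# FILE uses the /proc/self/mounts layout: device, mount point, type, options, ...
cgroup2_mounts() {
    awk '$3 == "cgroup2" { print $2 }' "${1:-/proc/self/mounts}"
}

if [ -r /proc/self/mounts ]; then
    cgroup2_mounts
fi
```

If nothing is printed, the system most likely uses cgroup v1 only, or the filesystem has not been mounted yet.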

cgroup v2 currently supports the following mount options:

| Mount option | Description |
|---|---|
| nsdelegate | Consider cgroup namespaces as delegation boundaries. This option is system wide and can only be set on mount or modified through remount from the init namespace. |
| favordynmods | Reduce the latencies of dynamic cgroup modifications such as task migrations and controller on/offs at the cost of making hot path operations such as forks and exits more expensive. |
| memory_localevents | Only populate memory.events with data for the current cgroup, and not any subtrees. This is legacy behaviour; the default behaviour without this option is to include subtree counts. |
| memory_recursiveprot | Recursively apply memory.min and memory.low protection to entire subtrees, without requiring explicit downward propagation into leaf cgroups. This allows protecting entire subtrees from one another, while retaining free competition within those subtrees. This should have been the default behaviour but is a mount option to avoid regressing setups relying on the original semantics (e.g. specifying bogusly high 'bypass' protection values at higher tree levels). |
| memory_hugetlb_accounting | Count HugeTLB memory usage towards the cgroup's overall memory usage for the memory controller (for the purpose of statistics reporting and memory protection). |

cgexec - Execute a Command in a cgroup

cgexec is a Bash script I wrote that allows users to execute a command within a cgroup, providing control over resource limits such as CPU, I/O, and memory. It requires Linux Control Groups (cgroups) version 2, also known as unified cgroups.

The package libcgroup also provides a cgexec command. It is also used to control cgroup hierarchies, but uses a more complex setup and methods for assigning processes.

Usage

# cgexec -h
Attaches a program <cmd> to a cgroup with defined limits.
Requires Linux Control Groups v2.
Usage: cgexec [options] <cmd> [cmd args]
 -c cpu.weight   (0-10000)  Set CPU priority
 -C cpu.max      (1-100)    Set max CPU time in percent
 -i io.weight    (1-10000)  Set I/O weight
 -m memory.high  (0-max)    Set soft memory limit
 -M memory.max   (0-max)    Set hard memory limit
 -g group        Create or attach to existing cgroup. Default is to use an ephemeral group
 -b path         Use <path> as cgroup root

| Option | Description |
|---|---|
| cpu.weight | Set CPU priority. 1-10000 where 10000 is highest priority, similar to nice -n -19. 0 is special and enables the idle CPU scheduler. |
| cpu.max | Sets the maximum allowed CPU time in percent. A value less than 100 will throttle the process periodically. |
| io.weight | Sets the I/O priority. It is similar to CPU weight and limits the bandwidth available to a process if there is contention. |
| memory.high | Sets a soft memory limit. If a process tries to allocate more, it will be throttled instead of triggering the OOM killer. |
| memory.max | Sets an absolute limit on how much memory a process can allocate. |
| group | Use an existing cgroup, or create it if it doesn't exist. Default is to use a temporary cgroup named cmd-xxxx where xxxx is a random string. |
| path | Use a specific cgroup root. Can be a nested cgroup. Useful if you want to attach the process to an existing cgroup hierarchy. |
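The kernel's cpu.max file does not take a percentage. It expects a quota and a period in microseconds, for example "50000 100000". One plausible way to map a percentage onto that format is sketched below. This is an assumption for illustration, not necessarily how the cgexec script does it, and it treats 100 percent as one full CPU. The helper name cpu_max_from_percent is made up.

```shell
#!/bin/sh
# cpu_max_from_percent PERCENT [PERIOD_US]
# Hypothetical helper: prints "<quota> <period>" in the format cpu.max accepts.
# Assumption: 100 percent corresponds to one full CPU.
cpu_max_from_percent() {
    percent=$1
    period=${2:-100000}      # 100000 us is the kernel's default period
    quota=$(( period * percent / 100 ))
    printf '%s %s\n' "$quota" "$period"
}

cpu_max_from_percent 50     # prints: 50000 100000
```

As root, the result could be written to a cgroup with cpu_max_from_percent 50 > /sys/fs/cgroup/my-group/cpu.max.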

Examples

Execute a command with default settings

cgexec echo "Hello, cgroups!"

Limit CPU and memory for a command

cgexec -c 50 -m 1G my_command

Attach to an existing cgroup

cgexec -g mygroup my_command

Use a custom cgroup root

cgexec -b /sys/fs/cgroup/mysubsystem my_command
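To confirm that a command really ended up in the intended cgroup, the kernel exposes each process's membership in /proc/<pid>/cgroup. On cgroup v2 the relevant line has the form 0::<path>. A minimal sketch for extracting that path; cgroup_of is a hypothetical helper name.

```shell
#!/bin/sh
# cgroup_of [FILE]
# Hypothetical helper: prints the cgroup v2 path from a /proc/<pid>/cgroup file.
# The v2 entry has hierarchy ID 0 and an empty controller list: "0::<path>".
cgroup_of() {
    sed -n 's/^0:://p' "${1:-/proc/self/cgroup}"
}

if [ -r /proc/self/cgroup ]; then
    cgroup_of                # cgroup of the current shell
fi
```

For another process, pass its file explicitly, for example cgroup_of /proc/5311/cgroup. The path is relative to the cgroup2 mount point.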

User Cgroups

While it is possible to give a user ownership of a cgroup, the user cannot use it directly. All processes initially belong to the root cgroup (/sys/fs/cgroup/cgroup.procs), and even though users can write to their own cgroup, they cannot move their processes out of the root cgroup.

This catch-22 can be solved by using sudo or a root user to create a cgroup, change the owner to a user and then move that user's processes to it.

One solution is to let the user start a screen or tmux session and then as root, check what pids belong to the user session and add them to the user cgroup:

# ps af -u forza
5309 pts/0    S+     0:00  |     \_ screen
5310 ?        Ss     0:00  |         \_ SCREEN
5311 pts/1    Ss     0:00  |             \_ -/bin/bash
5483 pts/1    R+     0:00  |                 \_ ps af -u forza

# echo 5309 > /sys/fs/cgroup/user/forza/main/cgroup.procs
# echo 5310 > /sys/fs/cgroup/user/forza/main/cgroup.procs
# echo 5311 > /sys/fs/cgroup/user/forza/main/cgroup.procs

Now, any process started from that session will automatically belong to the same cgroup, because child processes inherit the cgroup of their parent.
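Writing each pid by hand gets tedious when a session has many processes. The loop below is a rough sketch with a hypothetical helper called move_pids. It writes one pid per redirection, since cgroup.procs accepts a single pid per write.

```shell
#!/bin/sh
# move_pids TARGET_FILE PID...
# Hypothetical helper: writes each PID to TARGET_FILE (a cgroup.procs file),
# one write per PID. A failed write is reported and the loop continues.
move_pids() {
    target=$1
    shift
    for pid in "$@"; do
        echo "$pid" > "$target" || echo "could not move pid $pid" >&2
    done
}

# Example, as root (moves every process owned by the user forza):
#   move_pids /sys/fs/cgroup/user/forza/main/cgroup.procs $(pgrep -u forza)
```

Note that pgrep -u forza selects all of the user's processes, not only those of the screen session shown above.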

The user can now also use cgexec to create nested cgroups under their own cgroup.