Linux/cgexec
Introduction
Imagine you have multiple applications running on your computer, and you want to ensure that one application doesn't consume all the CPU power or memory, leaving the others struggling for resources. This is where cgroups come into play. Cgroups help you set limits and control the usage of resources for different processes.
Linux Control Groups, v2
Linux Control Groups, commonly referred to as cgroups, is a feature in the Linux kernel that provides a way to organize and manage system resources for processes. It allows you to allocate and limit resources like CPU, memory, and I/O bandwidth among different processes or groups of processes.
There are two versions of Control Groups: the original v1 and the newer 'unified' v2.
With cgroup v1, the support for multiple hierarchies posed challenges, limiting flexibility and leading to complex configurations. Controllers like the freezer were confined to a single hierarchy, and once bound, controllers couldn't be moved. This inflexibility resulted in managing numerous hierarchies separately, hindering cooperation between controllers. In contrast, cgroup v2 adopts a unified hierarchy approach, addressing these issues with a more practical and streamlined configuration management.
The usual way of prioritising CPU time between processes is to use the traditional nice tool to set a process's priority between -20 and 19, where -20 is the highest priority and 19 the lowest.
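For comparison, a brief sketch of how nice is typically used (the command and pid here are placeholders for illustration):
# nice -n 10 tar -czf /tmp/backup.tar.gz /home   # start a command with lowered priority
# renice -n -5 -p 1234                           # raise the priority of an already running pid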
With cgroups, it is possible to create a hierarchy and assign limits, not only CPU priorities (weights) but also I/O, memory and other types of limits, to each level and cgroup. Nested cgroups are bound by their parents' limits, which makes cgroups a powerful tool for controlling system resources.
Linux Control Groups have no pre-defined names. Distributions may choose their own names, though you may also use your own naming scheme and hierarchical structure.
A cgroup can be created or removed using mkdir and rmdir.
# mkdir /sys/fs/cgroup/my-group
# mkdir /sys/fs/cgroup/my-group/group-a
# mkdir /sys/fs/cgroup/my-group/group-b
# rmdir /sys/fs/cgroup/my-group/group-a
If the cgroup is not empty (it still contains pids or child cgroups), it cannot be removed.
# rmdir /sys/fs/cgroup/my-group/
rmdir: failed to remove 'my-group': Device or resource busy
Here's an example showing how CPU time is shared based on each group's cpu.weight. Weights are only enforced when there is contention. If only one process asks for 100% CPU, it will have it until other processes start to compete. The diagram below shows how the CPU time (bandwidth) would be shared if each cgroup tried to use maximum CPU.
ROOT (usually '/sys/fs/cgroup')
├── user                # cpu.weight: 100, effective cpu time share (40%)
│   ├── user-1000       # cpu.weight: 100, effective cpu time share (20%)
│   └── user-1001       # cpu.weight: 100, effective cpu time share (20%)
├── cgroup-1            # cpu.weight: 50,  effective cpu time share (20%)
│   ├── cgroup-A        # cpu.weight: 200, effective cpu time share (6.67%)
│   └── cgroup-B        # cpu.weight: 400, effective cpu time share (13.33%)
└── cgroup-2            # cpu.weight: 100, effective cpu time share (40%)
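As a hedged sketch, the weights for cgroup-1 in the diagram above could be set up like this, assuming the cpu controller is available and the hierarchy is not already managed by a service manager such as systemd:
# echo "+cpu" > /sys/fs/cgroup/cgroup.subtree_control            # enable the cpu controller for children of the root
# mkdir /sys/fs/cgroup/cgroup-1
# echo 50 > /sys/fs/cgroup/cgroup-1/cpu.weight
# echo "+cpu" > /sys/fs/cgroup/cgroup-1/cgroup.subtree_control   # enable it one level further down
# mkdir /sys/fs/cgroup/cgroup-1/cgroup-A
# echo 200 > /sys/fs/cgroup/cgroup-1/cgroup-A/cpu.weight
# mkdir /sys/fs/cgroup/cgroup-1/cgroup-B
# echo 400 > /sys/fs/cgroup/cgroup-1/cgroup-B/cpu.weight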
cgroup controllers
A controller is responsible for managing a type of resource. The available controllers can be listed via the cgroup.controllers file:
# cat /sys/fs/cgroup/cgroup.controllers
cpuset cpu io memory hugetlb pids rdma misc
Controller | Description |
---|---|
cpu | The "cpu" controller regulates the distribution of CPU cycles. |
cpuset | The "cpuset" controller provides a mechanism for constraining the CPU and memory node placement of tasks. Especially useful on NUMA systems. |
memory | The "memory" controller regulates the distribution of memory. It also tracks file cache, kernel memory and TCP sockets that a process may use. |
io | The "io" controller regulates the distribution of I/O resources. It implements both weight-based and absolute bandwidth or IOPS limit distribution. Note that weight-based distribution works with I/O schedulers such as BFQ, but not with deadline or noop. |
hugetlb | The "hugetlb" controller allows limiting HugeTLB usage per control group. |
pids | The process number controller is used to allow a cgroup to stop any new tasks from being fork()'d or clone()'d after a specified limit is reached. |
rdma | The "rdma" controller regulates the distribution and accounting of RDMA resources. |
misc | The miscellaneous cgroup provides resource limiting and tracking for resources which cannot be abstracted like the other cgroup resources. |
The number of available controllers depends on the kernel being used. A full description of each controller and how to configure it is available in the Linux kernel documentation: https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html
cgroup interface files
A cgroup controller can be configured through its interface files.
Interface File | Description |
---|---|
cgroup.controllers | A list of controllers enabled for the cgroup. |
cgroup.events | Events recorded for the cgroup. |
cgroup.freeze | Freezes or unfreezes the tasks in the cgroup. |
cgroup.kill | Writing "1" to the file kills all processes in the cgroup and all descendant cgroups. |
cgroup.max.depth | Maximum depth of the cgroup hierarchy. |
cgroup.max.descendants | Maximum number of descendants a cgroup can have. |
cgroup.pressure | Enables or disables pressure stall information (PSI) accounting for the cgroup. |
cgroup.procs | List of process IDs in the cgroup. |
cgroup.stat | Statistics for the cgroup. |
cgroup.subtree_control | A list of controllers enabled for child cgroups. |
cgroup.threads | List of thread IDs in the cgroup. |
cgroup.type | Type of cgroup. |
cpu.idle | CPU idle scheduler. |
cpu.max | Maximum bandwidth limit for the CPU. |
cpu.max.burst | Maximum burst bandwidth for the CPU. |
cpu.pressure | CPU pressure events. |
cpuset.cpus | CPUs assigned to the cpuset. |
cpuset.cpus.effective | Effective CPUs in the cpuset. |
cpuset.cpus.partition | Partitioned CPUs in the cpuset. |
cpuset.mems | Memory nodes assigned to the cpuset. |
cpuset.mems.effective | Effective memory nodes in the cpuset. |
cpu.stat | CPU statistics. |
cpu.stat.local | Throttled time for individual child cgroups. |
cpu.weight | CPU bandwidth weight. |
cpu.weight.nice | Nice-adjusted CPU bandwidth weight. |
hugetlb.1GB.current | Current usage of 1GB huge pages. |
hugetlb.1GB.events | Events for 1GB huge pages. |
hugetlb.1GB.events.local | Local events for 1GB huge pages. |
hugetlb.1GB.max | Maximum limit for 1GB huge pages. |
hugetlb.1GB.numa_stat | NUMA statistics for 1GB huge pages. |
hugetlb.1GB.rsvd.current | Current reserved 1GB huge pages. |
hugetlb.1GB.rsvd.max | Maximum reserved 1GB huge pages. |
hugetlb.2MB.current | Current usage of 2MB huge pages. |
hugetlb.2MB.events | Events for 2MB huge pages. |
hugetlb.2MB.events.local | Local events for 2MB huge pages. |
hugetlb.2MB.max | Maximum limit for 2MB huge pages. |
hugetlb.2MB.numa_stat | NUMA statistics for 2MB huge pages. |
hugetlb.2MB.rsvd.current | Current reserved 2MB huge pages. |
hugetlb.2MB.rsvd.max | Maximum reserved 2MB huge pages. |
io.latency | I/O latency. |
io.max | Maximum bandwidth limit for I/O. |
io.pressure | I/O pressure events. |
io.prio.class | I/O priority class. |
io.stat | I/O statistics. |
io.weight | I/O bandwidth weight. |
io.bfq.weight | Weight for the BFQ I/O scheduler. |
memory.current | Current memory usage. |
memory.events | Memory events. |
memory.events.local | Local memory events. |
memory.high | High memory usage threshold. |
memory.low | Low memory usage threshold. |
memory.max | Maximum limit for memory usage. |
memory.min | Minimum limit for memory usage. |
memory.numa_stat | NUMA statistics for memory. |
memory.oom.group | OOM control for memory. |
memory.peak | Peak memory usage. |
memory.pressure | Memory pressure events. |
memory.reclaim | Triggers memory reclaim; write the amount of memory to reclaim to this file. |
memory.stat | Memory statistics. |
memory.swap.current | Current swap usage. |
memory.swap.events | Swap events. |
memory.swap.high | High swap usage threshold. |
memory.swap.max | Maximum limit for swap usage. |
memory.swap.peak | Peak swap usage. |
memory.zswap.current | Current zswap usage. |
memory.zswap.max | Maximum limit for zswap usage. |
misc.current | Current usage for miscellaneous resources. |
misc.events | Events for miscellaneous resources. |
misc.max | Maximum limit for miscellaneous resources. |
pids.current | Current number of processes. |
pids.events | Process events. |
pids.max | Maximum limit for the number of processes. |
pids.peak | Peak number of processes. |
rdma.current | Current usage for RDMA resources. |
rdma.max | Maximum limit for RDMA resources. |
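As an illustration of how these interface files are used, here is a minimal sketch, assuming the my-group cgroup from earlier still exists and has the cpu and memory controllers enabled:
# echo "50000 100000" > /sys/fs/cgroup/my-group/cpu.max   # allow 50ms of CPU time per 100ms period (50%)
# echo 512M > /sys/fs/cgroup/my-group/memory.max          # hard memory limit of 512 MiB
# echo $$ > /sys/fs/cgroup/my-group/cgroup.procs          # move the current shell into the cgroup
# cat /sys/fs/cgroup/my-group/cpu.stat                    # read the cgroup's CPU statistics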
mount options
Linux Control Groups are accessed through the cgroup2 filesystem.
# mount -t cgroup2 none <MOUNT_POINT>
Many Linux distributions automatically mount it at /sys/fs/cgroup or /sys/fs/cgroup/unified.
cgroup v2 currently supports the following mount options:
mount option | Description |
---|---|
nsdelegate | Consider cgroup namespaces as delegation boundaries. This option is system wide and can only be set on mount or modified through remount from the init namespace. |
favordynmods | Reduce the latencies of dynamic cgroup modifications such as task migrations and controller on/offs at the cost of making hot path operations such as forks and exits more expensive. |
memory_localevents | Only populate memory.events with data for the current cgroup, and not any subtrees. This is legacy behaviour; the default behaviour without this option is to include subtree counts. |
memory_recursiveprot | Recursively apply memory.min and memory.low protection to entire subtrees, without requiring explicit downward propagation into leaf cgroups. This allows protecting entire subtrees from one another, while retaining free competition within those subtrees. This should have been the default behavior but is a mount-option to avoid regressing setups relying on the original semantics (e.g. specifying bogusly high 'bypass' protection values at higher tree levels) |
memory_hugetlb_accounting | Count HugeTLB memory usage towards the cgroup's overall memory usage for the memory controller (for the purpose of statistics reporting and memory protection). |
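For example, mounting the filesystem with some of these options could look like this (a sketch; pick the options that suit your system):
# mount -t cgroup2 -o nsdelegate,memory_recursiveprot none /sys/fs/cgroup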
cgexec - Execute a Command in a cgroup
cgexec is a Bash script I wrote that allows users to execute a command within a cgroup, providing control over resource limits such as CPU, I/O, and memory. It requires Linux Control Groups (cgroups) version v2, which is also known as unified cgroups.
It should not be confused with the cgexec command provided by the libcgroup package. It is also used to control cgroup hierarchies, but uses a more complex setup and methods for assigning processes.
Usage
# cgexec -h
Attaches a program <cmd> to a cgroup with defined limits.
Requires Linux Control Groups v2.

Usage: cgexec [options] <cmd> [cmd args]

Options:
  -c cpu.weight (0-10000)  Set CPU priority
  -C cpu.max (1-100)       Set max CPU time in percent
  -i io.weight (1-10000)   Set I/O weight
  -m memory.high (0-max)   Set soft memory limit
  -M memory.max (0-max)    Set hard memory limit
  -g group                 Create or attach to existing cgroup. Default is to use an ephemeral group
  -b path                  Use <path> as cgroup root
Option | Description |
---|---|
cpu.weight | Sets the CPU priority. 1-10000, where 10000 is the highest priority, similar to nice -n -19. 0 is special and enables the idle CPU scheduler. |
cpu.max | Sets the maximum allowed CPU time as a percentage. A value less than 100 will throttle the process periodically. |
io.weight | Sets the I/O priority. It is similar to the CPU weight and limits the bandwidth available to a process if there is contention. |
memory.high | Sets a soft memory limit. If a process tries to allocate more, it will be throttled instead of triggering the OOM killer. |
memory.max | Sets an absolute limit on how much memory a process can allocate. |
group | Use an existing cgroup, or create it if it doesn't exist. Default is to use a temporary cgroup named cmd-xxxx, where xxxx is a random string. |
path | Use a specific cgroup root. Can be a nested cgroup. Useful if you want to attach the process to an existing cgroup hierarchy. |
Examples
Execute a command with default settings
cgexec echo "Hello, cgroups!"
Limit CPU and memory for a command
cgexec -c 50 -m 1G my_command
Attach to an existing cgroup
cgexec -g mygroup my_command
Use a custom cgroup root
cgexec -b /sys/fs/cgroup/mysubsystem my_command
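The options can also be combined. A hypothetical example that caps my_command at 25% CPU time and a hard memory limit of 512M:
cgexec -C 25 -M 512M my_command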
User Cgroups
While it is possible to give a user ownership of a cgroup, the user cannot directly use it, because all processes initially belong to the root cgroup (listed in /sys/fs/cgroup/cgroup.procs), and even though a user can write to their own cgroup, they cannot remove themselves from the root cgroup.
This catch-22 can be solved by using sudo or a root shell to create a cgroup, change its owner to the user, and then move that user's processes into it.
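A minimal sketch of that setup, assuming the same /sys/fs/cgroup/user/forza/main hierarchy as in the example below and that the wanted controllers are enabled at each level above it:
# mkdir -p /sys/fs/cgroup/user/forza/main
# echo "+cpu +memory" > /sys/fs/cgroup/user/cgroup.subtree_control   # delegate controllers to the user's cgroups
# chown -R forza:forza /sys/fs/cgroup/user/forza                     # hand ownership to the user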
One solution is to let the user start a screen or tmux session and then, as root, check which pids belong to the user's session and add them to the user's cgroup:
# ps af -u forza
 5309 pts/0    S+     0:00  |   \_ screen
 5310 ?        Ss     0:00  |       \_ SCREEN
 5311 pts/1    Ss     0:00  |           \_ -/bin/bash
 5483 pts/1    R+     0:00  |               \_ ps af -u forza
# echo 5309 > /sys/fs/cgroup/user/forza/main/cgroup.procs
# echo 5310 > /sys/fs/cgroup/user/forza/main/cgroup.procs
# echo 5311 > /sys/fs/cgroup/user/forza/main/cgroup.procs
Now, any process started from within that session will automatically belong to the same cgroup.
The user can now also use cgexec to create nested cgroups under their own cgroup.
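For example, the user could point cgexec at their own delegated cgroup with the -b option (paths and command names as in the sketches above):
cgexec -b /sys/fs/cgroup/user/forza/main -c 200 my_command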