---
layout: post
author: Sam Hadow
---

On my server I self-host quite a lot of services, but I only have 5900 rpm HDDs for the data, and a single SSD for the OS and binaries. Sometimes these HDDs struggle to keep up with the I/O operations of all my services. So in this short blog post I'll show you the troubleshooting steps to find the culprit of high disk I/O, and how to limit its disk usage.

## Check disk usage

To check disk usage we can use the tool iostat *(provided by the package `sysstat` on Fedora, Debian and Arch Linux)* to see the extended stats every second:

```bash
iostat -x 1
```

You'll then get an output like this:

```bash
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           4.37    0.00    6.94   21.85    0.00   66.84

Device    r/s  rkB/s rrqm/s %rrqm r_await rareq-sz    w/s   wkB/s wrqm/s %wrqm w_await wareq-sz  d/s dkB/s drqm/s %drqm d_await dareq-sz  f/s f_await aqu-sz  %util
dm-0     0.00   0.00   0.00  0.00    0.00     0.00   0.00    0.00   0.00  0.00    0.00     0.00 0.00  0.00   0.00  0.00    0.00     0.00 0.00    0.00   0.00   0.00
dm-1     0.00   0.00   0.00  0.00    0.00     0.00   0.00    0.00   0.00  0.00    0.00     0.00 0.00  0.00   0.00  0.00    0.00     0.00 0.00    0.00   0.00   0.00
dm-2     0.00   0.00   0.00  0.00    0.00     0.00   0.00    0.00   0.00  0.00    0.00     0.00 0.00  0.00   0.00  0.00    0.00     0.00 0.00    0.00   0.00   0.00
dm-3     0.00   0.00   0.00  0.00    0.00     0.00 524.00 8384.00   0.00  0.00   47.82    16.00 0.00  0.00   0.00  0.00    0.00     0.00 0.00    0.00  25.06 100.00
dm-4     0.00   0.00   0.00  0.00    0.00     0.00   0.00    0.00   0.00  0.00    0.00     0.00 0.00  0.00   0.00  0.00    0.00     0.00 0.00    0.00   0.00   0.00
dm-5     0.00   0.00   0.00  0.00    0.00     0.00   0.00    0.00   0.00  0.00    0.00     0.00 0.00  0.00   0.00  0.00    0.00     0.00 0.00    0.00   0.00   0.00
sda      0.00   0.00   0.00  0.00    0.00     0.00 524.00 8384.00   0.00  0.00   46.02    16.00 0.00  0.00   0.00  0.00    0.00     0.00 0.00    0.00  24.11  99.30
sdb      0.00   0.00   0.00  0.00    0.00     0.00   0.00    0.00   0.00  0.00    0.00     0.00 0.00  0.00   0.00  0.00    0.00     0.00 0.00    0.00   0.00   0.00
sdc      0.00   0.00   0.00  0.00    0.00     0.00   0.00    0.00   0.00  0.00    0.00     0.00 0.00  0.00   0.00  0.00    0.00     0.00 0.00    0.00   0.00   0.00
zram0    1.00   4.00   0.00  0.00    0.00     4.00   0.00    0.00   0.00  0.00    0.00     0.00 0.00  0.00   0.00  0.00    0.00     0.00 0.00    0.00   0.00   0.00
```

Let's explain each column:

| Column | Meaning |
| :- | :- |
| **Device** | The **block device** name (e.g. `sda`, `dm-0`, etc.). |
| **r/s** | Number of **read requests per second** issued to the device. |
| **rkB/s** | Amount of **data read per second**, in kilobytes. |
| **rrqm/s** | Number of **merged read requests per second** (the kernel merges adjacent reads into a single I/O). |
| **%rrqm** | Percentage of read requests merged, calculated as `100 * rrqm/s / (r/s + rrqm/s)`. |
| **r_await** | Average time (in milliseconds) for read requests to be served, including both queue time and service time. |
| **rareq-sz** | Average size (in kilobytes) of each read request. |
| **w/s** | Number of **write requests per second** issued to the device. |
| **wkB/s** | Amount of **data written per second**, in kilobytes. |
| **wrqm/s** | Number of **merged write requests per second**. |
| **%wrqm** | Percentage of write requests that were merged, calculated in a similar way to `%rrqm`. |
| **w_await** | Average time (ms) for write requests to complete. |
| **wareq-sz** | Average size (kB) of each write request. |
| **d/s** | Number of **discard requests per second** (TRIM / UNMAP commands, mostly on SSDs). |
| **dkB/s** | Amount of data discarded per second (in kB). |
| **drqm/s** | Merged discard requests per second. |
| **%drqm** | Percentage of discard requests merged. |
| **d_await** | Average time (ms) for discard requests to complete. |
| **dareq-sz** | Average discard request size (kB). |
| **f/s** | Number of **flush requests per second**; these force buffered data to non-volatile storage. |
| **f_await** | Average time (ms) for flush requests to complete. |
| **aqu-sz** | **Average queue size**: the average number of I/O requests waiting in the queue or being serviced during the sample interval. |
| **%util** | **Percentage of time the device was busy** processing I/O requests. Values near 100% indicate full utilization, but it can exceed 100% with multi-queue devices or parallel I/O. |

The most interesting columns for us are:

+ **%util**: high means the device is saturated (I/O-bound).
+ **r_await**, **w_await**, **f_await**: high means requests are waiting a long time (latency issue).
+ **aqu-sz**: high means the I/O operations queue is growing.
+ **r/s**, **w/s**: high means lots of small I/O operations.

*fun fact: Although iostat displays units corresponding to kilobytes (kB), megabytes (MB)..., it actually uses kibibytes (KiB), mebibytes (MiB)... A kibibyte is equal to 1024 bytes, and a mebibyte is equal to 1024 kibibytes. [source](https://man.archlinux.org/man/iostat.1.en)*
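In the example above, `dm-3` and the underlying `sda` are clearly the bottleneck: `%util` is at or near 100% and `aqu-sz` is large. Once you have identified the busy devices, iostat also accepts device names as arguments, so you can watch just those while investigating (a small convenience; the device names here are simply the ones from my example output):

```bash
# extended device stats for sda and dm-3 only, refreshed every second
iostat -xd sda dm-3 1
```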
### note

In the previous example, the `dm-*` devices are actually virtual block devices managed by LVM (the device mapper). To identify which physical volumes they correspond to, we can run this command:

```bash
ls -l /dev/mapper
```

Or this command as root:

```bash
dmsetup ls --tree
```

## Find the process causing high I/O usage

For that we can use the iotop command *(package is named `iotop` on Fedora, Debian and Arch Linux)*. We usually need to run iotop as root, as it needs elevated privileges. With the following options it's easier to spot processes causing high I/O usage:

```bash
sudo iotop -aoP
```

What these options do:

+ `-a` = accumulated I/O since start
+ `-o` = only show processes actually doing I/O
+ `-P` = show per-process, not per-thread

We can also use pidstat *(provided by the package `sysstat` on Fedora, Debian and Arch Linux)*. It's better to run this command as root, otherwise you'll only see the processes of the user running the command, not all processes.

To show per-process read/write operations, updating every second:

```bash
pidstat -d 1
```

We can then write down the PID, or the command, corresponding to the line with a lot of disk writes or disk reads.

## Limit disk usage

### podman

With podman we can use arguments in the run command to limit disk I/O, as mentioned in the [documentation](https://docs.podman.io/en/stable/markdown/podman-run.1.html):

| Argument | Effect |
| :- | :- |
| **--device-read-bps=path:rate** | Limit read rate (in bytes per second) from a device *(e.g. --device-read-bps=/dev/sda:1mb)*. |
| **--device-read-iops=path:rate** | Limit read rate (in IO operations per second) from a device *(e.g. --device-read-iops=/dev/sda:1000)*. |
| **--device-write-bps=path:rate** | Limit write rate (in bytes per second) to a device *(e.g. --device-write-bps=/dev/sda:1mb)*. |
| **--device-write-iops=path:rate** | Limit write rate (in IO operations per second) to a device *(e.g. --device-write-iops=/dev/sda:1000)*. |

> These may not work in rootless mode unless I/O delegation is enabled.

You can verify which resource limit delegations are enabled with this command:

```bash
cat "/sys/fs/cgroup/user.slice/user-$(id -u).slice/user@$(id -u).service/cgroup.controllers"
```

In our case we need `io` in the output.
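If `io` is listed (or if you run podman as root), the flags above can be used directly. As a quick sketch using the documented arguments (the image name and the rates are placeholders, not from my actual setup), throttling a container's writes to `/dev/sda` could look like this:

```bash
# cap writes at 10 MB/s and 500 write operations per second on /dev/sda
podman run -d \
  --device-write-bps=/dev/sda:10mb \
  --device-write-iops=/dev/sda:500 \
  docker.io/library/nginx:latest
```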
If `io` is not present in that output, you can create the file `/etc/systemd/system/user@.service.d/delegate.conf` with the following content:

```ini
[Service]
Delegate=io
```

You can also add the other resource limits you want to delegate to users, for example memory, pids, cpu and cpuset; the file would then look like this:

```ini
[Service]
Delegate=io memory pids cpu cpuset
```

You then need to log out and log back in for the new delegations to take effect.

### systemd

To limit disk I/O for a systemd service we can use slices.

The most useful options we can put in the `[Slice]` section are the following; you can see all the available options in the [documentation](https://www.freedesktop.org/software/systemd/man/latest/systemd.resource-control.html):

| Property | Description |
| :- | :- |
| **`IOAccounting=`** | Enables collection of I/O statistics (used by `systemd-cgtop`, `systemd-analyze`, etc.). |
| **`IOWeight=weight`** | Sets *relative* I/O priority (1–10000, default 100). A higher value gives the unit a larger share of available bandwidth when multiple units compete. |
| **`IODeviceWeight=device weight`** | Assigns a per-device weight, overriding `IOWeight` for that device. |
| **`IOReadBandwidthMax=device bytes`** | Sets an **absolute** cap on read bandwidth, e.g. `/dev/sda 10M`. Units cannot exceed this, even if idle bandwidth exists. Possible suffixes are K, M, G, or T for kilobytes, megabytes, gigabytes, or terabytes, respectively; otherwise the bandwidth is parsed in bytes/s. |
| **`IOWriteBandwidthMax=device bytes`** | Same, but for write bandwidth. |
| **`IOReadIOPSMax=device limit`** | Caps the number of read operations per second, e.g. `/dev/nvme0n1 500`. |
| **`IOWriteIOPSMax=device limit`** | Caps the number of write operations per second. |

To create a slice unit, for example `/etc/systemd/system/io-limited.slice`:

```ini
[Unit]
Description=Slice for IO-limited services

[Slice]
IOAccounting=yes
IOWriteBandwidthMax=/dev/sda 20M
IOReadBandwidthMax=/dev/sda 20M
```

We can then assign services to this slice; in the `[Service]` section of a unit we would have:

```ini
[Service]
Slice=io-limited.slice
```
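As a final sketch, assuming a hypothetical `myservice.service` already exists, you can attach it to the slice with a drop-in instead of editing the unit file directly, then confirm the assignment and watch the accounting:

```bash
# drop-in that moves the (hypothetical) service into the io-limited slice
sudo mkdir -p /etc/systemd/system/myservice.service.d
printf '[Service]\nSlice=io-limited.slice\n' \
  | sudo tee /etc/systemd/system/myservice.service.d/10-slice.conf

sudo systemctl daemon-reload
sudo systemctl restart myservice.service

# confirm the slice assignment, then watch per-unit I/O (needs IOAccounting=yes)
systemctl show -p Slice myservice.service
sudo systemd-cgtop
```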