---
layout: post
author: Sam Hadow
---

On my server I self-host quite a lot of services, but the data lives on 5900rpm HDDs; only the OS and binaries are on an SSD.
Sometimes these HDDs struggle to keep up with the I/O generated by all my services.
So in this short blog post I'll show you the troubleshooting steps to find the culprit of high disk I/O, and how to limit its disk usage.

## Check disk usage

To check disk usage we can use the tool iostat *(provided by the package `sysstat` on Fedora, Debian and Arch Linux)*.

To see the extended stats every second:

```bash
iostat -x 1
```

You'll then get an output like this:

```
avg-cpu: %user %nice %system %iowait %steal %idle
4.37 0.00 6.94 21.85 0.00 66.84

Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz w/s wkB/s wrqm/s %wrqm w_await wareq-sz d/s dkB/s drqm/s %drqm d_await dareq-sz f/s f_await aqu-sz %util
dm-0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-3 0.00 0.00 0.00 0.00 0.00 0.00 524.00 8384.00 0.00 0.00 47.82 16.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 25.06 100.00
dm-4 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-5 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sda 0.00 0.00 0.00 0.00 0.00 0.00 524.00 8384.00 0.00 0.00 46.02 16.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 24.11 99.30
sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdc 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
zram0 1.00 4.00 0.00 0.00 0.00 4.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
```

Let's explain each column:

| Column | Meaning |
| :- | :- |
| **Device** | The **block device** name (e.g. `sda`, `dm-0`, etc.). |
| **r/s** | Number of **read requests per second** issued to the device. |
| **rkB/s** | Amount of **data read per second**, in kilobytes. |
| **rrqm/s** | Number of **merged read requests per second** (the kernel merges adjacent reads into a single I/O). |
| **%rrqm** | Percentage of read requests merged, calculated as `100 * rrqm/s / (r/s + rrqm/s)`. |
| **r_await** | Average time (in milliseconds) for read requests to be served, including both queue time and service time. |
| **rareq-sz** | Average size (in kilobytes) of each read request. |
| **w/s** | Number of **write requests per second** issued to the device. |
| **wkB/s** | Amount of **data written per second**, in kilobytes. |
| **wrqm/s** | Number of **merged write requests per second**. |
| **%wrqm** | Percentage of write requests that were merged, calculated in a similar way to `%rrqm`. |
| **w_await** | Average time (ms) for write requests to complete. |
| **wareq-sz** | Average size (kB) of each write request. |
| **d/s** | Number of **discard requests per second** (TRIM / UNMAP commands, mostly on SSDs). |
| **dkB/s** | Amount of data discarded per second (in kB). |
| **drqm/s** | Merged discard requests per second. |
| **%drqm** | Percentage of discard requests merged. |
| **d_await** | Average time (ms) for discard requests to complete. |
| **dareq-sz** | Average discard request size (kB). |
| **f/s** | Number of **flush requests per second**; these force buffered data to non-volatile storage. |
| **f_await** | Average time (ms) for flush requests to complete. |
| **aqu-sz** | **Average queue size**: the average number of I/O requests waiting in the queue or being serviced during the sample interval. |
| **%util** | **Percentage of time the device was busy** processing I/O requests. Values near 100% indicate full utilization, but it can exceed 100% on devices with multiple queues or parallel I/O. |

The most interesting columns for us are:

+ **%util**: high means the device is saturated (I/O-bound).
+ **r_await**, **w_await**, **f_await**: high means requests are waiting a long time (latency issue).
+ **aqu-sz**: high means the I/O queue is growing.
+ **r/s**, **w/s**: high means lots of small I/O operations.

*Fun fact: although iostat displays units corresponding to kilobytes (kB), megabytes (MB)..., it actually uses kibibytes (KiB), mebibytes (MiB)... A kibibyte is equal to 1024 bytes, and a mebibyte is equal to 1024 kibibytes. [source](https://man.archlinux.org/man/iostat.1.en)*

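Once a disk stands out (like `sda` above), you can point iostat at just that device instead of scrolling through all of them. This is a small sketch using standard sysstat options: the device list, `--human` for readable units, and a trailing count so it stops on its own:

```bash
# extended stats for sda only, one report per second, stop after 5 reports
iostat -x --human 1 5 sda
```

Dropping the count makes it run until interrupted, which is handier for live watching.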
### Note

In the previous example, the `dm-*` devices are virtual block devices managed by the device mapper (used here by LVM).

To identify which physical volumes they correspond to, we can run this command:

```bash
ls -l /dev/mapper
```

Or this command as root:

```bash
dmsetup ls --tree
```

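Another way to see the whole mapping at a glance is `lsblk`, which prints the device tree with the kernel name (the `dm-N` you see in iostat) next to each mapper name. The `-o` column selection below is just one convenient choice:

```bash
# show kernel name (e.g. dm-3) next to the mapper name for every block device
lsblk -o NAME,KNAME,TYPE,SIZE,MOUNTPOINT
```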
## Find the process causing high I/O usage

For that we can use the iotop command *(the package is named `iotop` on Fedora, Debian and Arch Linux)*. We usually need to run iotop as root, as it needs elevated privileges.

With the following options it's easier to spot processes causing high I/O usage:

```bash
sudo iotop -aoP
```

What these options do:

+ `-a`: show accumulated I/O since iotop started
+ `-o`: only show processes actually doing I/O
+ `-P`: show processes, not individual threads

We can also use pidstat *(provided by the package `sysstat` on Fedora, Debian and Arch Linux)*. It's better to run this command as root, otherwise you'll only see the processes belonging to the user running the command, not all processes.

To show per-process read/write operations, updating every second:

```bash
pidstat -d 1
```

We can then note the PID, or the command, on the line showing heavy disk reads or writes.

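Once you have a suspect, you can ask pidstat to follow that PID alone to confirm it really is the one hammering the disk (`-p` is a standard pidstat flag; `1234` is a placeholder PID):

```bash
# disk I/O of a single process, one report per second
pidstat -d -p 1234 1
```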
## Limit disk usage

### Podman

With podman we can use arguments in the run command to limit disk I/O, as mentioned in the [documentation](https://docs.podman.io/en/stable/markdown/podman-run.1.html):

| Argument | Effect |
| :- | :- |
| **--device-read-bps=path:rate** | Limit read rate (in bytes per second) from a device *(e.g. `--device-read-bps=/dev/sda:1mb`)*. |
| **--device-read-iops=path:rate** | Limit read rate (in IO operations per second) from a device *(e.g. `--device-read-iops=/dev/sda:1000`)*. |
| **--device-write-bps=path:rate** | Limit write rate (in bytes per second) to a device *(e.g. `--device-write-bps=/dev/sda:1mb`)*. |
| **--device-write-iops=path:rate** | Limit write rate (in IO operations per second) to a device *(e.g. `--device-write-iops=/dev/sda:1000`)*. |

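Putting it together, here's a sketch of a run command capping a container's writes to the HDD; the image name and the limit values are just placeholders to adapt to your setup:

```bash
# cap this container to 10 MB/s of writes and 200 write IOPS on /dev/sda
podman run -d \
  --device-write-bps=/dev/sda:10mb \
  --device-write-iops=/dev/sda:200 \
  my-heavy-service:latest
```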
> These may not work in rootless mode unless I/O delegation is enabled.

You can verify which resource delegations are enabled with this command:

```bash
cat "/sys/fs/cgroup/user.slice/user-$(id -u).slice/user@$(id -u).service/cgroup.controllers"
```

In our case we need `io` in the output.

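The same check can be scripted, for example to warn early in a deployment script. A minimal sketch, assuming cgroup v2 and a running systemd user session:

```bash
# print whether the "io" controller is delegated to the current user
f="/sys/fs/cgroup/user.slice/user-$(id -u).slice/user@$(id -u).service/cgroup.controllers"
if [ -r "$f" ] && grep -qw io "$f"; then
    echo "io delegation enabled"
else
    echo "io delegation missing"
fi
```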
If it's not present you can create the file `/etc/systemd/system/user@.service.d/delegate.conf` with the following content:

```ini
[Service]
Delegate=io
```

You can also list the other resource controllers you want to delegate to users, for example `memory`, `pids`, `cpu` and `cpuset`; the file would then look like this:

```ini
[Service]
Delegate=io memory pids cpu cpuset
```

You then need to log out and log back in for the new delegations to take effect.

### systemd

To limit disk I/O for a systemd service we can use slices.

The most useful options for the slice section are listed below; you can see all the available options in the [documentation](https://www.freedesktop.org/software/systemd/man/latest/systemd.resource-control.html).

| Property | Description |
| :- | :- |
| **`IOAccounting=`** | Enables collection of I/O statistics (used by `systemd-cgtop`, `systemd-analyze`, etc.). |
| **`IOWeight=weight`** | Sets *relative* I/O priority (1-10000, default 100). A higher value gives the unit a larger share of available bandwidth when multiple units compete. |
| **`IODeviceWeight=device weight`** | Assigns a per-device weight, overriding `IOWeight=` for that device. |
| **`IOReadBandwidthMax=device bytes`** | Sets an **absolute** cap on read bandwidth, e.g. `/dev/sda 10M`. The unit cannot exceed this, even if idle bandwidth exists. Possible suffixes are K, M, G or T for kilobytes, megabytes, gigabytes or terabytes respectively; otherwise the value is parsed in bytes/s. |
| **`IOWriteBandwidthMax=device bytes`** | Same, but for write bandwidth. |
| **`IOReadIOPSMax=device limit`** | Caps the number of read operations per second, e.g. `/dev/nvme0n1 500`. |
| **`IOWriteIOPSMax=device limit`** | Caps the number of write operations per second. |

To create a slice unit, for example `/etc/systemd/system/io-limited.slice`:

```ini
[Unit]
Description=Slice for IO-limited services

[Slice]
IOAccounting=yes
IOWriteBandwidthMax=/dev/sda 20M
IOReadBandwidthMax=/dev/sda 20M
```

We can then assign services to this slice; in the service section we would have:

```ini
[Service]
Slice=io-limited.slice
```
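After editing the units, reload systemd and check that the service actually landed in the slice; `my-service.service` below is a placeholder name:

```bash
# pick up the new/changed units
sudo systemctl daemon-reload
sudo systemctl restart my-service.service

# confirm the service is running under the slice
systemctl show -p Slice my-service.service

# one-shot per-cgroup resource view (I/O columns need IOAccounting=yes)
systemd-cgtop --iterations=1
```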