From 892b28fefa9b883e205a394667c2cbdd65b4ba08 Mon Sep 17 00:00:00 2001
From: Sam Hadow
Date: Thu, 13 Nov 2025 22:45:22 +0100
Subject: [PATCH] new post, disk I/O

---
 _posts/2025-11-13-disk-IO-troubleshooting.md | 176 +++++++++++++++++++
 1 file changed, 176 insertions(+)
 create mode 100644 _posts/2025-11-13-disk-IO-troubleshooting.md

diff --git a/_posts/2025-11-13-disk-IO-troubleshooting.md b/_posts/2025-11-13-disk-IO-troubleshooting.md
new file mode 100644
index 0000000..9cf1086
--- /dev/null
+++ b/_posts/2025-11-13-disk-IO-troubleshooting.md
@@ -0,0 +1,176 @@
---
layout: post
author: Sam Hadow
---

On my server I self-host quite a lot of services, but I only have 5900 rpm HDDs for the data, and an SSD only for the OS and binaries.
Sometimes these HDDs struggle to keep up with the I/O load of all my services.
So in this short blog post I'll show you the troubleshooting steps to find the culprit of high disk I/O, and how to limit its disk usage.

## Check disk usage

To check disk usage we can use the tool `iostat` *(provided by the package `sysstat` on Fedora, Debian and Arch Linux)*.

To see the extended stats every second:

```bash
iostat -x 1
```

You'll then get an output like this:

```bash
avg-cpu: %user %nice %system %iowait %steal %idle
 4.37 0.00 6.94 21.85 0.00 66.84

Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz w/s wkB/s wrqm/s %wrqm w_await wareq-sz d/s dkB/s drqm/s %drqm d_await dareq-sz f/s f_await aqu-sz %util
dm-0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-3 0.00 0.00 0.00 0.00 0.00 0.00 524.00 8384.00 0.00 0.00 47.82 16.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 25.06 100.00
dm-4 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-5 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sda 0.00 0.00 0.00 0.00 0.00 0.00 524.00 8384.00 0.00 0.00 46.02 16.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 24.11 99.30
sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdc 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
zram0 1.00 4.00 0.00 0.00 0.00 4.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

```

Let's explain each column:

| Column | Meaning |
| :- | :- |
| **Device** | The **block device** name (e.g. `sda`, `dm-0`, etc.). |
| **r/s** | Number of **read requests per second** issued to the device. |
| **rkB/s** | Amount of **data read per second**, in kilobytes. |
| **rrqm/s** | Number of **merged read requests per second** (the kernel merges adjacent reads into a single I/O). |
| **%rrqm** | Percentage of read requests merged, calculated as `100 * rrqm/s / (r/s + rrqm/s)`. |
| **r_await** | Average time (in milliseconds) for read requests to be served, including both queue time and service time. |
| **rareq-sz** | Average size (in kilobytes) of each read request. |
| **w/s** | Number of **write requests per second** issued to the device. |
| **wkB/s** | Amount of **data written per second**, in kilobytes. |
| **wrqm/s** | Number of **merged write requests per second**. |
| **%wrqm** | Percentage of write requests that were merged, calculated in a similar way to %rrqm. |
| **w_await** | Average time (ms) for write requests to complete. |
| **wareq-sz** | Average size (kB) of each write request. |
| **d/s** | Number of **discard requests per second** (TRIM / UNMAP commands, mostly on SSDs). |
| **dkB/s** | Amount of data discarded per second (in kB). |
| **drqm/s** | Merged discard requests per second. |
| **%drqm** | Percentage of discard requests merged. |
| **d_await** | Average time (ms) for discard requests to complete. |
| **dareq-sz** | Average discard request size (kB). |
| **f/s** | Number of **flush requests per second**; these force buffered data to non-volatile storage. |
| **f_await** | Average time (ms) for flush requests to complete. |
| **aqu-sz** | **Average queue size**: the average number of I/O requests waiting in the queue or being serviced during the sample interval. |
| **%util** | **Percentage of time the device was busy** processing I/O requests. Values near 100% indicate saturation for devices that serve requests serially; for devices that serve requests in parallel (RAID arrays, modern SSDs), a high value does not necessarily mean the device is at its limit. |


The most interesting columns for us are:

+ **%util**: high means the device is saturated (I/O-bound).
+ **r_await**, **w_await**, **f_await**: high means requests are waiting a long time (latency issue).
+ **aqu-sz**: high means the I/O queue is growing.
+ **r/s**, **w/s**: high means lots of small I/O operations.

*Fun fact: although iostat displays units labelled kilobytes (kB), megabytes (MB)..., it actually uses kibibytes (KiB), mebibytes (MiB)... A kibibyte is equal to 1024 bytes, and a mebibyte is equal to 1024 kibibytes. [source](https://man.archlinux.org/man/iostat.1.en)*

### Note
In the previous example, the dm-* devices are virtual block devices managed by LVM (through the device mapper).

To identify which physical volumes they correspond to, we can run this command:
```bash
ls -l /dev/mapper
```
Or this command as root:
```bash
dmsetup ls --tree
```
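Another option is `lsblk`, which prints the whole block-device tree in one go, so the mapping is easy to read at a glance (the columns below are standard `lsblk` output fields):
```bash
# Tree view: each dm-* volume appears nested under the physical disk it sits on
lsblk -o NAME,TYPE,SIZE,MOUNTPOINT
```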

## Find the process causing high I/O usage
For that we can use the `iotop` command *(the package is named `iotop` on Fedora, Debian and Arch Linux)*. We usually need to run iotop as root, as it needs elevated privileges.
With the following options it's easier to spot processes causing high I/O usage:
```bash
sudo iotop -aoP
```
What these options do:

+ `-a` = accumulated I/O since start
+ `-o` = only show processes actually doing I/O
+ `-P` = show per-process, not per-thread

We can also use `pidstat` *(provided by the package `sysstat` on Fedora, Debian and Arch Linux)*. It's better to run this command as root, otherwise you'll only see the processes of the user running the command and not all the processes.

To show per-process read/write statistics, updating every second:

```bash
pidstat -d 1
```

We can then write down the PID, or the command, corresponding to the line with a lot of disk writes or disk reads.

## Limit disk usage

### Podman

With podman we can use arguments in the run command to limit disk I/O, as mentioned in the [documentation](https://docs.podman.io/en/stable/markdown/podman-run.1.html):

| Argument | Effect |
| :- | :- |
| **--device-read-bps=path:rate** | Limit read rate (in bytes per second) from a device *(e.g. --device-read-bps=/dev/sda:1mb)*. |
| **--device-read-iops=path:rate** | Limit read rate (in I/O operations per second) from a device *(e.g. --device-read-iops=/dev/sda:1000)*. |
| **--device-write-bps=path:rate** | Limit write rate (in bytes per second) to a device *(e.g. --device-write-bps=/dev/sda:1mb)*. |
| **--device-write-iops=path:rate** | Limit write rate (in I/O operations per second) to a device *(e.g. --device-write-iops=/dev/sda:1000)*. |
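For instance, here is a sketch of a run command combining these flags; the container name and image are placeholders, and `/dev/sda` assumes your data lives on that disk:
```bash
# Hypothetical example: cap this container's writes to /dev/sda
# at 10 MB/s and 500 write operations per second
podman run -d --name io-hungry-service \
  --device-write-bps=/dev/sda:10mb \
  --device-write-iops=/dev/sda:500 \
  docker.io/library/nginx:latest
```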

> These may not work in rootless mode unless I/O delegation is enabled.

You can verify which resource-limit delegations are enabled with this command:
```bash
cat "/sys/fs/cgroup/user.slice/user-$(id -u).slice/user@$(id -u).service/cgroup.controllers"
```
In our case we need `io` in the output.

If it's not present, you can create the file `/etc/systemd/system/user@.service.d/delegate.conf` with the following content:
```ini
[Service]
Delegate=io
```
You can also add the other resource limits you want to delegate to users, for example memory, pids, cpu and cpuset; the file would then look like this:
```ini
[Service]
Delegate=io memory pids cpu cpuset
```

You then need to log out and log back in for the delegation to take effect.

### systemd

To limit disk I/O for a systemd service we can use slices.

The most useful options we can put in a `[Slice]` section are the following; you can see all the available options in the [documentation](https://www.freedesktop.org/software/systemd/man/latest/systemd.resource-control.html):

| Property | Description |
| :- | :- |
| **`IOAccounting=`** | Enables collection of I/O statistics (used by `systemd-cgtop`, `systemd-analyze`, etc.). |
| **`IOWeight=weight`** | Sets *relative* I/O priority (1–10000, default 100). A higher value gives the unit a larger share of available bandwidth when multiple units compete. |
| **`IODeviceWeight=device weight`** | Assigns a per-device weight, overriding `IOWeight` for that device. |
| **`IOReadBandwidthMax=device bytes`** | Sets an **absolute** cap on read bandwidth, e.g. `/dev/sda 10M`. Units cannot exceed this, even if idle bandwidth exists. Possible suffixes are K, M, G, or T for kilobytes, megabytes, gigabytes, or terabytes, respectively; otherwise the bandwidth is parsed in bytes per second. |
| **`IOWriteBandwidthMax=device bytes`** | Same, but for write bandwidth. |
| **`IOReadIOPSMax=device limit`** | Caps the number of read operations per second, e.g. `/dev/nvme0n1 500`. |
| **`IOWriteIOPSMax=device limit`** | Caps the number of write operations per second. |


To create a slice unit, for example `/etc/systemd/system/io-limited.slice`:
```ini
[Unit]
Description=Slice for IO-limited services

[Slice]
IOAccounting=yes
IOWriteBandwidthMax=/dev/sda 20M
IOReadBandwidthMax=/dev/sda 20M
```

We can then assign services to this slice; in the `[Service]` section of each unit we would have:
```ini
[Service]
Slice=io-limited.slice
```
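Finally, a short sketch to apply and sanity-check the setup; `myservice.service` is a placeholder for whichever unit you assigned to the slice:
```bash
# Reload unit files and restart the service so it moves into the slice
sudo systemctl daemon-reload
sudo systemctl restart myservice.service

# Confirm the service is now running in the slice
systemctl show -p Slice myservice.service

# Inspect the I/O properties systemd applied to the slice
systemctl show io-limited.slice | grep '^IO'

# Watch per-unit I/O live (IOAccounting=yes makes this view useful)
sudo systemd-cgtop --order=io
```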