From 892b28fefa9b883e205a394667c2cbdd65b4ba08 Mon Sep 17 00:00:00 2001
From: Sam Hadow
Date: Thu, 13 Nov 2025 22:45:22 +0100
Subject: [PATCH] new post, disk I/O

---
 _posts/2025-11-13-disk-IO-troubleshooting.md | 176 +++++++++++++++++++
 1 file changed, 176 insertions(+)
 create mode 100644 _posts/2025-11-13-disk-IO-troubleshooting.md

diff --git a/_posts/2025-11-13-disk-IO-troubleshooting.md b/_posts/2025-11-13-disk-IO-troubleshooting.md
new file mode 100644
index 0000000..9cf1086
--- /dev/null
+++ b/_posts/2025-11-13-disk-IO-troubleshooting.md
@@ -0,0 +1,176 @@
---
layout: post
author: Sam Hadow
---

On my server I self-host quite a lot of services, but I only have 5900 rpm HDDs for the data, and an SSD only for the OS and binaries.
Sometimes these HDDs struggle to keep up with the I/O load of all my services.
So in this short blog post I'll show you the troubleshooting steps to find the culprit of high disk I/O, and how to limit its disk usage.

## Check disk usage

To check disk usage we can use the tool `iostat` *(provided by the package `sysstat` on Fedora, Debian and Arch Linux)*.

To see the extended stats every second:

```bash
iostat -x 1
```

You'll then get an output like this:

```bash
avg-cpu: %user %nice %system %iowait %steal %idle
 4.37 0.00 6.94 21.85 0.00 66.84

Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz w/s wkB/s wrqm/s %wrqm w_await wareq-sz d/s dkB/s drqm/s %drqm d_await dareq-sz f/s f_await aqu-sz %util
dm-0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-3 0.00 0.00 0.00 0.00 0.00 0.00 524.00 8384.00 0.00 0.00 47.82 16.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 25.06 100.00
dm-4 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-5 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sda 0.00 0.00 0.00 0.00 0.00 0.00 524.00 8384.00 0.00 0.00 46.02 16.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 24.11 99.30
sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdc 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
zram0 1.00 4.00 0.00 0.00 0.00 4.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

```

Let's explain each column:

| Column | Meaning |
| :- | :- |
| **Device** | The **block device** name (e.g. `sda`, `dm-0`, etc.). |
| **r/s** | Number of **read requests per second** issued to the device. |
| **rkB/s** | Amount of **data read per second**, in kilobytes. |
| **rrqm/s** | Number of **merged read requests per second** (the kernel merges adjacent reads into a single I/O). |
| **%rrqm** | Percentage of read requests merged, calculated as `100 * rrqm/s / (r/s + rrqm/s)`. |
| **r_await** | Average time (in milliseconds) for read requests to be served, including both queue time and service time. |
| **rareq-sz** | Average size (in kilobytes) of each read request. |
| **w/s** | Number of **write requests per second** issued to the device. |
| **wkB/s** | Amount of **data written per second**, in kilobytes. |
| **wrqm/s** | Number of **merged write requests per second**. |
| **%wrqm** | Percentage of write requests that were merged, calculated in a similar way to %rrqm. |
| **w_await** | Average time (ms) for write requests to complete. |
| **wareq-sz** | Average size (kB) of each write request. |
| **d/s** | Number of **discard requests per second** (TRIM / UNMAP commands, mostly on SSDs). |
| **dkB/s** | Amount of data discarded per second (in kB). |
| **drqm/s** | Merged discard requests per second. |
| **%drqm** | Percentage of discard requests merged. |
| **d_await** | Average time (ms) for discard requests to complete. |
| **dareq-sz** | Average discard request size (kB). |
| **f/s** | Number of **flush requests per second**; these force buffered data to non-volatile storage. |
| **f_await** | Average time (ms) for flush requests to complete. |
| **aqu-sz** | **Average queue size**: the average number of I/O requests waiting in the queue or being serviced during the sample interval. |
| **%util** | **Percentage of time the device was busy** processing I/O requests. Values near 100% indicate saturation for devices that serve requests serially; for devices that serve requests in parallel (RAID arrays, modern SSDs), a high value does not necessarily mean the device is at its limit. |


The most interesting columns for us are:

+ **%util**: high means the device is saturated (I/O-bound).
+ **r_await**, **w_await**, **f_await**: high means requests are waiting a long time (latency issue).
+ **aqu-sz**: high means the I/O queue is growing.
+ **r/s**, **w/s**: high means lots of small I/O operations.

*Fun fact: although iostat displays units labelled kilobytes (kB), megabytes (MB)..., it actually uses kibibytes (KiB), mebibytes (MiB)... A kibibyte is equal to 1024 bytes, and a mebibyte is equal to 1024 kibibytes. [source](https://man.archlinux.org/man/iostat.1.en)*

### Note
In the previous example, the dm-* devices are virtual block devices managed by LVM (through the device mapper).

To identify which physical volumes they correspond to, we can run this command:
```bash
ls -l /dev/mapper
```
Or this command as root:
```bash
dmsetup ls --tree
```
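Another option is `lsblk`, which prints the whole block-device tree in one go, so the mapping is easy to read at a glance (the columns below are standard `lsblk` output fields):
```bash
# Tree view: each dm-* volume appears nested under the physical disk it sits on
lsblk -o NAME,TYPE,SIZE,MOUNTPOINT
```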

## Find the process causing high I/O usage
For that we can use the `iotop` command *(the package is named `iotop` on Fedora, Debian and Arch Linux)*. We usually need to run iotop as root, as it needs elevated privileges.
With the following options it's easier to spot processes causing high I/O usage:
```bash
sudo iotop -aoP
```
What these options do:

+ `-a` = accumulated I/O since start
+ `-o` = only show processes actually doing I/O
+ `-P` = show per-process, not per-thread

We can also use `pidstat` *(provided by the package `sysstat` on Fedora, Debian and Arch Linux)*. It's better to run this command as root, otherwise you'll only see the processes of the user running the command and not all the processes.

To show per-process read/write statistics, updating every second:

```bash
pidstat -d 1
```

We can then write down the PID, or the command, corresponding to the line with a lot of disk writes or disk reads.

## Limit disk usage

### Podman

With podman we can use arguments in the run command to limit disk I/O, as mentioned in the [documentation](https://docs.podman.io/en/stable/markdown/podman-run.1.html):

| Argument | Effect |
| :- | :- |
| **--device-read-bps=path:rate** | Limit read rate (in bytes per second) from a device *(e.g. --device-read-bps=/dev/sda:1mb)*. |
| **--device-read-iops=path:rate** | Limit read rate (in I/O operations per second) from a device *(e.g. --device-read-iops=/dev/sda:1000)*. |
| **--device-write-bps=path:rate** | Limit write rate (in bytes per second) to a device *(e.g. --device-write-bps=/dev/sda:1mb)*. |
| **--device-write-iops=path:rate** | Limit write rate (in I/O operations per second) to a device *(e.g. --device-write-iops=/dev/sda:1000)*. |
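For instance, here is a sketch of a run command combining these flags; the container name and image are placeholders, and `/dev/sda` assumes your data lives on that disk:
```bash
# Hypothetical example: cap this container's writes to /dev/sda
# at 10 MB/s and 500 write operations per second
podman run -d --name io-hungry-service \
  --device-write-bps=/dev/sda:10mb \
  --device-write-iops=/dev/sda:500 \
  docker.io/library/nginx:latest
```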

> These may not work in rootless mode unless I/O delegation is enabled.

You can verify which resource-limit delegations are enabled with this command:
```bash
cat "/sys/fs/cgroup/user.slice/user-$(id -u).slice/user@$(id -u).service/cgroup.controllers"
```
In our case we need `io` in the output.

If it's not present, you can create the file `/etc/systemd/system/user@.service.d/delegate.conf` with the following content:
```ini
[Service]
Delegate=io
```
You can also add the other resource limits you want to delegate to users, for example memory, pids, cpu and cpuset; the file would then look like this:
```ini
[Service]
Delegate=io memory pids cpu cpuset
```

You then need to log out and log back in for the delegation to take effect.

### systemd

To limit disk I/O for a systemd service we can use slices.

The most useful options we can put in a `[Slice]` section are the following; you can see all the available options in the [documentation](https://www.freedesktop.org/software/systemd/man/latest/systemd.resource-control.html):

| Property | Description |
| :- | :- |
| **`IOAccounting=`** | Enables collection of I/O statistics (used by `systemd-cgtop`, `systemd-analyze`, etc.). |
| **`IOWeight=weight`** | Sets *relative* I/O priority (1–10000, default 100). A higher value gives the unit a larger share of available bandwidth when multiple units compete. |
| **`IODeviceWeight=device weight`** | Assigns a per-device weight, overriding `IOWeight` for that device. |
| **`IOReadBandwidthMax=device bytes`** | Sets an **absolute** cap on read bandwidth, e.g. `/dev/sda 10M`. Units cannot exceed this, even if idle bandwidth exists. Possible suffixes are K, M, G, or T for kilobytes, megabytes, gigabytes, or terabytes, respectively; otherwise the bandwidth is parsed in bytes per second. |
| **`IOWriteBandwidthMax=device bytes`** | Same, but for write bandwidth. |
| **`IOReadIOPSMax=device limit`** | Caps the number of read operations per second, e.g. `/dev/nvme0n1 500`. |
| **`IOWriteIOPSMax=device limit`** | Caps the number of write operations per second. |


To create a slice unit, for example `/etc/systemd/system/io-limited.slice`:
```ini
[Unit]
Description=Slice for IO-limited services

[Slice]
IOAccounting=yes
IOWriteBandwidthMax=/dev/sda 20M
IOReadBandwidthMax=/dev/sda 20M
```

We can then assign services to this slice; in the `[Service]` section of each unit we would have:
```ini
[Service]
Slice=io-limited.slice
```
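Finally, a short sketch to apply and sanity-check the setup; `myservice.service` is a placeholder for whichever unit you assigned to the slice:
```bash
# Reload unit files and restart the service so it moves into the slice
sudo systemctl daemon-reload
sudo systemctl restart myservice.service

# Confirm the service is now running in the slice
systemctl show -p Slice myservice.service

# Inspect the I/O properties systemd applied to the slice
systemctl show io-limited.slice | grep '^IO'

# Watch per-unit I/O live (IOAccounting=yes makes this view useful)
sudo systemd-cgtop --order=io
```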