By chance I found the blog and videos of Tanel Poder. The content is excellent if you want to get into low-level I/O debugging on Linux. I especially recommend the following blog post with its corresponding video:

Inspired by the detail of his analysis, I spent some time digging around in Linux block I/O. In this blog post I collect some useful information and tools for Linux storage debugging, mostly as a notepad for myself. May it be helpful for you too.

Tools

bcc

The BPF Compiler Collection has a huge set of tools to sniff and snoop on various points of userland/kernel interaction. Whether you want to debug TCP connections, memory management or block I/O, you’ll find something here. The collection is available on most distros as bcc-tools, but the scripts are often not in $PATH and must be called manually from /usr/share/bcc/tools/. All the tools use small BPF programs which are loaded into the kernel to probe a function or event and collect information about it. If you’re interested in how to do that, check out the cool bcc Python Developer Tutorial.
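
Because the scripts usually live outside $PATH, a small helper can save some typing. This is only a minimal sketch: the function name bcc_tool and the variable BCC_TOOLS_DIR are hypothetical and not part of bcc, and the default directory is the one mentioned above, which may differ on your distro.

```shell
# Directory where many distros install the bcc scripts (see text above).
# BCC_TOOLS_DIR is a hypothetical override, bcc itself does not read it.
BCC_TOOLS_DIR="${BCC_TOOLS_DIR:-/usr/share/bcc/tools}"

# Print the full path of a bcc tool, or fail if it is not installed there.
bcc_tool() {
    local tool="$BCC_TOOLS_DIR/$1"
    if [ -x "$tool" ]; then
        echo "$tool"
    else
        echo "bcc tool '$1' not found in $BCC_TOOLS_DIR" >&2
        return 1
    fi
}
```

With this in place a tool can be started as root with, for example, "$(bcc_tool bitesize)".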

Intel's mlc: Memory Latency Checker

The Memory Latency Checker is useful to check the throughput and latency from CPU to RAM. This is especially interesting on NUMA systems for finding bottlenecks, as the throughput of a CPU might be capped if the memory latency is too high.
See also numactl and lstopo.
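
Before measuring, it helps to know whether the machine has more than one NUMA node at all. The following sketch reads the node layout from sysfs; the function name numa_nodes and its optional sysfs-root argument are illustrative helpers, while the cpulist files under /sys/devices/system/node are provided by the Linux kernel.

```shell
# Print each NUMA node together with the CPUs that belong to it.
# $1 is an optional sysfs root (hypothetical parameter, handy for testing).
numa_nodes() {
    local root="${1:-/sys}" node
    for node in "$root"/devices/system/node/node[0-9]*; do
        # Skip the unexpanded glob on systems without node directories
        [ -r "$node/cpulist" ] || continue
        echo "${node##*/}: cpus $(cat "$node/cpulist")"
    done
    return 0
}

numa_nodes
```

numactl --hardware prints similar information and additionally shows memory sizes and node distances.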

lstopo

When it comes to the physical layout of a machine, tools like lshw or lspci give good information. Additionally, the command lstopo --of ascii from the package hwloc can draw a picture of the hardware topology in a terminal.

0x.tools

The tools on 0x.tools are interesting for Linux application debugging. While I have not used them much yet, they contain good examples of how to use perf to get a tree visualization of where in the kernel code the CPU spends most of its time.

perf record -g -F 2 -a -o perf_log
perf report -i perf_log

dstat

dstat is the successor of well-known tools like vmstat, iostat and ifstat. It aims to unify their interfaces, make usage easier and add more information. See https://linux.die.net/man/1/dstat for a full manual. For debugging storage I often use dstat -pcmrd to see IOPS and throughput, like this:

[root@test /tmp]# dstat -pcmrd
---procs--- ----total-usage---- ------memory-usage----- --io/total- -dsk/total-
run blk new|usr sys idl wai stl| used  free  buf   cach| read  writ| read  writ
1.0   0    |                   | 594M  257M 2172k 2670M|           |
1.0   0   0|  1  55  42   0   0| 594M  185M 2172k 2742M|   0   184 |   0   120M
1.0   0   0|  1  57  41   0   0| 590M  131M 2172k 2801M|   0   175 |   0   164M
1.0   0   0|  1  57  41   0   0| 587M  121M 2172k 2812M|1.00   195 |4096B  164M
1.0   0 3.0|  1  57  40   0   0| 584M  111M 2172k 2826M|1.00   234 |4095B  160M
1.0   0   0|  3  57  41   0   0| 582M  104M 2172k 2836M|   0   222 |   0   148M
1.0   0   0|  0  62  30   6   0| 578M  109M 2172k 2835M|   0   460 |   0   365M

Here one can see that a single running process issues a few hundred rather large write operations per second, which keeps more than half a CPU busy in kernel code (the sys column).
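
Dividing the throughput column by the IOPS column gives the average size of one operation, which makes "rather large" concrete. A small sketch follows; avg_io_kib is a hypothetical helper name, and it assumes that dstat's M means MiB.

```shell
# Average I/O size in KiB from throughput (MiB/s) and operations per second.
avg_io_kib() {
    awk -v mib="$1" -v ops="$2" 'BEGIN {
        if (ops <= 0) { print "n/a"; exit }
        printf "%.1f\n", mib * 1024 / ops
    }'
}

# Sample from the dstat output above: 175 writes/s at 164M/s
avg_io_kib 164 175    # prints 959.6
```

So in that sample each write request carried a little under 1 MiB on average.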

Typical issues

Kernel splits I/O operations

Even though an application submits I/O with a specific blocksize to the kernel, the operations might get split by the block layer before they reach the actual disks. I still don’t know exactly when this happens, but I guess it is to align the operations with the physical blocksizes. In any case it might change the result of your test, especially if you sample throughput or IOPS with different blocksizes.
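
One place worth checking, offered here as a starting point rather than a full explanation, is the request queue limits that the kernel exposes per device in sysfs. Requests larger than max_sectors_kb cannot be sent to the device in one piece, and the block size files show what the device reports as its logical and physical sector size. The function name queue_limits and its optional sysfs-root argument are illustrative; the file names are the kernel's.

```shell
# Print per-device request queue limits that influence how I/O is split.
# $1 is an optional sysfs root (hypothetical parameter, handy for testing).
queue_limits() {
    local root="${1:-/sys}" q f dev
    for q in "$root"/block/*/queue; do
        [ -d "$q" ] || continue
        # Derive the device name, e.g. /sys/block/sda/queue -> sda
        dev=${q%/queue}
        dev=${dev##*/}
        for f in max_sectors_kb max_hw_sectors_kb logical_block_size physical_block_size; do
            if [ -r "$q/$f" ]; then
                echo "$dev $f $(cat "$q/$f")"
            fi
        done
    done
    return 0
}

queue_limits
```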

A good way to analyse this is the bitesize tool from bcc.

[root@test]# ./bitesize
Tracing block I/O... Hit Ctrl-C to end. ^C
Process Name = dd
     Kbytes              : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 1000     |*****                                   |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 1000     |*****                                   |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 3000     |*****************                       |
       512 -> 1023       : 3000     |*****************                       |
      1024 -> 2047       : 7000     |****************************************|

Another method is to test with a defined blocksize and check the number of IOPS achieved. This is of course more difficult on busy systems. The following script iterates over different blocksizes and displays the IOPS with dstat.

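#!/bin/bash
# Write with different blocksizes and watch the resulting IOPS in dstat.
# WARNING: this overwrites /test.img

# On Ctrl-C stop the background dd (if any) and exit
trap 'echo exiting...; kill "$DD_PID" 2>/dev/null; exit' SIGINT

# Blocksizes to test; override by passing a quoted list as first argument
IOSIZES=${1:-512 1024 2048 4k 8k 16k 32k 64k 512k 1M 4M}

for i in ${IOSIZES}; do
        echo "testing with blocksize of $i"
        echo "-----------------------------"
        # oflag=direct bypasses the page cache
        dd oflag=direct if=/dev/urandom of=/test.img bs=$i &
        DD_PID=$!
        # Show procs, CPU, memory, IOPS and throughput for 5 seconds
        timeout 5 dstat -pcmrd
        # Stop only the dd started above, not every dd on the system
        kill -9 "$DD_PID"
        wait "$DD_PID" 2>/dev/null
        echo
done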
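
To interpret the numbers from such a run, it helps to know how many operations per second the application itself issues at a given blocksize and throughput. If dstat shows clearly more IOPS than that, the writes were probably split on their way down. The sketch below uses a hypothetical helper name, expected_iops, made-up example numbers, and the assumption that dstat's M means MiB.

```shell
# Operations per second the application issues when it moves
# $1 MiB/s using a blocksize of $2 KiB.
expected_iops() {
    awk -v mib="$1" -v bs_kib="$2" 'BEGIN {
        if (bs_kib <= 0) { print "n/a"; exit }
        printf "%.0f\n", mib * 1024 / bs_kib
    }'
}

# Hypothetical run: dd with bs=4M (4096 KiB) writing 160 MiB/s
expected_iops 160 4096    # prints 40
```

If dstat reported 160 write IOPS for this hypothetical run, each 4M write would have reached the disk as four requests on average.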