By chance I found the blog and videos of Tanel Poder. The content is excellent if you want to get into low-level I/O debugging on Linux; I especially recommend the following blogposts and their corresponding videos:
- https://tanelpoder.com/posts/high-performance-block-io-on-linux/
- https://tanelpoder.com/posts/11m-iops-with-10-ssds-on-amd-threadripper-pro-workstation/
Inspired by the detail of his analysis I started to dig around in Linux block I/O for a while. In this blogpost I collect some useful information and tools for Linux storage debugging, mostly as a notepad for myself. May it be helpful for you too.
Tools
bcc
The BPF Compiler Collection has a huge set of tools to sniff and snoop on various points of userland/kernel interaction.
No matter if you want to debug TCP connections, memory management or block I/O you’ll find something here.
The collection is available on most distros as bcc-tools, but the scripts are often not in $PATH and must be called manually from /usr/share/bcc/tools/.
All the tools use small BPF programs which are loaded into the kernel to probe a function or event and collect information about it.
If you're interested in how to do that, check out the cool bcc Python Developer Tutorial.
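To get a first overview of block I/O latencies I usually start with biolatency and biosnoop from the collection (paths as on a typical bcc-tools install, tool names may differ slightly between versions):
# latency histogram of all block I/O, aggregated until Ctrl-C
/usr/share/bcc/tools/biolatency
# one line per block I/O with process, disk, size and latency
/usr/share/bcc/tools/biosnoop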
Intel's mlc: Memory Latency Checker
The Memory Latency Checker is useful to check the throughput and latency from CPU to RAM.
Especially on NUMA systems this can be of great interest for finding bottlenecks, as the throughput of a CPU might be capped if the memory latency is too high.
See also numactl and lstopo.
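A minimal sketch of how I run it (option names taken from Intel's documentation for recent mlc versions, so they may differ on yours; numactl is only included for comparison):
# idle latency from each NUMA node to each other node
./mlc --latency_matrix
# memory bandwidth between each pair of NUMA nodes
./mlc --bandwidth_matrix
# the NUMA layout the kernel sees, for comparison
numactl --hardware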
lstopo
When it comes to the physical layout of a machine, there are tools like lshw or lspci which can give good information.
Additionally, the command lstopo --of ascii from the package hwloc can draw a picture of the hardware topology in a terminal.
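For a quick look I usually combine these roughly like this (options as on common distros, adjust to your hardware):
# summary of disks and storage controllers
lshw -short -class disk -class storage
# PCI view, e.g. to find NVMe devices and the slots they sit in
lspci -nn | grep -i nvme
# ASCII drawing of sockets, caches, cores and attached PCI devices
lstopo --of ascii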
0x.tools
The tools on 0x.tools are interesting for Linux application debugging.
While I have not used them much yet, they contain good examples of how to use perf to get a tree visualization of where in the kernel code the CPU spends most of its time.
# record call graphs (-g) on all CPUs (-a), sampling twice per second (-F 2)
perf record -g -F 2 -a -o perf_log
# browse the collected samples as an expandable call tree
perf report -i perf_log
dstat
dstat is the successor of well-known tools like vmstat, iostat and ifstat.
It aims to unify their interfaces, make usage easier and add more information.
See https://linux.die.net/man/1/dstat for a full manual.
For debugging storage I often use dstat -pcmrd to see IOPS and throughput like this:
[root@test /tmp]# dstat -pcmrd
---procs--- ----total-usage---- ------memory-usage----- --io/total- -dsk/total-
run blk new|usr sys idl wai stl| used free buf cach| read writ| read writ
1.0 0 | | 594M 257M 2172k 2670M| |
1.0 0 0| 1 55 42 0 0| 594M 185M 2172k 2742M| 0 184 | 0 120M
1.0 0 0| 1 57 41 0 0| 590M 131M 2172k 2801M| 0 175 | 0 164M
1.0 0 0| 1 57 41 0 0| 587M 121M 2172k 2812M|1.00 195 |4096B 164M
1.0 0 3.0| 1 57 40 0 0| 584M 111M 2172k 2826M|1.00 234 |4095B 160M
1.0 0 0| 3 57 41 0 0| 582M 104M 2172k 2836M| 0 222 | 0 148M
1.0 0 0| 0 62 30 6 0| 578M 109M 2172k 2835M| 0 460 | 0 365M
Here one can see that a single process is writing with rather large I/Os (roughly 160 MB/s at around 200 IOPS) and keeps more than half a CPU busy in kernel code.
Typical issues
Kernel splits I/O operations
Even though an application submits I/O with a specific blocksize to the kernel, the operations might get split by the block layer before they reach the actual disks. I still don't know exactly when this happens, but I guess it is done to align the operations with the physical blocksizes. In any case it might change the result of your test, especially if you sample throughput or IOPS with different blocksizes.
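Some of the limits that lead to splitting are visible in sysfs; the block layer will not build requests larger than these values (the device name is just an example, adjust it to yours):
# largest request size in KB the block layer will currently build
cat /sys/block/nvme0n1/queue/max_sectors_kb
# hardware limit reported by the device/driver
cat /sys/block/nvme0n1/queue/max_hw_sectors_kb
# logical and physical blocksizes of the device
cat /sys/block/nvme0n1/queue/logical_block_size
cat /sys/block/nvme0n1/queue/physical_block_size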
A good way to analyse this is the tool bitesize from bcc.
[root@test]# ./bitesize
Tracing block I/O... Hit Ctrl-C to end. ^C
Process Name = dd
Kbytes : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 1000 |***** |
8 -> 15 : 0 | |
16 -> 31 : 0 | |
32 -> 63 : 0 | |
64 -> 127 : 1000 |***** |
128 -> 255 : 0 | |
256 -> 511 : 3000 |***************** |
512 -> 1023 : 3000 |***************** |
1024 -> 2047 : 7000 |****************************************|
Another method is to test with a defined blocksize and check the number of IOPS done. This is of course more difficult on busy systems. The following script iterates over different blocksizes and displays the IOPS with dstat.
#!/bin/bash
trap "echo exiting...; exit" SIGINT

# blocksizes to test; can be overridden by the first argument
IOSIZES=${1:-"512 1024 2048 4k 8k 16k 32k 64k 512k 1M 4M"}

for i in ${IOSIZES}; do
    echo "testing with blocksize of $i"
    echo "-----------------------------"
    # write with direct I/O in the background, bypassing the page cache
    dd oflag=direct if=/dev/urandom of=/test.img bs=$i &
    # watch the resulting IOPS and throughput for 5 seconds
    timeout 5 dstat -pcmrd
    killall -9 dd
    echo
done
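Saved as e.g. blocksize_test.sh (the name is just an example), the script runs with its default list or with a custom one passed as the first argument:
chmod +x blocksize_test.sh
./blocksize_test.sh "4k 64k 1M"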