blog.domainmess.org

Vintage Telephone Home Network

Sat, 04 May 2024 13:46:23 +0200

Through my work at Cosmic Cable Ltd. I had some experience with old analog & rotary telephones. Over time my partner and I wished for some W48 or W49 telephones at home for internal calls. This blogpost will follow the different stages of this setup as (read: if) it progresses.

The plan

The initial idea was to have two telephones: one in the living room on the upper floor, the other one somewhere downstairs. So as a first step I searched for some kind of ATA hardware that is able to handle two telephones and supports pulse dialing, which most of the cheap stuff doesn’t. Eventually I found the blog of a similar project called duck telecom, which describes a way to configure a Grandstream HT802 to make direct calls between the two provided lines without the need for an external SIP server.

I ordered one, plus a W48 and a W49, both in black, at the bay, then cleaned, tested and polished them.

Configuring the HT802

Following the duck telecom howto I was able to get two phones ready to call each other without an external SIP server. The HT802 uses DHCP by default, so once you have found the IP it was assigned, it can be accessed via HTTP and SSH.

First, pulse dialing has to be enabled on each port. Then the dialplan of each port has to get a short dial to the listening UDP port of the other one. For testing I just used 01 and 02 to call the two ports, so I had to add <02=*47127*0*0*1*5062> to the dialplan of the first port. This string consists of the number which will be replaced before the = sign, the function code for direct IP calling (*47) and then the IP and port to call (127.0.0.1:5062), with * as delimiter. Check the Local SIP Port setting to verify that you’re using the correct port.
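The structure of that dial plan entry can be sketched in a few lines of Python. This is only an illustration of the string format described above; the helper name direct_ip_rule is made up for this example and is not part of any Grandstream tooling.

```python
def direct_ip_rule(short_number: str, ip: str, port: int) -> str:
    """Build a dial plan rule that maps a short number to a direct IP call."""
    # *47 is the direct IP calling code; the dots of the IP and the
    # separator before the port are all written as '*'.
    target = ip.replace(".", "*") + "*" + str(port)
    return f"<{short_number}=*47{target}>"


# rule for the first port: dialing 02 calls the second port on localhost
print(direct_ip_rule("02", "127.0.0.1", 5062))
```

The same helper with "01" and port 5060 gives the rule for the second port.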

After fiddling around a while with the configuration I also discovered how to use the CLI which is accessible via SSH. Every parameter has a unique P-Value number which one must know to get and set those values. These numbers can be found in the configuration template provided by Grandstream. Also the HT802 Manual can be helpful. Eventually the necessary config I set via SSH looked like:

# enter configuration mode
config

# enable pulse for both
set 20521 1
set 20522 1

# config pulse mode to general
set P28165 0
set P28166 0

# set dial plan
set 4200 { <02=*47127*0*0*1*5062> | x+ | \+x+ | *x+ | *xx*x+ }
set 4201 { <01=*47127*0*0*1*5060> | x+ | \+x+ | *x+ | *xx*x+ }

# apply changes
commit
exit
reboot

Et voilà, the first test was successful.

My first eBPF tool: bioslow

Thu, 18 Apr 2024 15:30:29 +0200

I always wanted to get hands-on experience with eBPF as it seems to be an exciting technology. A friend of mine recommended the BCC toolchain to me when I was investigating some TCP socket issues, and I was astonished how rich the possibilities are and how easy the code is compared to writing kernel patches.

When I was working on a test setup for a storage solution, we discovered that the disk images of multiple virtual machines seemed to have I/O operations taking longer than 10 seconds. We suspected the underlying NetApp appliance of hanging once in a while, so I searched for a way to log those long I/Os.

Within the BCC toolchain there are multiple tools for finding long I/O operations, but I was missing a way to set a threshold and actually write them to a log file. So I started to write a bcc-style tool myself to do that.
To get started with eBPF in Python I can recommend the very good (but slightly outdated) bcc Python Developer Tutorial, and also the corresponding reference guide.

In the end I created a tool called bioslow which is capable of logging all I/Os over a certain threshold in various formats:

[user@bcc-bioslow]$ sudo ./bioslow -t 15 -u ms
Logging I/O operations longer than 15 msecs

Tracing ... Ctrl-C to end.
2024-04-30 12:53:27.398519: 27 msecs
2024-04-30 12:53:35.941692: 55 msecs
2024-04-30 12:53:35.953540: 61 msecs
2024-04-30 12:53:35.953967: 61 msecs
2024-04-30 12:53:35.966807: 67 msecs
...

The tool is of course available on github.com/benibr/bcc-bioslow.

The eBPF code for this is rather small and just diffs the time between the start of a block I/O request and its end. The threshold is broken down to a FACTOR which is passed directly to the eBPF program, so that the filtering of what gets reported up to the Python tooling already happens in the kernel.

BPF_HASH(start, struct request *);

void trace_start(struct pt_regs *ctx, struct request *req) {
    // stash start timestamp by request ptr
    u64 ts = bpf_ktime_get_ns();
    start.update(&req, &ts);
}

void trace_stop(struct pt_regs *ctx, struct request *req) {
    u64 *tsp, delta, factor = 0;

    tsp = start.lookup(&req);
    if (tsp != 0) {
        delta = bpf_ktime_get_ns() - *tsp;
        FACTOR
        if (delta >= factor) {
            delta /= factor;
            bpf_trace_printk("%d %d %d\\n",
                req->__data_len, req->cmd_flags, delta);
        }
        start.delete(&req);
    }
}

The main difficulties I had were: 1) Figuring out which kernel functions I actually need to attach to and 2) synchronizing the time between the monotonic time since boot that the kernel reports and datetime.now() from Python.
The next thing would be to rewrite the tool so that it uses BPF_PERF_OUTPUT instead of bpf_trace_printk.
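The time synchronization from point 2 can be sketched as follows. This is a minimal illustration and not necessarily what bioslow does: it assumes the kernel timestamps come from bpf_ktime_get_ns(), which is based on the monotonic clock, and the function names are made up for this example.

```python
import time
from datetime import datetime, timezone


def boot_offset_seconds() -> float:
    """Estimate the offset between wall-clock time and the monotonic clock."""
    # Both clocks are sampled back to back; the tiny gap between the two
    # calls is ignored in this sketch.
    return time.time() - time.monotonic()


def monotonic_to_wallclock(ts_ns: int, offset_s: float) -> datetime:
    """Convert a monotonic timestamp in nanoseconds to a UTC datetime."""
    return datetime.fromtimestamp(offset_s + ts_ns / 1e9, tz=timezone.utc)


offset = boot_offset_seconds()
print(monotonic_to_wallclock(time.monotonic_ns(), offset))
```

Computing the offset once at startup keeps the conversion cheap, at the cost of drifting if the wall clock is adjusted while the tool is running.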

Anyway, this was easier than I expected and real fun. I’m already looking forward to another opportunity to use eBPF.

TCP connection routing with IPtables

Mon, 12 Feb 2024 23:59:59 +0200

“Use less NAT” is a sentence I really like to hear. For a customer project we had to build a high-performance server for a web application. One of the requirements was that ingress connections must not go through a load balancer or a NAT. The throughput would probably not have been throttled by those techniques, and of course the application was really old Python software, but I liked the fact that the setup requirements were different than usual.

However, the concept was that for an update all application containers would be stopped, updated and then started again, which could mean several minutes of downtime. So I tried to find a new concept for a rolling update.

Rerouting new TCP connections

Without a classic load balancer an application usually binds directly to an IP:PORT combination and receives all connections to this combination. If a newer version of this application is to take over, there must be some other mechanism to reroute new incoming connections.

First I tried to teach the application to use the SO_REUSEPORT socket option but soon I figured out that it would be too complicated to use that as a rollover mechanism. The details are described in the blogpost called Loadbalancing TCP connections in the Linux kernel.

My second attempt was to use IPtables to hijack all incoming connections and use a DNAT to route them to another application container. The nice thing about IPtables and conntrack is that established connections keep using the original routing as long as they stay alive, while the DNAT rule is applied to new connections. This breaks the requirement for NAT-less handling of connections, but it would only be used during the upgrade. Additionally it is also possible to use IPtables to do weighted load balancing of connections. You can read a good explanation in this blogpost.
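To give an idea of how weighted load balancing with IPtables works: a common approach is a chain of DNAT rules using the statistic match in random mode. Since the rules are evaluated in order, each rule only sees the traffic that the previous ones did not match, so the probabilities have to be adjusted accordingly. The following Python sketch calculates them; the backend ports and the 2:1:1 weighting are made-up example values, not part of the setup described in this post.

```python
def statistic_probabilities(weights):
    """Per-rule probabilities for a chain of '-m statistic --mode random' rules.

    Rule i only sees traffic that the rules before it did not match. To give
    backend i its share w_i / sum(w), rule i has to match with probability
    w_i / (w_i + ... + w_n). Weights are expected to be positive.
    """
    probabilities = []
    remaining = sum(weights)
    for weight in weights:
        probabilities.append(weight / remaining)
        remaining -= weight
    return probabilities


# hypothetical backends: three local ports that should share traffic 2:1:1
ports = [8081, 8082, 8083]
for port, p in zip(ports, statistic_probabilities([2, 1, 1])):
    print(
        "iptables -t nat -A PREROUTING -p tcp --dport 8000 "
        f"-m statistic --mode random --probability {p:.4f} "
        f"-j DNAT --to-destination :{port}"
    )
```

The last rule always ends up with probability 1, so it catches everything the earlier rules let through.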

Simulation and testing

# create a container to simulate the existing application
docker run -d --rm --network host --name app-v0.1 hashicorp/http-echo -listen=:8000 -text="app-v0.1"

# create an intermediate container with a newer version
# this container will listen on a different port
docker run -d --rm --network host --name app-v0.2-tmp hashicorp/http-echo -listen=:8081 -text="app-v0.2-tmp"

# create a temporary IPtables rule to reroute the traffic
# to the intermediate container
iptables -t nat -I PREROUTING -p tcp -m tcp --dport 8000 -j DNAT --to-destination :8081

Now one can wait until all connections to app-v0.1 are finished.

conntrack -L -p tcp --dport 8000
tcp      6 431975 ESTABLISHED src=10.82.3.224 dst=10.82.3.224 sport=43014 dport=8000 src=10.82.3.224 dst=10.82.3.224 sport=8000 dport=43014 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 use=1
tcp      6 84 TIME_WAIT src=10.82.3.224 dst=10.82.3.224 sport=60244 dport=8000 src=10.82.3.224 dst=10.82.3.224 sport=8000 dport=60244 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 use=1
tcp      6 71 TIME_WAIT src=10.82.3.224 dst=10.82.3.224 sport=47082 dport=8000 src=127.0.0.1 dst=10.82.3.224 sport=8081 dport=47082 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 use=1
conntrack v1.4.7 (conntrack-tools): 2 flow entries have been shown.

Be careful about the value of net.netfilter.nf_conntrack_tcp_timeout_established, which tells conntrack when to forget about idle established connections. The default is 432000 seconds (5 days).
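Waiting for the old connections can also be scripted. The following Python sketch counts the ESTABLISHED flows that are still answered from the old port, based on the text format of the conntrack output above. The function name is made up, and the sample lines are shortened copies of that output; in a real script the text would come from running conntrack -L.

```python
import re


def established_to_port(conntrack_output: str, port: int) -> int:
    """Count ESTABLISHED flows whose reply tuple still comes from `port`."""
    count = 0
    for line in conntrack_output.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[0] != "tcp" or fields[3] != "ESTABLISHED":
            continue
        # Each flow lists sport= twice: original direction first, reply second.
        sports = re.findall(r"\bsport=(\d+)", line)
        if len(sports) == 2 and int(sports[1]) == port:
            count += 1
    return count


sample = """\
tcp      6 431975 ESTABLISHED src=10.82.3.224 dst=10.82.3.224 sport=43014 dport=8000 src=10.82.3.224 dst=10.82.3.224 sport=8000 dport=43014 [ASSURED] mark=0 use=1
tcp      6 71 TIME_WAIT src=10.82.3.224 dst=10.82.3.224 sport=47082 dport=8000 src=127.0.0.1 dst=10.82.3.224 sport=8081 dport=47082 [ASSURED] mark=0 use=1
"""
print(established_to_port(sample, 8000))
```

Looking at the reply tuple instead of the original destination port is what separates connections still served by the old container from those already rewritten by the DNAT rule.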

When all connections to the old app-v0.1 are gone, we stop it, create a new container for permanent usage and remove the DNAT rule. Then we gracefully shut down app-v0.2-tmp in the same way and tear down the intermediate container.

# create a new permanent container with the updated version
docker run -d --rm --network host --name app-v0.2 hashicorp/http-echo -listen=:8000 -text="app-v0.2"

# remove the temporary DNAT rule
iptables -t nat -D PREROUTING -p tcp -m tcp --dport 8000 -j DNAT --to-destination :8081

Et voilà, we upgraded to a new software version without a load balancer.

Footnotes

If you want to test this on localhost you have to use the following IPtables rule, since traffic on the lo interface does not go through the PREROUTING chain:

iptables -t nat -A OUTPUT -p tcp -o lo --dport 8000 -j REDIRECT --to-ports 8081

Useful links

Loadbalancing TCP connections in the Linux kernel

Wed, 20 Dec 2023 16:36:34 +0100

While looking for a way to make a Linux application suitable for rolling upgrades, I searched for a way to take over an already bound TCP port. I realized that some versions of netcat are able to do that and that it also worked when I was using Docker’s port forwarding feature. So I started to investigate how this works and what one could make of it.

SO_REUSEPORT

Apparently there has been a kernel feature in Linux since version 3.9 meant to handle exactly that kind of problem: multiple applications or threads listening on the same address:port combination. If the socket option SO_REUSEPORT is set before the application binds, other processes with the same UID can attach to the same port. A lot of software supports this, usually to spread the work over multiple processes. One example is the mpm module of the Apache webserver.
SO_REUSEPORT must not be confused with the SO_REUSEADDR option. An excellent explanation of their differences and usage in other operating systems can be found on StackOverflow.
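A minimal Python sketch of the mechanism, assuming a Linux system: two listening sockets are bound to the same address:port combination, which only works because both set SO_REUSEPORT before calling bind(). The helper name is made up for this example.

```python
import socket


def reuseport_listener(host: str, port: int) -> socket.socket:
    """Create a listening TCP socket with SO_REUSEPORT set before bind()."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # The option has to be set on every socket of the group, before bind().
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    sock.bind((host, port))
    sock.listen()
    return sock


first = reuseport_listener("127.0.0.1", 0)  # port 0: let the kernel pick one
port = first.getsockname()[1]
second = reuseport_listener("127.0.0.1", port)  # joins the same group
print(port, second.getsockname()[1])
first.close()
second.close()
```

Without the setsockopt() call the second bind() would fail with "Address already in use".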

Load Balancing

Now there is one problem with reusable ports. All the sockets that bind to the same address:port combination form a group and the kernel spreads the incoming connections across all of them. See the socket(7) man page:

For TCP sockets, this option allows accept(2) load distribution in a multi-threaded server to be improved by using a distinct listener socket for each thread. This provides improved load distribution as compared to traditional techniques such using a single accept(2)ing thread that distributes connections, or having multiple threads that compete to accept(2) from the same socket.

Yeah, this makes sense for a high-performance webserver, but not when I want to hand over incoming connections to a new process.

Some tests

To test how this behaves I created an Apache config with Listen 80 reuseport and ListenCoresBucketsRatio 2. With this config Apache enables the SO_REUSEPORT option on its socket. Sadly there is no easy way in Linux to show all the options of a socket in human-readable form.
I used knetstat to verify that Apache is working correctly.

$ cat /proc/net/tcpstat 
Recv-Q Send-Q Local Address           Foreign Address         Stat Diag Options
     0      0 0.0.0.0:80              0.0.0.0:*               LSTN      SO_REUSEPORT=1,SO_REUSEADDR=1,SO_KEEPALIVE=0,TCP_NODELAY=0

My idea was to start two of those Apaches and simulate a fault state in one and see how the incoming connections will spread. Therefore I ran the following things as a basic simulation:

# launch two Apaches, one after the other
# (docker options must come before the image name, bind mounts need an absolute path)
docker run -d --name test1 --network host -v "$(pwd)/apache-test.conf":/etc/apache2/apache2.conf httpd:bookworm
docker run -d --name test2 --network host -v "$(pwd)/apache-test.conf":/etc/apache2/apache2.conf httpd:bookworm

# then get the PIDs from one of the containers
ps auxf | grep apache

# and pause all the processes to prevent them from answering on their socket
for i in $(seq 6031 6062); do kill -STOP $i; done

# then test the connection distribution with
for i in $(seq 1 100); do timeout 1 curl https://localhost -k -s -o /dev/null && echo worked || echo failed; done | sort | uniq -c

The result looks something like this:

   46 worked
   54 failed

So in the end I realized this is not really what I was searching for. I wanted a smooth handover of a bound socket to a new one and SO_REUSEPORT is more for balancing connections between multiple processes serving the same content for performance optimization.

eBPF as usual

However, I realized afterwards that one could write a (e)BPF program and attach it to the group of sockets that are using the same port. There one has full control over how the connections are distributed, although it’s not straightforward due to possible socket reordering. The nice thing, though, is that one can replace that program at runtime without restarting the actual serving application. The socket(7) man page says:

For use with the SO_REUSEPORT option, these options allow the user to set a classic BPF (SO_ATTACH_REUSEPORT_CBPF) or an extended BPF (SO_ATTACH_REUSEPORT_EBPF) program which defines how packets are assigned to the sockets in the reuseport group. … Sockets are numbered in the order in which they are added to the group (that is, the order of bind(2) calls for UDP sockets or the order of listen(2) calls for TCP sockets). New sockets added to a reuseport group will inherit the BPF program. When a socket is removed from a reuseport group (via close(2)), the last socket in the group will be moved into the closed socket’s position.

This is good to know but still doesn’t solve my problem %) I’ll probably write about an alternative solution soon.

EDIT: There is also an example available of how to create Hot standby load balancing with SO_REUSEPORT and eBPF by Hemanth Malla.

Linux Block I/O debugging

Sat, 14 Oct 2023 21:16:18 +0200

By chance I found the blog and videos of Tanel Poder. The content is excellent if you want to get into low-level I/O debugging on Linux. I especially recommend the following blogposts with their corresponding videos: https://tanelpoder.com/posts/high-performance-block-io-on-linux/ https://tanelpoder.com/posts/11m-iops-with-10-ssds-on-amd-threadripper-pro-workstation/

Inspired by the detail of his analysis I started to dig around in Linux block I/O for a while. In this blogpost I collect some useful infos and tools for Linux storage debugging mostly as a notepad for myself. May it be helpful for you too.

Tools

bcc

The BPF Compiler Collection has a huge set of tools to sniff and snoop on various points of userland/kernel interaction. No matter if you want to debug TCP connections, memory management or block I/O you’ll find something here. The collection is available on most distros as bcc-tools but the scripts are often not in $PATH but must be called manually from /usr/share/bcc/tools/. All the tools use small BPF programs which are loaded into the kernel to probe a function or event and collect information about it. If you’re interested in how to do that check out the cool bcc Python Developer Tutorial.

Intel’s mlc: Memory Latency Checker

The Memory Latency Checker is useful to check the throughput and latency from CPU to RAM. Especially with NUMA systems this can be of great interest to find bottlenecks as the throughput of a CPU might be capped if the memory latency is too high.
See also numactl and lstopo.

lstopo

When it comes to the physical layout of a machine there are tools like lshw or lspci which can give good information. Additionally the command lstopo --of ascii from the package hwloc can draw a picture in a terminal of the hardware topology.

0x.tools

The tools on 0x.tools are interesting for Linux application debugging. While I did not use them often yet, they contain good examples of how to use perf to get a tree visualization of where in the kernel code the CPU spends most time.

perf record -g -F 2 -a -o perf_log
perf report -i perf_log

dstat

dstat is the successor of well-known tools like vmstat, iostat and ifstat. It aims to unify the interface, make usage easier and add more information. See https://linux.die.net/man/1/dstat for a full manual. For debugging storage I often use dstat -pcmrd to see IOPS and throughput like this:

[root@test /tmp]# dstat -pcmrd
---procs--- ----total-usage---- ------memory-usage----- --io/total- -dsk/total-
run blk new|usr sys idl wai stl| used  free  buf   cach| read  writ| read  writ
1.0   0    |                   | 594M  257M 2172k 2670M|           |
1.0   0   0|  1  55  42   0   0| 594M  185M 2172k 2742M|   0   184 |   0   120M
1.0   0   0|  1  57  41   0   0| 590M  131M 2172k 2801M|   0   175 |   0   164M
1.0   0   0|  1  57  41   0   0| 587M  121M 2172k 2812M|1.00   195 |4096B  164M
1.0   0 3.0|  1  57  40   0   0| 584M  111M 2172k 2826M|1.00   234 |4095B  160M
1.0   0   0|  3  57  41   0   0| 582M  104M 2172k 2836M|   0   222 |   0   148M
1.0   0   0|  0  62  30   6   0| 578M  109M 2172k 2835M|   0   460 |   0   365M

Here one can see that one process writes rather large I/Os, which keeps more than half a CPU busy with kernel code.

Typical issues

Kernel splits I/O operations

Even though an application submits I/O with a specific blocksize to the kernel, the operations might get split by the block layer before they get to the actual disks. I still don’t know exactly when this happens, but I guess it’s to align the operations with the physical blocksizes. Anyway, it might change the result of your test, especially if you sample throughput or IOPS with different blocksizes.

A good way to analyse this is the tool bitesize from BCC.

[root@test]# ./bitesize
Tracing block I/O... Hit Ctrl-C to end. ^C
Process Name = dd
     Kbytes              : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 1000     |*****                                   |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 1000     |*****                                   |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 3000     |*****************                       |
       512 -> 1023       : 3000     |*****************                       |
      1024 -> 2047       : 7000     |****************************************|

Another method is to test with a defined blocksize and check the number of IOPS done. This of course is more difficult on busy systems. The following script iterates over different blocksizes and displays the IOPS with dstat.

#!/bin/bash

# stop the whole loop on Ctrl-C
trap "echo exiting...; exit" SIGINT

IOSIZES=${1:-512 1024 2048 4k 8k 16k 32k 64k 512k 1M 4M}

for i in ${IOSIZES}; do
        echo "testing with blocksize of $i"
        echo "-----------------------------"
        # write with direct I/O in the background, bypassing the page cache
        dd oflag=direct if=/dev/urandom of=/test.img bs=$i &
        # watch IOPS and throughput for 5 seconds
        timeout 5 dstat -pcmrd
        killall -9 dd
        echo
done
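To interpret the numbers, the throughput can be divided by the IOPS to get the average size of an I/O as it arrives at the disk. If this is clearly smaller than the blocksize passed to dd, the operations are being split on the way. A small Python sketch of that arithmetic, assuming that dstat’s k/M/G suffixes are powers of 1024; the function names are made up for this example.

```python
UNITS = {"B": 1, "k": 1024, "M": 1024**2, "G": 1024**3}


def parse_size(text: str) -> int:
    """Parse a dstat-style size such as '164M' or '4096B' into bytes."""
    return int(float(text[:-1]) * UNITS[text[-1]])


def average_io_size(throughput: str, iops: float) -> float:
    """Average bytes per I/O within one sampling interval."""
    return parse_size(throughput) / iops


# values from one row of the dstat sample in the Tools section:
# 175 writes and 164M written per second
size = average_io_size("164M", 175)
print(f"{size / 1024:.0f} KiB per write")
```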

LAD23

Wed, 04 Oct 2023 15:59:49 +0200

In October 2023 I attended LAD23 in Bordeaux: the Lustre Administration and Development workshop. I was working on a ~15 PB (raw capacity) storage system for a university, which was based on the Lustre filesystem, and my colleague and I wanted to gather some information since neither of us had any experience with Lustre yet. It was also my first in-person conference since FOSDEM20.

Conference and Organization

The conference was organized by EOFS and took place in the InterContinental Hotel in the city centre of Bordeaux. The social event on the first evening was at Château Pape Clément in the vineyards beside the city. Both places were a bit posh but served good wine and catering and featured nice decor.

Lustre News & Roadmap

A big part of the conference is the current state of Lustre development. Traditionally the first talk is a presentation of the Lustre roadmap and upcoming features. Most interesting for us is the Erasure Coding functionality, which is planned to be available in Lustre 2.17, expected at the end of 2024. This allows a user to get data redundancy without using a RAID configuration below the Lustre system, resulting in more flexibility.

Many of the following talks were Lustre developers presenting their work and users explaining how they found bugs and how to fix them. Here is a list of the things I found most interesting, which is of course just an excerpt:

Lustre & Kerberos: A lot of work is being done here, mainly to allow the integration of Lustre in enterprise/cloud environments with different users and multi-tenancy support. With Kerberos-based authentication it is also possible to prevent unauthorized Lustre servers from joining the cluster. Another problem that can be addressed is the fact that Lustre trusts the root user on all clients by default: if an attacker becomes root on one client, he can read all data. (This problem can also be partially mitigated by using nodemap.) The whole Kerberos-related code got a major rewrite and a lot of bugfixes, coming in Lustre 2.16.
Other people are working on client-side data compression, which compresses data directly on the client before it is sent over the network. This solution does not depend on ZFS compression on the server, which is a common use case today when Lustre is used with ZFS. The main question about this was whether the OSD is able to decompress the stored data and then serve only the parts that were requested by the client, rather than the whole compressed block, which could lead to traffic amplification.
One of the most impressive announced features was Unaligned Direct I/O: from 2.16 on, Lustre tries to combine the advantages of buffered I/O and direct I/O. It automatically uses direct I/O for bigger files and the page cache for smaller files. The alignment for bigger files is done using an aligned buffer in kernel memory, which is way faster than using a cache and only a little slower than direct I/O. This may increase throughput drastically without the need to change the user application.
lljobstat is a new tool to debug slow I/O on MDTs/OSTs, which has proven very useful according to the developers.
Better fscrypt support to move data without knowing the key, which had some pitfalls before. This will even be released within Lustre 2.15.
The Lustre filesystem is also being prepared for larger devices, with up to 1.5 PB per OST.
And support for hybrid setups (NVMe+HDD) in ldiskfs is being evaluated.

Robin Hood Filesystem Utilities

I had never heard of the robinhood software suite prior to this event. Librobinhood is a C API to store and query any filesystem’s metadata in an efficient way. This is usually done by loading the metadata into a MongoDB database for searchability. The tools support POSIX, Lustre and MongoDB as backends and allow fast searching, filtering and changing of metadata.

What I really liked in their talk this year is that they implemented an expiry date for files, which currently works only on the Lustre backend, but general POSIX support is planned. The date is saved as a user-visible extended attribute and can later be found with the robinhood find tool. Besides that, the tool can also use complex filters or search for other extended attributes.

Random notes

Here are some further random notes and learnings I took during the conference.

LNet Network Selection Policy (UDSP) allows setting the priority of LNet links. This can result in equal/weighted load balancing between LNet devices. It’s also possible to use specific links only if no other is available (fallback). See lnetctl set heath and lnetctl set priority
The Linux kernel may split e.g. a 16M I/O operation into smaller I/Os. One can use blktrace or perf to check whether the submitted I/Os are the same as what is actually sent to disk.
One result from a benchmark analysis talk of HPE: ZFS compression underneath Lustre is worth it and doubled throughput (in Benchmarks, at best case).

Personally, I realized two things during the conversations at this conference, which was my first HPC-related one. First, after getting used to HPC workloads everything else seems too small in size. People here are talking about an 800 TiB NVMe + 1.5 PiB HDD storage as the “playground” where their users can test things, and it’s rather common that people have to move 10-50 PiB from one storage to another.

Secondly, benchmarking is an important part of storage engineering. Besides fio, which I knew, there are also ior, io500 and mdtest. Benchmarking should be a standard thing to do for a storage engineer, both to get a clear idea of what to expect from the hardware beforehand and to spot misconfiguration during production use.

As Sergey Kachkin said: “Storage is developed with benchmarks, tested with benchmarks, sold with benchmarks, only users actually have workloads”

Routing a specific port through a wireguard VPN

Mon, 18 Sep 2023 14:51:16 +0200

Yesterday a friend asked me if it is possible to route the outgoing traffic of a Linux machine with a specific destination port via a wireguard VPN. He used a publicly accessible proxy host to forward SMTP (port TCP/25) via this VPN to a machine in his home network, and now he wanted the mailserver to also push all the outgoing mail through this VPN connection to the proxy. However, no mail relay should be involved.

My idea was to mark all outgoing traffic with an IPtables rule and use a separate routing table to send it through the VPN instead of the default gateway of the home network (policy based routing). After some research I found out that it is actually possible to do this with NFtables based systems. Although the initial routing decision for locally generated packets is made in Linux before one is able to alter them via firewall rules, there is a possibility to change their route afterwards.

As the following flowchart shows, there is a “reroute check” after the OUTPUT chains of all tables and after that you can still use the POSTROUTING chains.

Netfilter Packet Flow image, published on Wikipedia, CC BY-SA 3.0

That means we first use the following firewall rules to mark all the outgoing SMTP traffic. The number in the firewall mark is arbitrarily chosen.

iptables -t mangle -I PREROUTING 1 -j CONNMARK --restore-mark
iptables -t mangle -I OUTPUT 1 -p tcp --dport 25 -j MARK --set-mark 0x25
iptables -t mangle -I OUTPUT 2 -j CONNMARK --save-mark
ip6tables -t mangle -I PREROUTING 1 -j CONNMARK --restore-mark
ip6tables -t mangle -I OUTPUT 1 -p tcp --dport 25 -j MARK --set-mark 0x25
ip6tables -t mangle -I OUTPUT 2 -j CONNMARK --save-mark

Then we add a new routing table and add a single default route via the VPN to it; this works because wireguard interfaces are layer 3 only, so there is no need for a gateway. The table number is also arbitrarily chosen.

ip route add default dev wg0 table 25
ip -6 route add default dev wg0 table 25
# optional: naming the routing table (append, don't overwrite the file)
echo "25  smtp" >> /etc/iproute2/rt_tables

Now we force the marked traffic to be routed with that newly created table with an IP rule. This all happens during the “reroute check” after the traffic has been marked.

ip rule add from all fwmark 0x25 lookup 25
ip -6 rule add from all fwmark 0x25 lookup 25

The packet would now be sent via the wireguard interface but would still have the source address that was selected in the first routing decision. Therefore we need to change the packet again in the POSTROUTING chain of the nat table and “masquerade” it with the correct source address. This isn’t a NAT event yet, since the packet has been created locally and is sent on this interface for the first time.

iptables -t nat -A POSTROUTING -m mark --mark 0x25 -j MASQUERADE
ip6tables -t nat -A POSTROUTING -m mark --mark 0x25 -j MASQUERADE
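Since the port, the firewall mark, the table number and the interface appear in several places, the steps above can be summarized in a small generator. This is just a sketch that prints the commands instead of running them; the function name is made up, and the MASQUERADE rule is written with an explicit -t nat because that target only exists in the nat table.

```python
def policy_routing_commands(port: int, mark: int, table: int, dev: str):
    """Return the marking, routing and masquerading commands for IPv4 and IPv6."""
    commands = []
    for ipt, ip in (("iptables", "ip"), ("ip6tables", "ip -6")):
        commands += [
            f"{ipt} -t mangle -I PREROUTING 1 -j CONNMARK --restore-mark",
            f"{ipt} -t mangle -I OUTPUT 1 -p tcp --dport {port} -j MARK --set-mark {mark:#x}",
            f"{ipt} -t mangle -I OUTPUT 2 -j CONNMARK --save-mark",
            f"{ip} route add default dev {dev} table {table}",
            f"{ip} rule add from all fwmark {mark:#x} lookup {table}",
            f"{ipt} -t nat -A POSTROUTING -m mark --mark {mark:#x} -j MASQUERADE",
        ]
    return commands


for command in policy_routing_commands(port=25, mark=0x25, table=25, dev="wg0"):
    print(command)
```

Printing the commands first makes it easy to review them before pasting them into a root shell.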

Now we can test if a TCP connection to port 25 is actually going via the VPN:

tcpdump -nni wg0 port 25 &
nc -vz4 gmail-smtp-in.l.google.com. 25
nc -vz6 gmail-smtp-in.l.google.com. 25

If you see at least a SYN packet on the VPN interface, it works, although the reverse path might not work yet. To allow the machine to receive answers for this connection you must first set reverse path filtering in Linux to loose (2) or turn it off completely (0). See also the parameter docs on sysctl-explorer.net or kernel.org.

sysctl -w net.ipv4.conf.wg0.rp_filter=2
# or
sysctl -w net.ipv4.conf.all.rp_filter=0

Also, if the VPN interface does not have a public address, forwarding and NAT must be configured on the proxy host.

Remove passwords from Git repository

Wed, 19 Apr 2023 19:18:47 +0200

When it comes to making code or config publicly available as open source, one always has to make sure that the repo doesn’t contain any sensitive information.

To remove stuff like passwords from various files in all commits I use bfg.

First I clone a single branch from a local repo which should be adjusted for publication. Although I use a separate branch, I also use a separate git directory, because passwords will be cleaned from all refs/branches, not just the current one.

git clone --single-branch --branch main file://$(pwd)/repo/ repo-public-bfg/

Then I search for passwords which are not comments:

git grep -Eih '(pass|password|auth)' | grep -v '^[[:space:]]*;'

Then all the passwords are written to a file, e.g. passwords.txt. It can contain simple strings, one per line, which will be replaced by ***REMOVED***, or you can define a search/replace combination using ==> as delimiter. Regexes are also possible for matching. See also this example.

password
"password"
=password==>=__redacted__
password2==><place_password_here>
regex:password=[0-9]+==>password=
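To preview what such a file would do before rewriting the history, the rules can be approximated in a few lines of Python. This sketch only mimics the format described above and is not how bfg itself is implemented; for example it does not support backreferences in the replacement. The function names and the sample strings are made up.

```python
import re


def parse_rule(line: str):
    """Turn one line of the replacement file into (pattern, replacement)."""
    match_text, sep, replacement = line.partition("==>")
    if not sep:
        # no explicit replacement given: use the default marker
        replacement = "***REMOVED***"
    if match_text.startswith("regex:"):
        pattern = re.compile(match_text[len("regex:"):])
    else:
        pattern = re.compile(re.escape(match_text))
    return pattern, replacement


def preview(text: str, rule_lines):
    """Apply all rules in order and return the redacted text."""
    for line in rule_lines:
        if not line:
            continue
        pattern, replacement = parse_rule(line)
        # a function replacement keeps the replacement text literal
        text = pattern.sub(lambda _m, r=replacement: r, text)
    return text


rules = ["hunter2", "regex:password=[0-9]+==>password="]
print(preview("password=12345 token=hunter2", rules))
```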

Once everything is ready bfg does the actual work and git can do some cleanup:

bfg --replace-text ../passwords.txt . --no-blob-protection --filter-content-excluding "*.jpg"
git reflog expire --expire=now --all && git gc --prune=now --aggressive

--no-blob-protection removes the passwords from all commits even the current one.
With --filter-content-excluding you can exclude files that shouldn’t be altered.

Afterwards all the passwords listed in passwords.txt will be redacted in all commits! To track what has changed and to prevent data loss, the original passwords are still available as staged changes. You can view them with git diff --staged or drop them with git stash; git stash drop

Toronto ||

Fri, 30 Dec 2022 10:54:05 +0100

Together with a friend, I found some old loudspeaker boxes in the storage of the CCC which contained the material of the Blinkenlights project. Both of us took one of them since they are beautiful, have the perfect format for a bluetooth sound box and include a nice rotary control element for volume. My friend guessed these speakers were used as a monitor system in dressing rooms and he started to replace the speaker inside with a stereo system with bluetooth.

Inspired by his work and blogpost I decided to build my own, but I chose to use a mono speaker and amplifier combined with a bluetooth receiver and 3 Li-Ion 18650 cells.

Referring to his I just called mine „Toronto ||“

Mikrotik router as Wireguard VPN gateway

Fri, 11 Nov 2022 02:09:04 +0200

In my new apartment I luckily have a fibre connection (FTTH) which terminates in the living room. My first excitement was curbed when I investigated the technology in use. The ISP uses a technique called GPON which is probably smart from an economic point of view but also enables some possible (sniffing) attacks. Additionally the provider seems to do some filtering which is not just DNS blocking. Long story short, I decided to route all traffic through a VPN and so I began to configure a Mikrotik hex POE router to use a Njalla VPN.

Configuring the router

This is the first time I actually use a Mikrotik router in a non-default config, so it took quite a while for me to figure out how everything fits together. The requirements are:

DHCP Client on Uplink port
working Wireguard tunnel
default route through the wireguard tunnel
forbid traffic to go directly to the uplink (eg. on Wireguard connection loss)
DHCP server on all other interfaces
working DNS resolver on the router
working IPv6

DHCP Client + LAN Bridge

First we set up the standard behaviour: use the ether1 port as uplink with a DHCP client and bridge the remaining interfaces together.

/interface list member
add comment=uplink interface=ether1 list=WAN
/ip dhcp-client
add add-default-route=no comment=uplink interface=ether1 use-peer-dns=no \
    use-peer-ntp=no

/interface bridge
add admin-mac=AA:BB:CC:DD:EE:FF auto-mac=no comment=local-bridge name=bridge \
    protocol-mode=none
/interface bridge port
add bridge=bridge comment=defconf interface=ether2
add bridge=bridge comment=defconf interface=ether3
add bridge=bridge comment=defconf interface=ether4
add bridge=bridge comment=defconf interface=ether5
add bridge=bridge comment=defconf interface=sfp1

Be aware that we do not want to automatically set a default route to the gateway of the uplink network (add-default-route=no) since we want all traffic to go through the VPN at all times.

DHCPv4 Server

DHCPv4 is also a very common configuration which, I think, does not need further explanation. The hex POE router has DHCPv4 enabled on its LAN bridge by default. Anyway, here is a configuration example with one static lease:

/ip pool
add name=domainmess-home-dhcp-range ranges=192.168.0.100-192.168.0.200
/ip dhcp-server
add address-pool=domainmess-home-dhcp-range authoritative=yes disabled=no interface=bridge \
    lease-script="" lease-time=10m name=domainmess-home-dhcp-server use-radius=no
/ip dhcp-server lease
add address=192.168.0.200 client-id=\
    aa:bb:cc:ee:ff:aa:bb:cc:ee:ff:aa:bb:cc:ee:ff:aa:bb:cc:ee:ff mac-address=\
    AA:BB:CC:DD:EE:FF server=domainmess-home-dhcp-server

DHCPv6 Server

DHCPv6 isn’t really the right term as the Mikrotik RouterOS does prefix delegation. That means it announces a prefix which the client can request and might use for further delegation or address selection. I use it to assign a /64 network to each client, which will then automatically configure one address from that network.

/ipv6 pool
add name=domainmess-home-ipv6-ula prefix=fc00:99ff:4400::/32 prefix-length=64
/ipv6 dhcp-server
add address-pool=domainmess-home-ipv6-ula dhcp-option="" disabled=no interface=bridge lease-time=3d \
    name=domainmess-home-dhcpv6-pd preference=255 rapid-commit=yes route-distance=1 \
    use-radius=no
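
To illustrate what prefix-length=64 means for a /32 pool, here is a small Python sketch which carves a pool into /64 prefixes. The pool address is a hypothetical ULA prefix picked only for this example, and the order in which RouterOS actually hands out prefixes is not modelled.

```python
import ipaddress
from itertools import islice

# Hypothetical ULA pool, chosen only for this illustration.
pool = ipaddress.ip_network("fd12:3456::/32")


def first_delegations(pool, prefix_length=64, count=3):
    """Return the first `count` prefixes of size `prefix_length` in the pool."""
    return list(islice(pool.subnets(new_prefix=prefix_length), count))


if __name__ == "__main__":
    for prefix in first_delegations(pool):
        print(prefix)
    # number of /64 prefixes that fit into the pool
    print(2 ** (64 - pool.prefixlen))
```

A /32 pool therefore offers far more /64 prefixes than a home network will ever need.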

Wireguard

Wireguard is new in RouterOS 7 and is integrated quite well into the Mikrotik config. We define an interface of the type „Wireguard“ and at least one peer which we want to connect to. In our case that is the Njalla VPN server. The other settings come from the Njalla config, but the important one is that we set 0.0.0.0/0 and ::/0 in the AllowedIPs so that we can route all internet traffic through it and not just a specific subnet.

# keys are just examples
/interface wireguard
add disabled=no listen-port=13231 mtu=1280 name=njalla01 private-key=\
    "gDV016J81d9dWkkw7j9MjDcnmt0HhQHnsG7favtiJU8="

/interface wireguard peers
add allowed-address=0.0.0.0/0,::/0 disabled=no endpoint-address=\
    wg235.njalla.no endpoint-port=51820 interface=njalla01 \
    persistent-keepalive=25s public-key=\
    "YAV6HyThvZgLTYD6nDIeMeUNKgwnp6jnLdiYYNb9Bnc="

Hint: Using a hostname instead of an IP as EndPoint is supported since RouterOS 7.6
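
Why 0.0.0.0/0 and ::/0 matter can be shown with Python's ipaddress module: a destination may only be sent to a peer if it falls into one of the peer's allowed networks. This is a simplified sketch of that check with a made-up function name, not Wireguard's actual implementation.

```python
import ipaddress


def allowed(destination, allowed_ips):
    """True if `destination` falls into one of the AllowedIPs networks."""
    address = ipaddress.ip_address(destination)
    networks = [ipaddress.ip_network(entry) for entry in allowed_ips]
    return any(
        address.version == network.version and address in network
        for network in networks
    )


if __name__ == "__main__":
    everything = ["0.0.0.0/0", "::/0"]
    print(allowed("9.9.9.9", everything))
    print(allowed("2001:db8::1", everything))
    # with only a specific subnet, other destinations would be rejected
    print(allowed("9.9.9.9", ["10.0.0.0/8"]))
```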

Routes + DNS

At this point Wireguard isn’t able to connect to its remote, since there is no working routing and DNS. Setting DNS servers is trivial, but we also need to route all encrypted Wireguard traffic to the Njalla network via our uplink port. The same goes for at least one DNS server so that Wireguard can look up its remote.

Be aware: This will make at least some of your DNS requests bypass the VPN and be subject to censorship or monitoring. To prevent that, use a static IP as Wireguard remote!

As default gateway we then set the Wireguard interface instead of an IP address. That is possible since Wireguard is an IP protocol (Layer 3) and creates a TUN device which has no need for an Ethernet header with a gateway MAC address inside it. The traffic must be routed anyway on the other side of the tunnel (on the Njalla routers) and we can therefore just send IP packets into it.

# route to njalla network via uplink
/ip route
add disabled=no distance=1 dst-address=198.167.192.0/19 \
    gateway=[/ip dhcp-client get [find interface=ether1] gateway] \
    pref-src="" routing-table=main suppress-hw-offload=no
# default route via njalla VPN
add disabled=no distance=1 dst-address=0.0.0.0/0 gateway=njalla01 pref-src="" \
    routing-table=main suppress-hw-offload=no
# DNS bypass
add disabled=no distance=1 dst-address=9.9.9.9/32 gateway=192.168.1.254 \
    pref-src="" routing-table=main suppress-hw-offload=no
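
These three routes work together because the most specific matching prefix wins. The following Python sketch models that lookup for the routes above. It is a simplification which ignores distance and routing tables, and "uplink" is just a label standing for the gateway of the uplink network.

```python
import ipaddress

# Routes from the config above; "uplink" stands for the gateway on ether1.
ROUTES = [
    ("198.167.192.0/19", "uplink"),
    ("0.0.0.0/0", "njalla01"),
    ("9.9.9.9/32", "uplink"),
]


def next_hop(destination, routes=ROUTES):
    """Pick the gateway of the longest prefix that contains the destination."""
    address = ipaddress.ip_address(destination)
    matching = [
        (ipaddress.ip_network(prefix), gateway)
        for prefix, gateway in routes
        if address in ipaddress.ip_network(prefix)
    ]
    # the most specific (longest) prefix wins
    network, gateway = max(matching, key=lambda item: item[0].prefixlen)
    return gateway


if __name__ == "__main__":
    for target in ("198.167.200.1", "9.9.9.9", "1.1.1.1"):
        print(target, "->", next_hop(target))
```

Only the VPN endpoint network and the one DNS server leave via the uplink, everything else goes into the tunnel.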

IPv4 NAT

Like on more or less every home router, IPv4 source NAT is used so that all outgoing traffic uses the IPv4 VPN address. This is also part of the default conf of many Mikrotik routers.

/ip firewall nat
add action=masquerade chain=srcnat comment="defconf: masquerade" \
    ipsec-policy=out,none out-interface-list=WAN

IPv6 NAT

This is definitely the ugliest part. Since Njalla only gives one IPv6 address per VPN connection we have to set up an IPv6 source NAT (yes, that is possible, I didn’t believe that either…)

/ipv6 firewall nat
add action=masquerade chain=srcnat ipsec-policy=out,none out-interface=\
    njalla01

Works as designed and is better than no IPv6 at all!

Configuring the clients

To use the IPv6 prefix delegation we configured before in the DHCPv6 Server section, we must also ensure that the client is asking for it. For systemd-networkd this is done by adding the following lines to the interface configuration:

[Network]
DHCP=yes
DHCPPrefixDelegation=yes

Additional learnings

During the whole configuration I learned some related things which came in quite handy for using and debugging Mikrotik routers.

Find IPv6 hosts in network

To find IPv6 hosts in the local network one doesn’t need nmap. Instead one can use multicast addresses to which specific hosts will answer. For example, to find all routers you can just ping the all-routers multicast address:

$ ping -6 ff02::2%wlan0

This way you can find a router even if no DHCP is working in your LAN or if the router has no static IP due to some configuration errors.

Use Link-Local addresses

Of course you can also use the Link-Local address to access your router. Well, at least you can if you don’t want to use the web interface, because I couldn’t find any browser that properly supports Link-Local addresses. That is not a new phenomenon and is documented elsewhere. However, accessing the router via SSH via its Link-Local address even works if it has been rebooted without any working network config.

Ansible Role: Proxmox VM Setup

Sun, 18 Sep 2022 16:50:34 +0200

In my current job we use Proxmox clusters as VM hosts and a lot of Debian VMs on them. Most of the VMs are old and handcrafted, but for some of them, like Jenkins agents and Gitlab runners, I created Ansible playbooks to configure them. Of course this is rather useless if there is no way to automatically spawn the whole VM before configuring it, and so I started to create a role which creates the VM if it doesn’t exist yet.

The basic concept is simple. On every physical host there is a template VM which contains an untouched Debian Cloud image. If the VM doesn’t exist, this template is cloned and the VM parameters are set via the Proxmox API. The initial config which is necessary for Ansible to ssh into the VM later is passed via cloud-init.

The role follows the same basic concept as my systemd-nspawn container role: all the parameters of the created machines shall be defined in a host_vars/name.yml looking like this:

---
physical_host: kvm-server-99
vm_cores: 32
vm_memory: 32768
vm_disksize: "20G"
vm_template: "debian11-cloudinit-template"

With that set it is enough to call the role as first task of the playbook and the VM will be spawned.

Over time the role became more complex so I decided to release it on Github to share it with the community.
https://github.com/benibr/ansible-role-proxmox-vm-setup

Cosmic Cable Ltd.

Wed, 31 Aug 2022 09:26:46 +0200

In 2016 and 2022 I participated in a project called “Cosmic Cable Ltd.”, a fictional company which provides telephone booths at Fusion Festival. This was one of the most varied and beautiful projects I’ve ever done as it combines work with metal and wood, art and decoration with a technical stack from old dial phones to modern VoIP networking. As I love doing all of this at once, working in a small group of friends in a weird techno festival environment fits me just perfectly.

About Cosmic Cable Ltd.

Cosmic Cable is the lost cause of a post-bureaucratic telephone network provider mostly run by a crew of rogue technicians. Founded in 2016 as an alternative to the bad mobile network at Lärz, Germany, it provides the most stylish telephone booths at both Fusion Festival and at.tension Festival to connect guests and crew. The telephone network is complemented by additional well-known Cosmic Cable services like the “Status Box”, which provides asynchronous communication for guests, as well as yearly changing entertainment options.

Website: http://cosmic.cable.limited
Github Account: https://github.com/cosmiccableltd/

Technical Setup

Within the Cosmic Cable network everything starts with a classic rotary dial telephone. Although dial phones are awesome, there are considerable constraints when using them. The most important one is that with pulse dialing you can only dial once per call. The moment the dialing is complete you cannot dial additional numbers, which prohibits any kind of interactive menu.
Since building a whole analogue telephone network is very time consuming we use old Fritz Boxes to adapt the telephone to a TCP/IP based network. Models which are known to work well with pulse dialing are e.g. Fritz Box Fon ATA.
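
For readers who have never used a rotary phone: pulse dialing transmits each digit as a number of short line interruptions. A minimal Python sketch of the common mapping (1 to 9 pulses for the digits 1 to 9, ten pulses for 0), which is an assumption here since some countries used other mappings:

```python
def pulses_for(number: str) -> list[int]:
    """Pulses a rotary dial sends per digit (1 -> 1 ... 9 -> 9, 0 -> 10)."""
    return [10 if digit == "0" else int(digit) for digit in number]


def digits_from(pulse_counts: list[int]) -> str:
    """Decode counted pulse trains back into the dialed number."""
    return "".join("0" if count == 10 else str(count) for count in pulse_counts)


if __name__ == "__main__":
    print(pulses_for("110"))
    print(digits_from([10, 2]))
```

An adapter which supports pulse dialing has to count these interruptions and translate them back into digits before it can place the call.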

From then on we can use all kinds of standard networking gear. On the festival site we often use Ubiquiti Nanostations to establish radio links when no cable links are available.

As a software PBX we run an Asterisk setup which implements the service features that Cosmic Cable offers. The configuration is available publicly on Github.

Another important part of attractive telephone booths is the audiovisual ringing which combines the classical ringtone with some other effect like blinking lights or a flamethrower. To trigger such events we have used two different concepts over the years.

First we had self-made relay or other driver PCBs on the second analogue port of the Fritz Box. The Asterisk server then automatically called the corresponding number on specific events.

Later on we developed the so-called Mystery Box which has an analogue telephone pass-through and a 12V output which is constantly toggled as long as there’s ringing on the line. The box also provides connection terminals for 12V powered lights as well as a passive POE injector to drive the Nanostations.

mkosi

Fri, 08 Jul 2022 12:41:51 +0200

Installing (Linux) operating systems is a time consuming and, even worse, repetitive task. The whole “burning to disk” era is definitely over and bootable USB sticks are useful in some cases (e.g. installing a new workstation) but let’s be honest, most of the time one installs an OS to be used in some kind of VM or container. For this use case live systems with crude installer software are just not the right tool. PXE based setups with preseed or kickstart are, in my humble opinion, even more wrong since they try to automatically operate an installer which is meant for human interaction.

There are some more modern approaches like cloud-init which are way better to handle but they still expect some kind of machine to boot and then do the installation steps.

What defines an OS installation?

When someone wants to install an operating system they basically want to define the following things:

Where? What kind of storage/filesystem is used?
What? Which kernel, distribution and arbitrary software should be installed?
How? How should it boot and how can I reach it afterwards?

For virtual machines the whole setup can be pre-defined and there is no need for interactive options or guided installers. I just want to define the result, generate it and be able to start it right away. Doing all of that manually is properly annoying, so I wrote a lot of Ansible code to automatically generate container filesystems or full VMs.

mkosi is what I was missing!

A while ago I stumbled upon mkosi, a tool from the systemd suite which meets exactly that expectation. It can generate OS filesystem trees of many different distributions by using their native bootstrapping tools (e.g. debootstrap) and can additionally package the result in an image file with a filesystem and bootloader. You can use mkosi as a single command or feed it configuration files. So it answers the whole “Where?” and “What?” questions and even the “How?” by being able to set passwords, hostname and SSH keys inside the resulting OS to ensure a user can log in after boot.

That’s exactly how I want to install an operating system in 2022! No matter if I’m generating a container for testing or if I want to install a Linux on a physical machine from an arbitrary live system.

Usage

The usage is quite straightforward, although there aren’t many howtos and examples available yet. Of course the manpage is worth a look, but the Archlinux Wiki and Lennart’s blogpost also have some really interesting information.

Creating a Debian container (systemd-nspawn)

First a simple task: we create a filesystem tree containing a bootable Debian and run it as a container.

# create a directory with a debian system
mkosi --format directory --distribution debian --release buster --output test-container --cache /tmp/cache --hostname reality-check.example.com --password foobar

# boot it as container with systemd-nspawn
systemd-nspawn --boot --directory test-container

Creating an Archlinux VM from config file (qemu)

Now we create a full featured image which can be run as a QEMU instance. Since we cannot simply create a shell inside such a VM, we have to establish a SSH connection. mkosi does all of that for us.

# create an image with GPT and BTRFS and install ssh
[Distribution]
Distribution=arch

[Output]
Format=gpt_btrfs
Bootable=yes
Output=test-image.img

[Packages]
Packages=openssh

[Validation]
Password=foobar

[Host]
Ssh=yes
Netdev=yes

# build the image from the default config file but use hosts package repos and cache
# hint: parameters must come before command, otherwise they're ignored
mkosi --cache /var/cache/pacman/pkg/ --use-host-repositories build

# run the generated image with QEMU
mkosi qemu &

# connect to it via SSH
# hint: the created network interface on the host is automatically configured by systemd-networkd
# hint: the password is also used as password for the SSH key
mkosi ssh

I’ve never felt so satisfied after an OS installation ever before

Installing Fedora in a squashfs (qemu)

Now inside the new VM we just overwrite the whole image :-P

# install dependencies
pacman -Sy mkosi dnf squashfs-tools

# create a new OS image with another password and a Fedora system
mkosi --cache /tmp -d fedora --password test -o image.raw --force --format gpt_squashfs

# write it to the VM's disk
# ATTENTION: make sure you run this inside the right shell!
dd if=image.raw of=/dev/sda; sync
poweroff

Afterwards, when the VM has stopped, we can start the image again and voilà, Fedora is booting:

mkosi boot

I feel kinda prepared for any upcoming install parties. Remember the multiboot USB sticks with different OSs on it? Those were times.

Looking forward

Knowing about mkosi will probably change the way I set up machines in the future. It looks like the tool is still on the rise and I expect a whole lot of new supported systems and features. For example NixOS is not yet supported but might fit perfectly.
Another interesting area might be the creation of images for embedded devices like my favorite APU boards, or might one even squint in the direction of postmarketOS?!

Adjustable notebook stand made of wood

Sat, 29 Jan 2022 15:18:27 +0100

To work without an external display I wanted to have a laptop stand which needs less space on the table than the laptop itself.

Inspired by DIY notebook stands like this one made from steel pipes, this one from cork or this one made of wood, I wanted to build one too. Of course, when the stand should be smaller than the notebook on top of it, the barycenter has to be placed well to prevent wobbling. I decided to build something from leftover wood as it might also work as a counterweight.
As base I used a wooden cylinder in which I drilled pairs of holes at different heights. A smaller cylinder can be plugged into them to adjust the slope. The top is a pressed cork floor tile with a medium rough surface and a flat wooden plank glued on to stop anything from sliding down.

Sharing filesystems with virtiofs between multiple VMs

Tue, 09 Nov 2021 16:41:18 +0100

Sharing data between VMs is a headache every admin has once in a while, especially when multiple virtual machines are expected to have read/write access. Between the old-fashioned ways like NFS and the new shiny stuff like Ceph, there is also the ‘simple’ sharing of a filesystem from host to VM. Although some solutions can configure this with one click, I wanted to understand with which technology I can do this manually.

I found virtiofs, which allows a FUSE process on the host to directly pass filesystem access to a VIRTIO device inside the VM. This means no network protocol is involved, which restricts the access to local VMs on the host. But they can use the host’s page cache if so chosen.

So let’s get to it:

virtiofsd is part of qemu-system-common on Debian or, if you use Proxmox like I did in this case, you can find it under /usr/lib/kvm/virtiofsd.

So we start two instances of the daemon sharing the same directory:

/usr/lib/kvm/virtiofsd --socket-path=/tmp/vm1 -o source=/root/multi-test -o cache=auto -o debug
/usr/lib/kvm/virtiofsd --socket-path=/tmp/vm2 -o source=/root/multi-test -o cache=auto -o debug

Each of them opens a socket which is now passed to the VM via the QEMU parameters. When using Proxmox, add the following line to /etc/pve/nodes/$HOSTNAME/qemu-server/$VMID.conf of each VM. The socket path must match the --socket-path of the virtiofsd instance started for that VM. The first block is for the first VM, the second one for the second VM:

args: -chardev socket,id=char0,path=/tmp/vm1 -device vhost-user-fs-pci,queue-size=1024,chardev=char0,tag=virtiofs_vm_1 \
      -m 4G -object memory-backend-file,id=mem,size=4G,mem-path=/dev/shm,share=on -numa node,memdev=mem

args: -chardev socket,id=char0,path=/tmp/vm2 -device vhost-user-fs-pci,queue-size=1024,chardev=char0,tag=virtiofs_vm_2 \
      -m 4G -object memory-backend-file,id=mem,size=4G,mem-path=/dev/shm,share=on -numa node,memdev=mem

tag=virtiofs_vm_X defines the name under which the device will be available inside the VM
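
Since the args lines only differ in the socket path and the tag, they can be generated. Here is a small Python sketch which builds the line per VM. The helper name is made up, and it assumes the sockets are named /tmp/vm1 and /tmp/vm2 like in the virtiofsd calls above.

```python
def virtiofs_args(socket_path: str, tag: str, memory: str = "4G") -> str:
    """Build the QEMU arguments for one virtiofs share."""
    return (
        f"-chardev socket,id=char0,path={socket_path} "
        f"-device vhost-user-fs-pci,queue-size=1024,chardev=char0,tag={tag} "
        f"-m {memory} "
        f"-object memory-backend-file,id=mem,size={memory},mem-path=/dev/shm,share=on "
        "-numa node,memdev=mem"
    )


if __name__ == "__main__":
    for vm in (1, 2):
        # one socket and one tag per VM
        print("args:", virtiofs_args(f"/tmp/vm{vm}", f"virtiofs_vm_{vm}"))
```

The memory size appears twice on purpose: the size of the shared memory backend is kept equal to the VM memory given with -m, as in the lines above.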

See virtiofs howto for further details

Then power off and start the VM again to enable the additional device.

If you’re running a Linux kernel 5.4 or newer inside the VM, it supports virtiofs natively:

mount -t virtiofs virtiofs_vm_1 /mnt

Now the host filesystem is mounted inside the VM.

The VMs are not yet configured to access the host’s cache since this method is still experimental.

Ubuntu created with debootstrap not receiving updates

Tue, 26 Oct 2021 13:44:02 +0200

A while ago I realised that some of my Ubuntu systems, most of them nspawn containers, don’t receive any updates. I didn’t really care about the containers, but when I realised that one of my internet-facing hosts was also affected, I began to search. I searched on https://packages.ubuntu.com/ for the newest kernel and locally with apt-cache policy linux-generic for the newest one available on the system. Of course the kernel which apt listed was way older, but at least I thereby saw that the kernel in the repos should come from the “security” list. All the systems had in common that I created them with debootstrap, most of them via an Ansible role which explicitly adds universe as a component. I thought that was enough, but it leads to a sources.list with just the basic package list, leaving out the security and updates lists.

After I added them manually I got the ton of updates I was missing for something like a year.

deb http://de.archive.ubuntu.com/ubuntu focal main universe restricted
deb http://de.archive.ubuntu.com/ubuntu focal-security main universe restricted
deb http://de.archive.ubuntu.com/ubuntu focal-updates main universe restricted
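
If the systems are created by automation anyway, the complete list can be generated right after bootstrapping. A minimal Python sketch which produces the three lines above for a given release (the function name is made up, mirror and components are the ones from my file):

```python
def ubuntu_sources(release,
                   mirror="http://de.archive.ubuntu.com/ubuntu",
                   components=("main", "universe", "restricted")):
    """Return the sources.list lines for the base, security and updates suites."""
    suites = [release, f"{release}-security", f"{release}-updates"]
    return [f"deb {mirror} {suite} {' '.join(components)}" for suite in suites]


if __name__ == "__main__":
    print("\n".join(ubuntu_sources("focal")))
```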

Afterwards I read the manpage of debootstrap and figured out that it is not capable of bootstrapping from multiple sources. There is another tool called Multistrap for that, but it needs its information from a config file and cannot be fed via command line parameters.

Update (20.09.2022): For the sake of completeness, the problem is the same for Debian systems and for systems created with mkosi. Here is the full file which contains all repositories:

deb http://deb.debian.org/debian bullseye main contrib non-free
deb http://deb.debian.org/debian-security/ bullseye-security main contrib non-free
deb http://deb.debian.org/debian bullseye-updates main contrib non-free
# optional
deb http://deb.debian.org/debian bullseye-backports main contrib non-free

Reinitialize pacmans package database

Wed, 08 Sep 2021 14:39:31 +0200

Within a short time I had two Arch Linux machines where pacman’s database was deleted by accident (it resides under /var/lib/pacman/). Afterwards pacman doesn’t know about any installed packages, which I realized when no updates were available ;-) To verify this situation you can use pacman -Qe to list all explicitly installed packages.

If you lost the database, of course your system continues working but it’s hard to recover the state of the database. If you don’t have a recent backup you can check for installed and/or upgraded packages in /var/log/pacman.log

# search for ALPM messages containing installed or upgraded and write each package name to probably_installed_pkgs.txt
grep ALPM /var/log/pacman.log | awk '/(installed|upgraded)/ {print $4}' | sort | uniq > probably_installed_pkgs.txt

# install each of these packages (fish shell syntax)
for i in (cat probably_installed_pkgs.txt)
    echo "$i"
    sudo pacman --noconfirm -S --needed --overwrite "*" $i
    echo "----"
end

--noconfirm sets pacman to non-interactive
--needed prevents reinstalls, in case the pkg was already installed before as dependency
--overwrite tells pacman to ignore already existing files
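
The grep/awk pipeline above relies on the package name being the fourth field of a log line. If that feels too fragile, the same extraction can be sketched in Python with a regex. The sample lines in the usage part are made up and assume the log format with one ISO timestamp in brackets.

```python
import re

# Matches e.g. "[2021-09-08T14:39:31+0200] [ALPM] installed vim (8.2-1)"
ALPM_LINE = re.compile(r"\[ALPM\] (?:installed|reinstalled|upgraded) (\S+)")


def packages_from_log(lines):
    """Return the sorted, unique package names that were installed or upgraded."""
    return sorted({m.group(1) for line in lines if (m := ALPM_LINE.search(line))})


if __name__ == "__main__":
    # made-up sample lines, normally read from /var/log/pacman.log
    sample = [
        "[2021-09-08T14:39:31+0200] [ALPM] installed vim (8.2-1)",
        "[2021-09-08T14:40:00+0200] [ALPM] upgraded bash (5.1-1 -> 5.1-2)",
        "[2021-09-08T14:41:00+0200] [ALPM] removed nano (5.8-1)",
    ]
    print("\n".join(packages_from_log(sample)))
```

Like the shell version, this does not notice packages which were removed again later, so the result is still only a list of probably installed packages.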

This fixed most of the misery. But I also had a bunch of packages installed which came from the AUR. To install them I wrote this second loop:

# fish shell syntax
for i in (cat probably_installed_pkgs.txt)
  # check if pacman is able to find the package in the standard repos
  # otherwise install it with yay from the AUR
  pacman -Si $i > /dev/null 2>&1 || yay --noconfirm --needed -S --overwrite '*' $i
end

Of course you can combine them into one.

In the end I save the result in a file.

# save all installed pkg names
yay -Qe > installed_pkgs.txt
# count the installed pkgs
yay -Qe | wc -l >> installed_pkgs.txt

This could be done via a cronjob or a pacman post hook.

Everything standardized, nothing works!

Tue, 31 Aug 2021 20:24:36 +0100

In some setups, mostly big storage arrays, SATA drives are prevented from spinning up automatically. There are many standards and lots of technologies to control this behaviour. Unfortunately, I learned about all of them.

A few years ago I got a new home server which just did not work as I expected. This article will follow the different topics I learned about during this endeavour.
I try to keep all paragraphs separated from each other so that you can skip one if you’re not interested.

The basic problem

As described in this post I used an APU2c4 and an Intertech chassis as homeserver and added additional SATA controllers to be able to use 4 to 5 disks. After realising that controllers with an ASM1061 chip won’t work reliably with an APU board, I bought one based on a Marvell 88SE9215. So I got the disks, the backplane with its SAS connector, a cable adapting to 4x SATA and the controller with 4x SATA. I put everything together and surprise: It didn’t work, the disks did not turn on!
I tested them with a USB adapter and they worked fine. Even with just the power cable connected they spin up immediately. So what’s going on here?!

The backplane

So I started digging around for what it could be. The first thing that comes to mind is the backplane. Even with the power cable connected to the backplane, the drives did not spin up.

The backplane has a 4 pin molex power connector supplying 12V and 5V. So I took my multimeter and checked against the SATA pinout. Both voltages are available at the disk, the 3.3V pins were at 0V but they aren’t used in disks anyway. I realised that only one 12V pin is connected, but hey, that should be enough, eh?

The backplane also has two LEDs per disk to show status and activity. If a disk is connected the blue status LED lights up, so the backplane registers the presence of a device. With some older disks the green activity LED additionally blinks once. This LED is wired together with the staggered spinup pin (pin 11 of the power connector) which is, according to my research, normal.

So this pin is high on my backplane with no drive connected and somewhere around 3.3V with a connected drive. It looks like a pull-up resistor on the backplane tells the drives to enable staggered spinup. With staggered spinup enabled the drives wait for a specific SATA command before they start the motor. At least this explains why the disks aren’t spinning up when connected to the backplane.

In the end I could not determine any obvious fatal error or manufacturing defect on the backplane.

The cable

So I went on with examining the cable and oh-my-my this is a real abyss of standards and human incompetence!
But let’s walk through it step by step.

The connector on the backplane is a SFF-8087 connector (also called miniSAS, or internal SAS) which contains 4 Rx/Tx pairs with their related ground lines. Additionally there are pins for a 4 pin sideband connection which can be used by controllers to control the backplane (eg. for lighting up error or replacement LEDS). I also checked these pins on my backplane but could not measure anything on them.

So far so good, this connector can serve 4 distinct SAS or SATA lanes, but SAS and SATA don’t just differ in their commands and protocol but also in their pinout. SAS controller and devices have the same Rx/Tx pinout on their connectors and the swap is done in the cable (crossover). However SATA controllers and devices have switched Rx/Tx pins on their connectors and the cable is straight through.
And I now have a SATA disk, a backplane which supports both, a SATA to SFF-8087 SAS cable and a SATA controller, well fuck.

After spending oceans of time researching, I figured out that there are two different types of these adapter cables with a lot of different names:

the OCR cable which connects a backplane with SFF-8087 connector to a controller with four SATA ports, that’s what I need.
This is also called

Reverse Fanout
Reverse Breakout
Straight Through

the OCF cable which connects a SAS controller to a backplane or drive with SATA ports.
This is also called

Forward
Fanout
Breakout
Crossover

And yes, a third of all online shops got it just wrong or didn’t get it at all. And yes, I bought 3 cables: two were wrong, one was the right one. Each cost around 16-20€.
I connected everything and guess what: The drive did not spin up!

Staggered Spinup (SSU)

Now that I’m absolutely sure that I got the right cable, I have to search for the error somewhere else. So maybe the disk is somehow told to wait with the spinup. As mentioned before, there is a special pin for signaling the drives to do staggered spinup.

The 11th pin on the SATA power connector is not hardwired to its neighbors and is reserved for staggered spinup and activity signaling. So if pin 11 is connected to ground, which is done by a resistor on the disk, then the drive will start up immediately (normal mode). But if the pin is pulled high when the drive is connected, it won’t spin up until a SATA link is established and the controller sends an appropriate command.

I was frustrated enough to risk losing a disk to get this thing working and so I just soldered the pins together and, well, what to say, the drive spun up! “Holy cow I fixed it” I thought, went to my computer and realized that the drive was still not listed in Linux. Instead the kernel told me:

  [ 1126.507176] ata4: COMRESET failed (errno=-32)
  Dec 18 07:34:38 archiso kernel: ata4: COMRESET failed (errno=-32)
  Dec 18 07:34:38 archiso kernel: ata4: reset failed (errno=-32), retrying in 8 secs
  Dec 18 07:34:47 archiso kernel: ata4: SATA link down (SStatus 0 SControl 300)
  [ 1135.653828] ata3: COMRESET failed (errno=-32)
  Dec 18 07:34:47 archiso kernel: ata3: COMRESET failed (errno=-32)
  Dec 18 07:34:47 archiso kernel: ata3: reset failed (errno=-32), retrying in 8 secs
  Dec 18 07:34:56 archiso kernel: ata3: SATA link down (SStatus 0 SControl 300)

These messages indicate that even the COMRESET at the SATA link initialization failed. But at least something tries to connect here at all. There must be something else which keeps the drive from talking to the controller…

Power Up In Standby (PUIS)

Ahh, the internet says there is another power-down feature for SATA disks. It’s called Power Up In Standby and it means that the drive is set into a mode where it does not power up when it’s plugged in, no matter what state the SSU pin is in. Alright, so another thing to fix, then we’re good to go. Some SATA drives have a jumper to set the PUIS mode. I actually found an old one in my stash and with the jumper set, the drive stays quiet when power is attached. Sadly the drives I want to use in the server don’t have jumper pins to set PUIS. But you can also read this feature via software (like hdparm).

hdparm -I /dev/sdX | grep -i -B 1 power-up
     *    SMART feature set
          Power-Up In Standby feature set

The missing * at the beginning of the line indicates that this feature is supported but not enabled!
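If you want to check this on a bunch of drives, reading the asterisk by eye gets old quickly. Here is a minimal sketch in Python that does the same check on the text printed by hdparm -I. The function name feature_enabled and the SAMPLE text are my own; the sample is just the two lines shown above, and I assume that the feature name is the only thing on the line apart from the optional asterisk.

```python
import re

# The two lines from the hdparm -I output shown above.
SAMPLE = """\
     *    SMART feature set
          Power-Up In Standby feature set
"""


def feature_enabled(hdparm_output, feature):
    """Return True if `feature` is marked with '*', False if it is listed
    without '*', and None if the feature is not listed at all."""
    for line in hdparm_output.splitlines():
        # optional leading whitespace, optional '*', then the feature name
        m = re.match(r"^\s*(\*?)\s*(.+?)\s*$", line)
        if m and m.group(2).lower() == feature.lower():
            return m.group(1) == "*"
    return None


if __name__ == "__main__":
    print(feature_enabled(SAMPLE, "Power-Up In Standby feature set"))
```

To run it against a real disk, the output of hdparm -I /dev/sdX can be captured with subprocess.run(..., capture_output=True, text=True) and passed in as hdparm_output.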

With the following commands one can control the PUIS feature flag:

disable PUIS: sudo sg_sat_set_features -f 0x86 /dev/sdX --verbose
enable PUIS: sudo sg_sat_set_features -f 0x06 /dev/sdX --verbose

I tested it with the old drive and, as expected, it has the same effect as the jumper setting, but the feature could neither be enabled nor disabled on the drives I want to use.

Power disable (PWDIS)

Well, believe it or not, but since SATA Rev 3.2+ there is another power feature called Power disable (PWDIS). Like staggered spin-up it is controlled via a pin on the SATA power connector, in this case pin 3, which was one of three 3.3V pins in earlier SATA revisions. See tom’s HARDWARE for further details.
I verified that this pin is not connected to the other two 3.3V pins (like on older disks) and then I started the whole endeavour again. I taped the pin, I soldered it to ground, I shorted it to ground after the disk was connected. Then I did all that in various combinations with both the SSU and the PWDIS pin.
The only measurable effect was some magic smoke leaving the backplane indicating that one of the activity LEDs has left our solar system. All four drives still not spinning up.

The cable, again

Still my mind told me that after all it is most likely that the cable must be the culprit. I searched for a replacement in all offices and workshops I could think of and got a bunch of HP branded 2x SATA -> SAS adapter cables. With these the drives spun up and were recognized by the host system. Oh how crazy is that, was it the wrong cable after all? Well, with the HP cable I could only use the first two drives, but that’s at least something. I used the machine for more than a year with this setup till I ran out of storage. Anyway, if two lanes work, there must be a solution for four SATA lanes. I wanted to understand what the difference between the working and the non-working cable is and so I ripped one of each type open.

These pictures show the PCB of a no-name 4 lane cable (left) and a 2 lane HP cable (right). As you can see the layout is the same. In the center there are the sideband pins and on each side there is a SATA data line pair surrounded by GND. The HP cable had the sideband pins connected; I desoldered them, nothing changed. Also, all wires were connected from one end to the other.

As to be expected, no further noticeable difference between them.

The controller

The project was in such a horrible state that the occasional heavy drinking started to become occasionally more often! Last on the list was the controller, although I seriously doubted that it could be the root cause. Nevertheless I checked my options.
As mentioned above I use a Marvell 88SE9215 which has no RAID capabilities or anything like that. The tooling provided by Marvell can only handle their RAID controllers and did not even recognize mine. So I spent a night or two going through everything the AHCI driver exposes via sysfs about that controller. Then I read the source code of the AHCI driver and found a Marvell-specific option. Additionally there is another driver for these controllers which I could try, but honestly I didn’t even want to anymore.

The rapture

While waiting for the next best time to check out some other combinations of parameters I talked to a lot of people about this mess.
Most of them successfully avoided getting dragged into this, some others lent me hardware for testing. My partner then told me that she likes the idea of fixing such a problem with money and asked what my options on that would be. So I answered more ~~or less~~ sarcastically “Well, I could get a cable from a reputable brand which costs 4 times as much” - “Then you should try that!”

So I paid 45€ (!) for a SATA->SAS cable with all the schibberish I already told you about, plugged it in and it just worked. And it still does…

Booting Ubuntu 20.04 live via http

Wed, 28 Jul 2021 23:39:21 +0200

PXE boot concepts are often complicated chains of many stages. Init ramdisks containing casper 1.445 can load an ISO over HTTP as rootfs.

Most Linux distros offer an ISO file which can either be burned to a CD or copied to a USB stick. Although GRUB2 can handle such ISO files directly, SYSLINUX can’t, and its component PXELINUX is still used in a lot of PXE environments.

The idea behind PXE is that a small program stored on the network card obtains an IP address and then loads a system over the network (e.g. via TFTP). As a second stage, PXELINUX is often used as bootloader, which then loads a Linux kernel and an initramfs. Once they are booted there is a whole lot of possible further boot paths. Ubuntu uses the casper scripts to handle the actual boot sequence. Since casper version 1.445, which is packaged in Ubuntu 20.04 (Focal Fossa), it is capable of mounting the rootfs not only via SMB/CIFS and NFS, but it’s also possible to specify an HTTP link to an ISO file which is then loaded and mounted as root filesystem.

I saw a lot of PXE setups which load the kernel and mount the root filesystem via NFS, and for those which load the kernel via HTTP I wanted to eliminate NFS as a second necessary protocol. I had to experiment a bit to get this working, because the manpage of casper suggests a wrong default. But here is the PXELINUX configuration which worked eventually:

/srv/tftp/pxelinux.cfg/default

LABEL rescue-live-system
        MENU LABEL Rescue Live System
        KERNEL http://fileserver.local/rescue.iso-mounted/casper/vmlinuz
        APPEND initrd=http://fileserver.local/rescue.iso-mounted/casper/initrd.lz boot=casper ip=dhcp netboot=url url=http://fileserver.local/rescue.iso console=tty0 console=ttyS1,115200

First let’s go through the casper options:

boot=casper tells that casper should be used inside the initrd (under /scripts/casper)
ip=dhcp although the manpage of casper says DHCP is default, this option is mandatory
netboot=url instead of smb or nfs we want to netboot from a HTTP url
url=http://fileserver.local/rescue.iso well the path to the ISO which is booted.

If you want to serve the files from a local fileserver you need to extract the kernel and the initrd file from the ISO and serve them separately. To do that just mount the ISO and copy the files:

mount -o loop ubuntu-20.10-desktop-amd64.iso /mnt
cp -av /mnt/casper/{initrd,vmlinuz} /var/www/path/to/fileserver/
umount -l /mnt

If you have internet access you can also boot an ubuntu image directly from ubuntu.com. But be careful since there is no HTTPS support!

Useful links

Howto from Ubuntu.com https://discourse.ubuntu.com/t/netbooting-the-live-server-installer/14510

Shiet Protection

Wed, 23 Jun 2021 15:40:21 +0200

Excel files can protect sheets/cells to prevent accidental editing. This protection can be “secured” by a password. This article describes how to remove that protection with basic Linux tools.

A friend of mine called me and asked if I knew how to remove the write protection of an Excel file. No, I’d never heard of such a thing, so she explained that she was using a template form which she can fill out, but she cannot edit the predefined cells or the sheets themselves.

Since the file was readable and not encrypted, I thought ‘how hard can it be?!', there must be a way to edit this thing without rebuilding it in a new Excel file.

After a short internet research I found multiple approaches:

Open it with LibreOffice and reexport it as a new xlsx file.
Open it with Google Sheets and reexport it as a new xlsx file.
Manually edit the files with a text editor
Lots of shady online tools
Brute forcing the password with Visual Basic Scripts

First I tried opening the file in LibreOffice but sadly I couldn’t remove the write protection. LibreOffice was also asking for a password. Then another friend found a LibreOffice version in a Fedora 32 which seems to ignore this password setting, but after he exported the file as .xlsx again most references were broken and fixing those manually would probably take hours.
Then we tried the same thing with Google Sheets. The application also ignores the password setting and simply loads all sheets fully editable, but sadly when reexporting the file all references are broken.

At this time I felt rather triggered!

To describe that feature briefly: One can set a protection on sheets and cells which keeps users from editing them accidentally. That’s reasonable and even useful.
But then one could also set a password which is needed to unlock these cells and there is absolutely no technical mechanism ensuring that this password is checked. It completely depends on the implementation of the program which is used to work with that sheet. If it just doesn’t parse the password setting there is nothing preventing the editing. What kind of a functionality is that?!

Maybe it’s better to go one layer beneath. XLSX files are basically zipped directories of XML files. So first I extracted the document to be able to read its internals:

highway17 ~/W/excel> cp -av write_protected_excel_shiet.xlsx write_protected_excel_shiet.zip
'write_protected_excel_shiet.xlsx' -> 'write_protected_excel_shiet.zip'
highway17 ~/W/excel> unzip write_protected_excel_shiet.zip -d write_protected_excel_shiet.unzipped
Archive:  write_protected_excel_shiet.zip
  inflating: write_protected_excel_shiet.unzipped/[Content_Types].xml  
  inflating: write_protected_excel_shiet.unzipped/_rels/.rels  
  inflating: write_protected_excel_shiet.unzipped/xl/workbook.xml  
  inflating: write_protected_excel_shiet.unzipped/xl/_rels/workbook.xml.rels  
  inflating: write_protected_excel_shiet.unzipped/xl/styles.xml  
  inflating: write_protected_excel_shiet.unzipped/xl/worksheets/sheet1.xml  
  inflating: write_protected_excel_shiet.unzipped/xl/worksheets/sheet2.xml  
  ...
  inflating: write_protected_excel_shiet.unzipped/xl/worksheets/sheet12.xml  
 extracting: write_protected_excel_shiet.unzipped/xl/media/image1.png  
  inflating: write_protected_excel_shiet.unzipped/xl/worksheets/_rels/sheet11.xml.rels  
  ...

Some guides on the internet describe that the sheetProtection setting should be in the workbook.xml but I found two other Protection settings. In the xl/styles.xml file there are some XML tags which include applyProtection=. Some are set to 1, some to 0. So as a first try I set them all to zero with

sed -i 's/applyProtection="1"/applyProtection="0"/g' xl/styles.xml

Afterwards LibreOffice rendered the file totally scrambled and Excel refused to open it. I was thinking about some checksum which verifies if the file might be corrupted but I couldn’t find anything about it.

So I searched in the other files and found things like this in the xl/worksheets/sheetN.xml files:

highway17 ~/W/e/w/x/worksheets> grep -rio '<[^>]*Protection[^>]*' *
sheet1.xml:<sheetProtection selectLockedCells="1" selectUnlockedCells="1"/
sheet2.xml:<sheetProtection password="90AE" sheet="1" objects="1" scenarios="1" selectLockedCells="1"/
sheet3.xml:<sheetProtection password="90AE" sheet="1" objects="1" scenarios="1" selectLockedCells="1"/
sheet4.xml:<sheetProtection password="90AE" sheet="1" objects="1" scenarios="1" selectLockedCells="1"/
sheet5.xml:<sheetProtection password="90AE" sheet="1" objects="1" scenarios="1" selectLockedCells="1"/
sheet6.xml:<sheetProtection password="90AE" sheet="1" objects="1" scenarios="1" selectLockedCells="1"/
sheet7.xml:<sheetProtection password="90AE" sheet="1" objects="1" scenarios="1" selectLockedCells="1"/
sheet8.xml:<sheetProtection password="90AE" sheet="1" objects="1" scenarios="1" selectLockedCells="1"/
sheet9.xml:<sheetProtection password="90AE" sheet="1" objects="1" scenarios="1" selectLockedCells="1"/
sheet10.xml:<sheetProtection sheet="1" objects="1" scenarios="1" selectLockedCells="1" autoFilter="0" selectUnlockedCells="1"/
sheet11.xml:<sheetProtection sheet="1" objects="1" scenarios="1" selectLockedCells="1" selectUnlockedCells="1"/

The password isn’t there in plaintext of course. Some hashing must be in use but I didn’t investigate much further since I want to fully remove the password instead of recovering it. However I just removed all these “sheetProtection” tags and zipped the whole thing together again.

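grep -lRZ '<sheetProtection' . | xargs -0 sed -i 's/<sheetProtection[^>]*>//g'
# back to the root of the unzipped tree (assuming we are in xl/worksheets),
# so that [Content_Types].xml ends up at the top level of the archive
cd ../..
zip -r unlocked.xlsx ./*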

That worked! I could open the file with LibreOffice, the forms were in order, everything was unprotected and editable and last but not least Excel was able to open the files with working references.
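For the next time someone calls me with such a file, the unzip, sed and zip dance can also be done in one go. This is a minimal sketch in Python under a few assumptions: the names strip_sheet_protection and unlock_xlsx are mine, the sheets are expected under xl/worksheets/ like in the listing above, and only the sheetProtection tags are removed while every other archive member is copied unchanged.

```python
import re
import zipfile

# Matches a complete <sheetProtection .../> tag including the closing '>'.
SHEET_PROTECTION = re.compile(rb"<sheetProtection\b[^>]*>")


def strip_sheet_protection(xml_bytes):
    """Remove all sheetProtection tags from the raw XML of one worksheet."""
    return SHEET_PROTECTION.sub(b"", xml_bytes)


def unlock_xlsx(src, dst):
    """Copy the xlsx archive `src` to `dst` without sheet protection.

    `src` and `dst` may be file names or file-like objects."""
    with zipfile.ZipFile(src) as zin, \
            zipfile.ZipFile(dst, "w", zipfile.ZIP_DEFLATED) as zout:
        for item in zin.infolist():
            data = zin.read(item.filename)
            name = item.filename
            # only touch the worksheets, not the .rels files next to them
            if name.startswith("xl/worksheets/") and name.endswith(".xml"):
                data = strip_sheet_protection(data)
            zout.writestr(item, data)


if __name__ == "__main__":
    unlock_xlsx("write_protected_excel_shiet.xlsx", "unlocked.xlsx")
```

Working on bytes avoids guessing the encoding of the XML files, and writing the members from inside Python sidesteps the question of which directory zip has to be called from.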

Search and destroy corrupt pages in Postgres databases

Tue, 22 Jun 2021 13:04:41 +0200

Postgres can get into a situation where there are corrupt datasets in the DB which are neither readable nor deletable. This article describes how to identify such blocks and zero them with basic Linux utils.

At work I found a very slow and rather huge Postgres database. Over 750GB of Bareos backup information with such incredibly slow I/O that the jobs sometimes time out. So I started to clean the thing up and realized that the problem wasn’t just the amount of data but that the DB was obviously corrupt.

The corrupt database

During my attempts to clean the database I tried some Bareos cleanup jobs, which all failed.

root@bareos-director-live:~# systemd-run -t bareos-dbcheck -b -f
Query failed: SELECT FileId,Name from File WHERE Name LIKE '%/': ERR=PANIC:  corrupted item lengths: total 22664, available space 7912
SSL SYSCALL error: EOF detected

Then I logged into the Postgres DB directly and manually started a VACUUM task to see what exactly fails. These should normally run automatically as AUTOVACUUM, but the only error report from them which I could find was autovacuum: found orphan temp table “pg_temp_20”.
Attempting a manual VACUUM FULL, which rewrites all data and discards obsolete entries, was unsuccessful:

bareos_catalog=# VACUUM FULL;
WARNING:  concurrent delete in progress within table "file"
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.

Okay, the worker process just crashes when doing a VACUUM. That explains why the Bareos cleanup jobs cannot finish and probably also explains why the database is so big. Space with old data just cannot be reclaimed and is therefore never overwritten.

So I tried to figure out where exactly the corrupt data lies. Therefore I ran the VACUUM on every single table in the database and voilà: the ‘file’ table seems to be the culprit.

bareos_catalog=# VACUUM VERBOSE file;
INFO:  vacuuming "public.file"
INFO:  scanned index "file_pkey" to remove 89478200 row versions
DETAIL:  CPU 33.06s/183.07u sec elapsed 298.52 sec
...
INFO:  scanned index "file_pjidpart_idx" to remove 89478227 row versions
DETAIL:  CPU 1.54s/2.08u sec elapsed 13.91 sec
INFO:  "file": removed 89478227 row versions in 2349497 pages
DETAIL:  CPU 32.93s/17.48u sec elapsed 873.81 sec
PANIC:  corrupted item lengths: total 22664, available space 7912
server closed the connection unexpectedly

Attempting to fix it with Postgres tools

Somewhere in this table there seems to be an entry which says it is longer than it can be.

I tried to dump the whole database and hoped that I would be able to reimport it, but no:

root@bareos-dbm-live01:~# sudo -u postgres pg_dump --user bareos_catalog bareos_catalog > /var/lib/postgresql/dump.sql
Password: *************************

pg_dump: error: Dumping the contents of table "file" failed: PQgetResult() failed.
pg_dump: error: Error message from server: PANIC:  corrupted item lengths: total 22664, available space 7912
pg_dump: error: The command was: COPY public.file (fileid, fileindex, jobid, pathid, deltaseq, markid, fhinfo, fhnode, lstat, md5, name) TO stdout;

I searched the web for a solution and found out that one approach is to set the option zero_damaged_pages = true and REINDEX the table afterwards.
Data loss was okay for me at this time, as long as just one entry from the file table would be lost. The Bareos backups which rely on this database would still be restorable.

But of course it ain’t that simple:

bareos_catalog=# REINDEX table file;
WARNING:  concurrent delete in progress within table "file"
ERROR:  could not access status of transaction 125829127
DETAIL:  Could not open file "pg_subtrans/0780": No such file or directory.
CONTEXT:  while checking uniqueness of tuple (23963062,6) in relation "file"

Now this looks like Postgres searches for data in a non-existing file, ugh. I tried to touch that file, which made it even worse. Well, but at least we’ve got some information from this. The CONTEXT line explains that the error occurs when working on a specific tuple (a dataset in the database).

Investigating the bad datasets

So maybe we can work on that thing directly, read it or delete it.

bareos_catalog=# select * from file where ctid='(23963062,6)';
ERROR:  invalid memory alloc request size 18446744073709551613

Nope! Don’t have that much memory installed right now and I don’t think the tuple should be that big. Although it is possible to read the first few columns of the tuple like this:

bareos_catalog=# SELECT ctid,fileid,fileindex,jobid,pathid,deltaseq,markid FROM file WHERE ctid='(23963062,6)';
     ctid     |       fileid        | fileindex  |   jobid    |   pathid   | deltaseq |   markid
--------------+---------------------+------------+------------+------------+----------+------------
 (23963062,6) | 7885913182697562177 | 1109424715 | 1664625506 | 1682055211 |    25968 | 1092632864
(1 row)

But as soon as I tried to read the next column fhinfo, I got an abnormally long response, and all following columns were unreadable. Also, the tuple which resides before the corrupt one (23963062,5) works fine.

Then I tried to read all the tuples in the page. The page no. is 23963062 and the digit after the comma is the index of the tuple inside that page. I found out that there are dozens of tuples with the same type of corruption and also one which complains about ERROR: duplicate key value violates unique constraint "file_pkey". Reading data from other pages was possible. Maybe we have to accept that this whole page is lost…
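Typing one SELECT per tuple gets tedious, so here is a small sketch in Python which only generates the probe statements for one page. The function name ctid_probe_queries and the max_items parameter are my own; how many items a page really holds depends on the table, so the upper bound is something you have to pick yourself. Item numbers inside a page start at 1.

```python
def ctid_probe_queries(table, page_no, max_items, columns="ctid"):
    """Build one SELECT per item pointer (page_no, 1..max_items).

    Each statement can be fed to psql separately, so that a statement
    which fails on a corrupt tuple does not hide the other results."""
    return [
        f"SELECT {columns} FROM {table} WHERE ctid = '({page_no},{item})';"
        for item in range(1, max_items + 1)
    ]


if __name__ == "__main__":
    # hypothetical upper bound of 50 items for the page from above
    for query in ctid_probe_queries("file", 23963062, 50, "ctid,fileid,jobid"):
        print(query)
```

Selecting only the first few columns keeps the probes in the part of the tuple which was still readable in my case.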

Tracking down the bad pages on disk

Now I really was in new territory. I found two guides which describe more or less the same way to zap an entire page out of a corrupt Postgres DB.
https://www.endpoint.com/blog/2010/06/01/tracking-down-database-corruption-with
https://www.postgresql.org/message-id/1184245756.24101.178.camel@coppola.muc.ecircle.de
Respect for these howtos, and thanks a lot for sharing!

First it must be clarified in which file the corrupt page resides. Therefore we need to get the OID of the database, the RELFILENODE of the table and the blocksize to calculate the offset:

bareos_catalog=# select oid from pg_database where datname = 'bareos_catalog';
  oid
-------
 16387
(1 row)
bareos_catalog=# select relfilenode from pg_class where relname = 'file';
 relfilenode
-------------
       16402
(1 row)
postgres=# SELECT current_setting('block_size');
 current_setting
-----------------
 8192
(1 row)

With this information the file can be located under /var/lib/postgresql/9.6/main/base/16387/16402, but since the table is several hundred GB in size, it is split into multiple files.

root@syseleven.managementbki1.backupng.live.dbm:/var/lib/postgresql/9.6/main# ls -1 base/16387/16402* | wc -l
359

So we still need to figure out in which sub file the page resides. Let’s start calculating

The byte offset of the page is the page no. multiplied by the block size: 23963062 * 8192 = 196305403904
Each segment file of the table has a size of 1 GB (the Postgres default segment size). So we’re searching for the no. of the file in which our page offset lies.
That is the page no. multiplied by the block size, divided by 1 GB and rounded down: 23963062 * 8192 / 1024 / 1024 / 1024 = 182
The page should be located in the file with the suffix .182 (the first segment has no suffix, so it is the 183rd file); now we’re searching for the byte offset inside that file.
To get that I just subtracted 182 GB from the overall offset: 23963062 * 8192 - ( 182 * 1024 * 1024 * 1024 ) = 884391936
And last but not least we calculate the offset inside that file in pages instead of bytes by dividing the offset by the block size: 884391936 / 8192 = 107958
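Since it is easy to slip with these numbers, here is the same calculation as a minimal Python sketch. The function names locate_page and segment_file are mine; the defaults are the 8192 byte block size queried above and the 1 GB segment size.

```python
def locate_page(page_no, block_size=8192, segment_size=1024 ** 3):
    """Return (segment no., page no. inside the segment, byte offset inside
    the segment) for a page no. taken from a ctid."""
    pages_per_segment = segment_size // block_size
    segment = page_no // pages_per_segment
    page_in_segment = page_no % pages_per_segment
    return segment, page_in_segment, page_in_segment * block_size


def segment_file(relfilenode, segment):
    """File name of a segment: the first one has no suffix."""
    return str(relfilenode) if segment == 0 else f"{relfilenode}.{segment}"


if __name__ == "__main__":
    segment, page, offset = locate_page(23963062)
    print(segment_file(16402, segment), page, offset)
```

With the values from above this gives the file 16402.182, page 107958 inside that file and a byte offset of 884391936, which matches the manual calculation.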

Now before you do any of this, make sure your database is stopped and you have a backup. Things can still get much worse.

We can build a dd command with that no. to read out the corrupt page. If you know what data is to be expected here, you can verify that you’re on the correct page. I didn’t.

dd if=base/16387/16402.182 bs=8192 count=1 skip=107958 of=test.dd

xxd test.dd | head
000001d0: 0000 0000 0700 8007 0080 6941 2041 2049  ..........iA A I
000001e0: 4567 2041 2041 2041 2041 202d 4220 4241  Eg A A A A -B BA
000001f0: 4120 4220 4264 706d 4b7a 2042 6347 6765  A B BdpmKz BcGge
00000200: 6320 4264 7065 7868 2041 2041 2043 5954  c Bdpexh A A CYT
00000210: 6448 776a 5737 6943 7a4a 5870 3941 4f48  dHwjW7iCzJXp9AOH
00000220: 4448 4b4b 3668 3231 6a70 3558 7174 4537  DHKK6h21jp5XqtE7
00000230: 5a48 7255 5059 416e 6738 1363 6f6e 7465  ZHrUPYAng8.conte
00000240: 6e74 7302 a455 ee02 520a c003 0000 0000  nts..U..R.......
00000250: 0000 0000 6d01 b5a5 3b00 0b00 0209 1800  ....m...;.......
00000260: d8fd 5fcd 0100 0000 ef52 0000 d6e9 0000  .._......R......
...

I just overwrote the whole page with zeros.

Deleting the corrupt pages

Yes, that means data loss, but there’s no way to repair the corrupt tuples, and most probably the data in the page is already obsolete but could not be reclaimed.

As mentioned in the other guides: notice the conv=notrunc to prevent dd from truncating the rest of the file!

dd if=/dev/zero conv=notrunc of=base/16387/16402.182 bs=8192 count=1 seek=107958
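For those who don’t trust themselves with dd, the same overwrite can be sketched in Python. The function name zero_page is mine and this is not what I ran on the real database. Opening the file in "r+b" mode plays the role of conv=notrunc: the file is modified in place and keeps its length. Same rule as above: database stopped, backup at hand.

```python
import os


def zero_page(path, page_no, block_size=8192):
    """Overwrite exactly one page of `path` with zeros, in place."""
    with open(path, "r+b") as f:  # r+b: update in place, never truncate
        f.seek(0, os.SEEK_END)
        size = f.tell()
        offset = page_no * block_size
        # refuse to write behind the end, which would grow the file
        if offset + block_size > size:
            raise ValueError("page lies outside the file")
        f.seek(offset)
        f.write(b"\x00" * block_size)
        f.flush()
        os.fsync(f.fileno())  # make sure the zeros really hit the disk


if __name__ == "__main__":
    zero_page("base/16387/16402.182", 107958)
```

The size check is the one thing dd doesn’t do for you: a wrong page no. raises an error instead of silently extending the segment file.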

I read the page again and saw 8K of zeros in there. Looks good to me. Time to look confident and start Postgres again.

bareos_catalog=# select * from file WHERE ctid='(23963062,6)';
 fileid | fileindex | jobid | pathid | deltaseq | markid | fhinfo | fhnode | lstat | md5 | name
--------+-----------+-------+--------+----------+--------+--------+--------+-------+-----+------
(0 rows)
bareos_catalog=# VACUUM VERBOSE file;
INFO:  vacuuming "public.file"
WARNING:  relation "file" page 23963062 is uninitialized --- fixing
INFO:  scanned index "file_pkey" to remove 178956713 row versions
DETAIL:  CPU 36.52s/189.18u sec elapsed 324.00 sec
....
INFO:  "file": found 863859297 removable, 21859469 nonremovable row versions in 25139106 out of 46748205 pages
DETAIL:  0 dead row versions cannot be removed yet.
There were 287140903 unused item pointers.
Skipped 0 pages due to buffer pins.
1 page is entirely empty.
...

Success! No error when reading the page. And the VACUUM recognized it as uninitialized. It was possible to remove a few hundred GBs of data from that database afterwards which was sleeping there due to the fact that VACUUM wasn’t working correctly.

Building Arch Linux packages under Ubuntu

Mon, 28 Dec 2020 17:22:02 +0100

A few times I had the need to build Arch Linux AUR packages which took a long time on my notebook. I had access to some faster hosts but they were all running Ubuntu (16.04 and later). To keep this in mind I write down how to create an Arch Linux build chroot inside an Ubuntu host.

Bootstrapping Arch Linux

First we need to get the Arch Linux base filesystem and extract it to a directory.

cd /var/lib/machines/
wget http://mirrors.dotsrc.org/archlinux/iso/2020.12.01/archlinux-bootstrap-2020.12.01-x86_64.tar.gz
btrfs subvolume create archlinux-builder
tar -zxvf archlinux-bootstrap-2020.12.01-x86_64.tar.gz 
mv root.x86_64/* archlinux-builder
rmdir root.x86_64

Afterwards we can configure the DNS settings and chroot to it

cd archlinux-builder
cp /etc/resolv.conf etc
bin/arch-chroot .

Finally we can configure pacman inside the chroot and install/build software under a separate user

# select a mirror
vim /etc/pacman.d/mirrorlist
vim /etc/pacman.conf
pacman-key --init
pacman-key --refresh-keys
pacman -Syu
pacman -S --needed base-devel git
git clone https://aur.archlinux.org/yay.git
useradd builder
echo "builder  ALL=(ALL) NOPASSWD:ALL" | sudo tee /etc/sudoers.d/builder
chown builder: -R yay
cd yay
su builder
makepkg -si

Now we can build Arch Linux AUR packages inside that chroot with makepkg or yay.
I usually save the freshly configured state of such a chroot with

# either with BTRFS
btrfs subvolume snapshot archlinux-builder archlinux-builder-fresh
# or with systemd's machinectl
machinectl export-tar archlinux-builder archlinux-builder-fresh.tar

Links

Install Arch Linux from existing Linux (Arch Linux wiki)

Lichtverschmutzer

Wed, 02 Dec 2020 11:30:08 +0100

We disassembled a large equipment cabinet in our storehouse to use its space and I got a lot of pneumatic gear out of it. Since I don’t have any kind of compressed air system at home I did what I do most: I used it to build a desk lamp. I removed all the filter stuff from the chassis to run a USB cable through it and screwed an IKEA LED light arm on top.

Lichtverschmutzer Enterprise Edition

Sun, 29 Nov 2020 22:30:08 +0100

I had another air filter with a barometer left over. Again I took all the filter things out of it to make room for a cable coming in on one of the intakes and I added the arm of a floor lamp on the top of the filter. For this one I ordered a 12V warm white 1W LED and put it on a heatsink with thermal glue. The heatsink just fit into the round chassis on the end of the arm and is held in place by a screw. This one has an external power supply.

DNS Resolution on Laptops

Thu, 26 Nov 2020 19:49:40 +0100

The story of struggling with the always wrong DNS server on my laptop and the behaviour of systemd-resolved with static and DHCP announced DNS servers.

This topic bugged me for a very long time! In Germany there are quite a lot of ISPs which censor specific content, either due to law enforcement or to assert their own interests. For years I have wanted to use DNS servers which I chose myself, so I nailed down my laptop to a fixed DNS server.
However, every time I wanted to use a network which either uses DNS spoofing for a captive portal or which has some local domains, I had to manually change the DNS settings on my laptop. That can get quite annoying, also because afterwards I always have to revert the change.
So for quite a time researching about local DNS resolution settings was my hobby during longer train rides and I figured out the following requirements for my setup:

use a static DNS server (always)
fallback to the DNS server announced via DHCP
optional: fallback to other DNS servers
possibility to add local DNS entries

After playing around with dnsmasq and having a rather bad time with resolvconf I ended up using systemd-resolved in combination with systemd-networkd, which I already used for my Simple wireless setup.

Basic configuration

So I just stepped through the requirements listed above and configured them into /etc/systemd/resolved.conf

[Resolve]
DNS=46.246.46.246
FallbackDNS=1.1.1.1 8.8.4.4 8.8.8.8 2001:4860:4860::8888 2001:4860:4860::8844
Cache=yes

The server set as DNS= is my static DNS server which I trust in this example. The FallbackDNS= servers are other public DNS servers which probably have better availability. systemd-resolved automatically gets the DNS servers announced via DHCP from systemd-networkd and uses them as normal DNS servers.
Pinging google.se will result in the following DNS requests:

01:48:38.670451 IP 10.123.9.194.43043 > 46.246.46.246.53: 25460+% [1au] A? google.se. (61)
01:48:38.670742 IP 10.123.9.194.50049 > 46.246.46.246.53: 25901+% [1au] AAAA? google.se. (61)
01:48:38.670904 IP 10.123.9.194.59267 > 10.123.9.1.53: 35751+% [1au] A? google.se. (38)
01:48:38.671034 IP 10.123.9.194.49471 > 10.123.9.1.53: 57442+% [1au] AAAA? google.se. (38)
01:48:38.708778 IP 10.123.9.1.53 > 10.123.9.194.59267: 35751 1/0/1 A 216.58.207.67 (54)
01:48:38.708779 IP 10.123.9.1.53 > 10.123.9.194.49471: 57442 1/0/1 AAAA 2a00:1450:4001:81e::2003 (66)
01:48:38.762701 IP 46.246.46.246.53 > 10.123.9.194.43043: 25460 1/0/1 A 172.217.20.35 (54)
01:48:38.762702 IP 46.246.46.246.53 > 10.123.9.194.50049: 25901 1/0/1 AAAA 2a00:1450:400f:80a::2003 (66)

The result is that both servers are asked at the same time. The local DNS server is faster in nearly every case, which effectively disables the preferred non-local DNS server.
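To put numbers on that race, here is a small Python sketch which matches the queries and answers of the capture above by server and query id and prints how long each server took. The names CAPTURE and dns_latencies are mine, and the parser only understands the tcpdump line format shown above (IPv4, DNS on port 53).

```python
import re
from collections import defaultdict

# The tcpdump lines from above, unchanged.
CAPTURE = """\
01:48:38.670451 IP 10.123.9.194.43043 > 46.246.46.246.53: 25460+% [1au] A? google.se. (61)
01:48:38.670742 IP 10.123.9.194.50049 > 46.246.46.246.53: 25901+% [1au] AAAA? google.se. (61)
01:48:38.670904 IP 10.123.9.194.59267 > 10.123.9.1.53: 35751+% [1au] A? google.se. (38)
01:48:38.671034 IP 10.123.9.194.49471 > 10.123.9.1.53: 57442+% [1au] AAAA? google.se. (38)
01:48:38.708778 IP 10.123.9.1.53 > 10.123.9.194.59267: 35751 1/0/1 A 216.58.207.67 (54)
01:48:38.708779 IP 10.123.9.1.53 > 10.123.9.194.49471: 57442 1/0/1 AAAA 2a00:1450:4001:81e::2003 (66)
01:48:38.762701 IP 46.246.46.246.53 > 10.123.9.194.43043: 25460 1/0/1 A 172.217.20.35 (54)
01:48:38.762702 IP 46.246.46.246.53 > 10.123.9.194.50049: 25901 1/0/1 AAAA 2a00:1450:400f:80a::2003 (66)
"""

# time, source ip.port, destination ip.port, DNS query id
LINE = re.compile(r"^(\d+):(\d+):(\d+\.\d+) IP (\S+) > (\S+): (\d+)")


def dns_latencies(capture):
    """Map each DNS server IP to its answer latencies in milliseconds."""
    sent = {}
    latencies = defaultdict(list)
    for line in capture.splitlines():
        m = LINE.match(line)
        if not m:
            continue
        hours, minutes, seconds, src, dst, query_id = m.groups()
        t = int(hours) * 3600 + int(minutes) * 60 + float(seconds)
        src_ip, src_port = src.rsplit(".", 1)
        dst_ip, dst_port = dst.rsplit(".", 1)
        if dst_port == "53":  # query going out to a server
            sent[(dst_ip, query_id)] = t
        elif src_port == "53" and (src_ip, query_id) in sent:  # its answer
            latencies[src_ip].append((t - sent[(src_ip, query_id)]) * 1000)
    return dict(latencies)


if __name__ == "__main__":
    for server, values in dns_latencies(CAPTURE).items():
        print(server, [round(v, 1) for v in values])
```

For this capture the local server 10.123.9.1 answers after roughly 38 ms and 46.246.46.246 after roughly 92 ms, so the answer of the local server always arrives first.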

On the other hand, if there were a fixed order in which the DNS servers are queried, it could happen that the external one is queried first about a local domain, like a captive portal. It would then respond with NXDOMAIN and no other DNS servers would be queried.
See https://github.com/systemd/systemd/issues/5755#issuecomment-297005909 for details.

Maybe in the end it’s the best way to disable the DHCP DNS servers in the systemd-network config file for interfaces with changing connections and then reenable them for specific domains. Of course that would mean to maintain lists of domains which should be resolved by DHCP announced DNS servers, which I do not want to do. But it is probably the best way if you don’t trust them!

In the end the setup still misses the requirements from the beginning but it’s the best I’ve got so far and it’s still quite simple.

Links:
https://www.freedesktop.org/software/systemd/man/resolved.conf.html#Options
https://www.freedesktop.org/software/systemd/man/systemd.network.html#%5BNetwork%5D%20Section%20Options
https://www.freedesktop.org/software/systemd/man/systemd.network.html#Domains=

Lichtmaschine

Tue, 03 Nov 2020 20:12:56 +0100

A desk lamp made from an old alternator and an industrial shade

After my wisdom teeth were pulled I spent two weeks at home twice, half the day on painkillers *_*

First I wanted to build a light for bringing my balcony plants through the winter so I decided to go through my metal and electronic trash in search for something I could use to make a full lamp out of an old industrial lampshade from the DDR (East Germany)

I found some wood for a base and an old alternator from my car which seemed suitable for being a weight and revolving stand at the same time. As arms I used solid copper cables with ~1cm diameter.

I fixed the alternator into the wooden base with epoxy and screwed the cables on top of it. The strands were bent in a tangle into the alternator and a 12V DC power supply is attached to two of them so that they can power the LED strips inside the shade.

I called this one “Lichtmaschine” as this is the German word for the alternator in a car but literally it means ‘light machine’. It ended up being my desk lamp.

Host and container name resolution with LLMNR

Thu, 02 Jul 2020 18:53:02 +0200

LLMNR is a rather easy yet unknown protocol to resolve hostnames on a local link. On most modern Linux systems it can be used with systemd-resolved. This article describes the basic components and configs

When I use containers, I don’t want to fumble around with IPs. Especially not when they change each time a container is created from scratch. Of course Docker has its way to address this problem but since I use systemd-nspawn most of the time I wanted to figure out how to do that on my own.

Name resolution on Linux

On a Linux system there are multiple ways for a program to resolve a name to an address. The most common one is to ask the glibc system library, for example by calling getaddrinfo(). Another one is to ask another program via a D-Bus interface.
But that library can do far more than plain DNS requests. It is backed by a system called Name Service Switch (NSS) which performs the lookups for many function calls.
NSS has a lot of modular libraries which can be enabled and ordered in /etc/nsswitch.conf. For example libnss-mdns allows lookups via a multicast DNS resolver like Avahi, libnss-resolve allows the usage of systemd-resolved and finally libnss-mymachines can look up local container names.
So with that we could easily reach our containers by name. But we also want the container to be able to look up its host.
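
To see the NSS path from a program's point of view, here is a minimal Python sketch. It assumes a glibc-based Linux system, where Python's socket.getaddrinfo() ends up in the same getaddrinfo() call described above; the helper name resolve is just something I made up for the illustration.

```python
import socket


def resolve(name):
    """Return the sorted, de-duplicated addresses the system resolver reports for name."""
    # On glibc systems this call walks the sources listed in /etc/nsswitch.conf.
    infos = socket.getaddrinfo(name, None, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the address.
    return sorted({info[4][0] for info in infos})
```

Calling resolve("test-container") on the host should then return whatever the enabled NSS modules know about that name.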

While researching I stumbled upon a technology called Link-Local Multicast Name Resolution (LLMNR). It provides multicast name resolution bound to a link and is therefore ideal for this use case. It’s available in systemd-resolved by default and can be enabled via

#/etc/systemd/resolved.conf
[Resolve]
LLMNR=yes

If you cannot or do not want to use systemd-resolved, have a look at llmnrd.
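
If you want to double-check what a resolved.conf actually says, a small Python sketch like the following can read the setting. It assumes the file is simple enough for Python's configparser (true for the snippet above); the function name llmnr_setting is hypothetical, and None only means the key is not set in that file, not what the compiled-in default is.

```python
import configparser


def llmnr_setting(conf_text):
    """Return the LLMNR value from the [Resolve] section in lower case, or None if unset."""
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.read_string(conf_text)
    value = parser.get("Resolve", "LLMNR", fallback=None)
    return value.strip().lower() if value is not None else None
```

Feed it the content of /etc/systemd/resolved.conf, for example via open(...).read().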

Configuration

Setting up the host

On the host system we want to look up local containers directly. Therefore we install the libnss-mymachines package and enable it in the configuration:

#/etc/nsswitch.conf
hosts: files mymachines dns

It does not matter what is in your /etc/resolv.conf because NSS will first try to look up a name in files like /etc/hosts and then ask the mymachines library.
Of course you can reorder the libraries, but be aware that each library delays every request which it cannot resolve.
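
Since the order matters, here is a small Python sketch that prints the lookup order from an nsswitch.conf. It is only an illustration: the function name hosts_sources is made up, and it simply skips bracketed action items such as [NOTFOUND=return].

```python
import re


def hosts_sources(nsswitch_text):
    """Return the NSS modules of the 'hosts:' line in the order they are consulted."""
    for raw in nsswitch_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if line.startswith("hosts:"):
            rest = line[len("hosts:"):]
            # Remove action items like [NOTFOUND=return]; only module names remain.
            rest = re.sub(r"\[[^\]]*\]", " ", rest)
            return rest.split()
    return []
```

For the host config above this yields files, mymachines, dns in that order.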

Setting up the container

Inside the container we want to configure LLMNR instead but it’s the same principle: First install the libnss-resolve package and then activate it:

#/etc/nsswitch.conf
hosts: files resolve dns

Ensure that systemd-resolved is configured correctly and is up and running:

#/etc/systemd/resolved.conf
[Resolve]
LLMNR=yes
Cache=no
MulticastDNS=no

sudo systemctl enable --now systemd-resolved

You have to enable LLMNR on the host and inside the container to get them to talk to each other. I also disabled multicast DNS and caching inside the container so that they cannot interfere with LLMNR, which now automatically discovers any configured addresses of your container, both IPv4 and IPv6, including link-local addressing:

root@host ~# machinectl
MACHINE           CLASS     SERVICE        OS     VERSION ADDRESSES       
test-container    container systemd-nspawn ubuntu 18.04   192.168.240.200…
1 machines listed.

root@host ~# ping -c 1 test-container
PING test-container(test-container (fe80::2473:aeff:fe41:3bcb%ve-test-container)) 56 data bytes
64 bytes from test-container (fe80::2473:aeff:fe41:3bcb%ve-test-container): icmp_seq=1 ttl=64 time=0.054 ms
--- test-container ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.054/0.054/0.054/0.000 ms

root@host ~# ping -4 -c 1 test-container
PING test-container (169.254.179.218) 56(84) bytes of data.
64 bytes from test-container (169.254.179.218): icmp_seq=1 ttl=64 time=0.085 ms
--- test-container ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.085/0.085/0.085/0.000 ms

Even if the LLMNR configuration is broken and not working, the host can still resolve the container names via the mymachines library:

root@host ~# systemctl stop systemd-resolved.service 
root@host ~# ping -c 1 test-container
PING test-container(fe80::2473:aeff:fe41:3bcb%ve-test-container) 56 data bytes
64 bytes from fe80::2473:aeff:fe41:3bcb%ve-test-container: icmp_seq=1 ttl=64 time=0.058 ms
--- test-container ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.058/0.058/0.058/0.000 ms

Alternative setup

Instead of using libnss-mymachines and libnss-resolve you can also set up systemd-resolved to use LLMNR and then point to it in your /etc/resolv.conf:

#/etc/resolv.conf
nameserver 127.0.0.53

systemd-resolved listens on that address by default and can resolve your requests via different protocols, including LLMNR if it’s enabled in the config:

root@host ~# systemctl start systemd-resolved.service
root@host ~# echo "nameserver 127.0.0.53" > /etc/resolv.conf
root@host ~# ping -c 1 test-container
PING test-container (169.254.179.218) 56(84) bytes of data.
64 bytes from test-container (169.254.179.218): icmp_seq=1 ttl=64 time=0.068 ms
--- test-container ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.068/0.068/0.068/0.000 ms
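
For the curious: on the wire an LLMNR query is a DNS-formatted question sent via UDP to port 5355 of the link-local multicast group 224.0.0.252 (ff02::1:3 for IPv6), as specified in RFC 4795. The following Python sketch only builds such a packet. The function name, the fixed query ID and the default record type A are my own choices for the illustration.

```python
import struct

LLMNR_GROUP_V4 = "224.0.0.252"  # IPv4 multicast group from RFC 4795
LLMNR_PORT = 5355


def build_llmnr_query(name, query_id=0x1234, qtype=1):
    """Build an LLMNR query packet asking for `name` (qtype 1 = A record)."""
    # Header: ID, flags (all zero for a standard query), QDCOUNT=1, no other records.
    header = struct.pack("!HHHHHH", query_id, 0, 1, 0, 0, 0)
    # The name is encoded as length-prefixed labels, terminated by a zero byte.
    labels = [label for label in name.encode("ascii").split(b".") if label]
    qname = b"".join(bytes([len(label)]) + label for label in labels) + b"\x00"
    # Question trailer: QTYPE and QCLASS (1 = IN).
    return header + qname + struct.pack("!HH", qtype, 1)
```

Sending the result with a plain UDP socket to (LLMNR_GROUP_V4, LLMNR_PORT) on the right interface should make every LLMNR responder on that link that owns the name answer.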

FOSDEM 2020

Sun, 09 Feb 2020 23:57:18 +0100

This year I visited Brussels with some colleagues to attend the FOSDEM Conference for the first time https://fosdem.org

Conference and Organisation

The conference itself takes place at the Université libre de Bruxelles (ULB), a campus of more or less rotten buildings, and can be visited without a ticket or fee. There are many, many talks over two days, which average about 30 minutes and are held in 29 different rooms in parallel. The rooms are rather small and the conference is very crowded, so sometimes it’s not possible to get into a room in time, but there is a video stream and recording of every talk and the wifi can handle them pretty well. Also there are a lot of friendly, helpful organizers and volunteers, so if one is awake enough to handle the crowds the conference becomes quite easy. Hint: Even if the Belgian beer is very good, that does not mean you can drink more of it without consequences.

Linux and Memory Management

Linux memory management is a topic which comes up very often, especially when dividing physical memory between different virtual machines. There is also a lot of superficial knowledge about it, and people often speak of memory as if it were a few bytes which are either full or empty and that’s it. Well, the reality is way more complex, and I have been struggling for a few years now to get a better understanding of how one can measure, analyze and configure memory usage in Linux. So I was quite happy to see the talk by Chris Down, who told us about the tools and techniques Facebook uses to thrash memory. https://fosdem.org/2020/schedule/event/containers_memory_management/

Strongly related is the question “To swap or not to swap”, which I have discussed multiple times in the last few years, always with different outcomes. Chris wrote a long article about swap and why it is still needed in modern systems. https://chrisdown.name/2018/01/02/in-defence-of-swap.html / bit.ly/whyswap

So I have to say that I still have to learn a lot about memory management and I am absolutely sure that this knowledge can become crucial in analysing and preventing incidents. Every time I dive into this topic I realise that memory is not a barrel which just fills up but a managed system with complex and interdependent rules. This system must be managed as such and therefore understood to keep it working correctly. When OOM killer runs and destroys a process, the cause is already lost and the system was mismanaged.
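
A tiny example of why “full or empty” is the wrong picture: /proc/meminfo reports MemFree (completely unused pages) and MemAvailable (the kernel's estimate of what could be handed out without swapping) as two different numbers. The Python sketch below just parses that format; the function names are made up, and this is of course not one of the tools from the talk.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integers (kB where a unit is given)."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields:
            values[key.strip()] = int(fields[0])
    return values


def reclaimable_gap_kb(meminfo):
    """Memory that is in use (mostly caches) but still counted as available."""
    return meminfo["MemAvailable"] - meminfo["MemFree"]
```

Run it with parse_meminfo(open("/proc/meminfo").read()) on a machine that has been up for a while and the gap is usually far from zero.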

Virtualization and Systemd

So we come to the question of how we can ensure that a computer system is managed correctly and its resources are shared wisely. I think we can all agree that the time when a Linux system was a kernel which then used a lot of shell scripts to start up manually configured applications is over. A modern system consists of a lot of management and surveillance software and of course it is virtually divided into pieces which do different kinds of work. In this context I want to highlight systemd one more time because there is still a lot for me to learn about it. Systemd is capable of doing a lot more than just replacing the SysV init system. Nowadays, with systemd, a Linux system is aware that it is not just an OS running on a computer but that it has a hierarchy of subsystems (devices, processes, services, namespaces, containers). In contrast to Docker and Vagrant, systemd comes with many modern Linux distributions and is highly integrated into the OS. An operator can use it as a modular system to jail applications, let them depend on hardware changes or manage their resources (see also Linux and Memory Management). I attended a talk about systemd’s security features and its ability to abstract the kernel and limit resources and syscalls:

https://fosdem.org/2020/schedule/event/ussftbasd/
https://github.com/keszybz/systemd-security-talk/blob/master/jesie%C5%84-systemd-security.pdf
They also recommend this documentation:
https://systemd.io/

I think of systemd as a well designed and well documented structure of a modern Linux which finally enables us to configure our system from the kernel up to the application with modules which know about each other, all in the same style. And eventually we can get rid of wiggly, half-baked solutions like VirtuozzoLinux or proprietary appliance systems.

Micro- and Unikernels

On Sunday at FOSDEM I wanted to look at some future stuff (beyond Kubernetes). In the autumn of 2018 I attended the MirageOS hack retreat in Marrakesh, Morocco (mirageos.io). Because I already thought of Kubernetes and Docker as overly complex, bloated, feature-creep software, I really enjoyed the radical approach of compiling just the code which is needed into a unikernel for a smaller footprint, more speed and better security. Think of it as a statically linked application which comes as a bootable image. I was happy to see a half-day micro-/unikernel track at FOSDEM and realized that there are a lot of projects in this area. I especially want to highlight the following two talks:

https://fosdem.org/2020/schedule/event/uk_hipperos/ - a realtime, multicore, hierarchical kernel for embedded systems
https://fosdem.org/2020/schedule/event/uk_unicraft/ - a toolchain to build various unikernels

Unikraft definitely looks like a good starting point to get in touch with unikernels because it supports a lot of common tools and languages (MirageOS was kinda hard for me because it is written in OCaml). Anyhow, I believe that unikernels will be the next big thing once Kubernetes and other cloud approaches become too complicated to be maintainable, and too slow and inflexible because of the millions of lines of code they spend on abstraction.

So in the end this FOSDEM visit encouraged me in my strategy to

learn more about kernel code to understand why a computer behaves the way it does
use Linux systems as a cluster of software whose parts are aware of and can be integrated with each other
build large scale applications with the smallest (carbon) footprint possible

Clouds are made out of tiny water drops not fuel tanks.

Installing Ubuntu 18.04 from Archlinux on APU Board

Thu, 26 Dec 2019 15:04:42 +0100

I decided to use my second APU2c4 board as my new Nextcloud/mailserver. Since this thing should be reliable, and especially reboot-safe because I have to travel a few hundred kilometers to get hardware access in the datacenter, I decided to use Ubuntu as the OS. Sadly the standard Ubuntu installer ISO is not configured to be used via serial console only. Although there are some hints on the web on how to change this, I never got an Ubuntu USB stick working for the installation, so I decided to install Ubuntu without its installer (don’t like it anyway) from another live Linux.

Boot Archlinux

To get started you need a Linux USB stick which can boot a live system. In this example Arch Linux is used; Debian should work as well.
Once the stick is plugged into the APU board and the serial cable is connected, one can use minicom to watch the OS start:

# connect via serial to the APU
sudo minicom -D /dev/ttyUSB0

Then select a baud rate of 38400 to use the Arch Linux bootloader. You can do that in a running minicom with Ctrl+a p d.

Afterwards select “Boot Archlinux” and press Tab to edit the bootloader line and append console=ttyS0,115200. Remember to switch minicom to the same rate once the kernel takes over. Of course you can use any baud rate you want; I use 115200 whenever possible because in the end you use that serial line more than expected and you’re happy if things don’t take that much time.
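
To put a number on “don’t take that much time”, here is a back-of-the-envelope calculation in Python. It assumes the usual 8N1 framing, so every byte costs 10 bits on the wire (start bit, 8 data bits, stop bit); the function name is made up.

```python
def seconds_to_send(num_bytes, baud, bits_per_byte=10):
    """Time to push num_bytes over a serial line, assuming 8N1 framing by default."""
    return num_bytes * bits_per_byte / baud
```

At 115200 baud the line moves 11520 bytes per second, at 38400 baud only 3840, so the same boot log takes three times as long to scroll by.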

Partitioning

Once the livesystem is started we can layout our desired partitions. I use parted for this:

# example
root@archiso:~# parted
GNU Parted 3.2
Using /dev/sdX
Welcome to GNU Parted! Type 'help' to view a list of commands.
(parted) select /dev/sdX
Using /dev/sdX
(parted)
# create a partition table
(parted) mktable msdos
# create a partition
(parted) mkpart primary btrfs 0% 100%
(parted) print
Model: ATA SATA SSD (scsi)
Disk /dev/sda: 16.0GB
Sector size (logical/physical): 512B/512B
Partition Table: msdos
Disk Flags: 

Number  Start   End     Size    Type     File system  Flags
 1      1049kB  16.0GB  16.0GB  primary  btrfs

I tried another layout before, with a GPT partition table, an extra partition for /boot and another one for swap, but I always ended up with a corrupt /boot filesystem after I installed the OS, so I decided to use just one Btrfs partition and do everything else in there.

Afterwards we need to create a filesystem; parted doesn’t do that for us, even though the btrfs argument might imply it. Just use mkfs.btrfs /dev/sdX1 and mount it with mount /dev/sdX1 /mnt

Install Ubuntu

To install Ubuntu we need debootstrap, a tool to generate chroots and basic system files. So ensure you got a working internet connection and install it

# install debootstrap
pacman -Sy debootstrap
# use it to generate a ubuntu bionic filesystem in /mnt
debootstrap --components=main,universe --include ssh,vim,systemd,linux-generic,grub2-common,grub-pc,btrfs-progs --arch=amd64 bionic /mnt http://de.archive.ubuntu.com/ubuntu
# chroot to that directory
arch-chroot /mnt

You are now inside your future Ubuntu installation! You can start to configure it as you wish. Note that I had to extend the PATH first:

root@archiso:~# ls
-bash: ls: command not found
root@archiso:~# PATH=$PATH:/sbin:/bin
root@archiso:~# ls /
bin   etc         initrd.img.old  media  proc  sbin      sys  var
boot  home        lib             mnt    root  srv       tmp  vmlinuz
dev   initrd.img  lib64           opt    run   swapfile  usr  vmlinuz.old

This only occurred when I chrooted from Arch to Ubuntu and wasn’t a problem in the final installation of Ubuntu.

Configure Ubuntu

I decided to only do some very basic config in this session because the host will be configured once the Ubuntu can boot by itself.

# configure timezone
dpkg-reconfigure tzdata
# configure locales, sometimes apt needs them
localedef -i en_US -c -f UTF-8 en_US.UTF-8
# set a root password
passwd
# (or) add a username
useradd foobar
passwd foobar
# set a hostname, otherwise your ubuntu will get used to the name archiso :-)
# (a bare `hostname` only prints the current name, so edit the file instead)
vim /etc/hostname
vim /etc/hosts
apt install btrfs-progs
btrfs filesystem label / system

You need to generate an fstab file manually. Instead of /dev/sdX I recommend referring to the filesystem by its UUID, which can be obtained with lsblk --output name,uuid,type,mountpoint,label, or by a btrfs label if you set one. In the end save a file like this to /etc/fstab:

# device-spec               mount-point     fs-type      options                       dump pass
/dev/disk/by-label/system   /               btrfs        rw,relatime,nofail,subvol=/   0    0

Configure GRUB

Even though a basic Ubuntu system now exists on the APU, there is still no boot process defined. We install the GRUB bootloader on our disk by running

grub-install /dev/sdX

and ensure that /boot/grub/i386-pc/btrfs.mod exists. Also look out for vmlinuz and initrd.img in /boot. If something is missing, ensure that linux-image-generic and grub-pc-bin are installed correctly.

If your /boot folder looks good, edit /etc/default/grub so that your system also uses the serial console:

# disable the quiet boot
GRUB_CMDLINE_LINUX_DEFAULT=""
# tell the linux kernel to use the serial for console output
GRUB_CMDLINE_LINUX="console=ttyS0,115200 acpi=off"
# tell grub to use the serial
GRUB_TERMINAL=serial
GRUB_SERIAL_COMMAND="serial --unit=0 --speed=115200"

and run update-grub to update GRUB’s config files.

Now you are ready to reboot into your newly installed Ubuntu system.

Simple Mailserver Setup

Thu, 24 Oct 2019 17:02:35 +0200

Preamble

This is a classic one. I have been running my own mailserver ever since I was legally allowed to rent root servers, and twice a decade this stuff has to be rebuilt somehow. I ran a setup with postfix + dbmail + SpamAssassin for a long time, but since dbmail is completely dead and it’s a hassle to get the shared libraries straight on anything newer than Ubuntu 14.04, I had to switch to something more convenient and better supported.

For the new mailserver I wanted to change a few things in the software stack and setup organisation. First I decided to build it following these principles:

keep it simple: Mailservers are complex enough
config management: For the whole setup. No more ‘oh shit where is this special config tweak’
reduce overhead: No special additions like admin webinterface/containerization etc. which have to be maintained

Then I thought about the technical specs. The mailserver must support the following features:

smtp+imap+sieve: Basic features to use email
webinterface: For all situations on the go
password change: Users must be able to change their password without an admin

Eventually I came up with the following decisions

postfix+dovecot+btrfs: Stable software, no database, just a filesystem
no spamfilter: I receive less than 2 spam mails per month with a decently configured mailserver
Ansible+git: Use Ansible for config management and setup, and git to store the Ansible stuff
nextcloud+rainloop: An already running Nextcloud will be used with the Rainloop plugin

I then decided to use dovecot as the dbmail replacement as it is well maintained and mature. So let’s get started!

Setup

As soon as we have set the MX record for our domain to our mailserver we should ensure that we can receive mails there. That’s postfix’s job and it has a whole lot of configs. I found a good tutorial and reduced it for my purposes. The whole config with Ansible automation can be found here, but let us go through the most important stuff.

Postfix

First the virtual user and domain setup:

# /etc/postfix/main.cf
virtual_transport = lmtp:unix:private/dovecot-lmtp
virtual_mailbox_domains = domain.tld another.tld
virtual_mailbox_maps = texthash:/etc/postfix/accounts
local_recipient_maps = $virtual_mailbox_maps
virtual_alias_maps = texthash:/etc/postfix/aliases

Mails for virtual mailboxes are handed over to dovecot via a unix socket using the LMTP protocol. The domains which postfix should feel responsible for are written by Ansible directly into main.cf because there are just a few of them.
The addresses/accounts for which postfix should accept mail are written by Ansible to /etc/postfix/accounts and their aliases are added to /etc/postfix/aliases. I use texthash as the file format; it is slower, but for a few entries that does not matter at all, and I don’t have to run postmap every time they change. Keep it simple. The account and alias files look like:

# /etc/postfix/accounts
example@domain.tld  stub
eg@another.tld  stub

The second column of the file is not used; the ‘stub’ is just there to satisfy the file format. So it’s basically a list of account names. The same list will later be necessary for dovecot account lookups, but Ansible just renders it for every server in its own format, so no additional work needs to be done.

# /etc/postfix/aliases
# aliases for example@domain.tld
@domain.tld example@domain.tld
# aliases for eg@another.tld
edgaer_the_pimp@another.tld eg@another.tld

There is a catch-all alias for *@domain.tld to the account example@domain.tld and a second address for the account eg@another.tld.

Then there is the whole security stuff. That is the crucial part of the postfix config. Here we decide whether we block a lot of spam without even parsing the message, or accidentally create an open relay.
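
The idea of rendering one account list into both file formats can be sketched in a few lines of Python. This is not my Ansible template, just an illustration of the principle; the function names are made up and the output formats are copied from the accounts file above and the dovecot userdb file shown further down in this post.

```python
ACCOUNTS = ["example@domain.tld", "eg@another.tld"]


def postfix_accounts(accounts):
    """Render the texthash table for virtual_mailbox_maps; the second column is a dummy."""
    return "".join(f"{addr}  stub\n" for addr in accounts)


def dovecot_userdb(accounts):
    """Render the dovecot user list: the address followed by seven empty fields."""
    return "".join(f"{addr}:::::::\n" for addr in accounts)
```

One list in, two files out, so the two daemons can never disagree about which accounts exist.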

# /etc/postfix/main.cf
disable_vrfy_command = yes
smtpd_helo_required = yes
smtpd_data_restrictions = reject_unauth_pipelining

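We don’t want spammers to be able to probe the catch-all config using the VRFY command, so we just disable it.
We also require a civilized greeting and force clients to use SMTP without shortcuts: reject_unauth_pipelining rejects clients that send commands ahead without waiting for the server’s replies.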

# /etc/postfix/main.cf
smtpd_client_restrictions =
  permit_mynetworks,
  reject_unknown_client_hostname,

Now we come to the restriction of the various SMTP commands. We always allow our local network like localhost (permit_mynetworks) and reject clients with broken DNS<->reverse DNS (reject_unknown_client_hostname)
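
What reject_unknown_client_hostname checks can be approximated in Python as follows: the client IP must resolve to a name, and that name must resolve back to the same IP. This is a simplified sketch and not Postfix's implementation; the function names are made up, the lookups are injectable so the logic can be tried without DNS, and addresses are compared as plain strings.

```python
import socket


def _reverse(ip):
    """IP -> hostname via the system resolver."""
    return socket.gethostbyaddr(ip)[0]


def _forward(name):
    """Hostname -> set of addresses via the system resolver."""
    return {info[4][0] for info in socket.getaddrinfo(name, None)}


def client_hostname_known(ip, reverse=_reverse, forward=_forward):
    """True if ip -> name -> addresses leads back to ip, False otherwise."""
    try:
        name = reverse(ip)
        return ip in forward(name)
    except OSError:  # socket.herror and socket.gaierror are subclasses of OSError
        return False
```

The hosts in the log excerpt further down fail exactly this round trip.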

# /etc/postfix/main.cf
smtpd_helo_restrictions =
  permit_mynetworks,
  reject_invalid_helo_hostname,
  reject_non_fqdn_helo_hostname,
  reject_unknown_helo_hostname,

We also want the remote client to give a valid and resolvable FQDN in the HELO. With the DNS-based filters postfix drops 500-600 connections per month with log messages like this:

# mail.log
/var/log/mail.log.4.gz:Oct  4 05:55:39 novaprospekt postfix/smtpd[21911]: warning: hostname 26.189.237.221.broad.cd.sc.dynamic.163data.com.cn does not resolve to address 221.237.189.26: Name or service not known
/var/log/mail.log.4.gz:Oct  6 00:02:03 novaprospekt postfix/smtpd[32514]: warning: hostname nl1.nlkoddos.com does not resolve to address 93.174.92.223: Name or service not known
/var/log/mail.log.4.gz:Oct  6 00:30:53 novaprospekt postfix/smtpd[485]: warning: hostname mail4.sailof.com does not resolve to address 104.250.108.101
/var/log/mail.log.4.gz:Oct  6 03:41:56 novaprospekt postfix/smtpd[4261]: warning: hostname hosting-by.directwebhost.org does not resolve to address 45.227.253.131: Name or service not known

# /etc/postfix/main.cf
smtpd_sender_restrictions =
        permit_mynetworks,
        permit_sasl_authenticated,
        reject_non_fqdn_sender,
        check_sender_access texthash:/etc/postfix/check_sender_domain

In the sender address restriction we additionally allow connections which have authenticated via username & password and for everyone else we add another filter

# /etc/postfix/check_sender_domain
domain.tld       REJECT You're not one of us!
another.tld      REJECT You're not one of us!

That rejects every connection from a remote host which tries to send mail with one of our own domains as sender.

# /etc/postfix/main.cf
smtpd_recipient_restrictions =
        permit_mynetworks,
        permit_sasl_authenticated,
        reject_unauth_destination

Finally we only accept mails for domains we defined in our config (reject_unauth_destination). That prevents an open relay.

Another trick from the tutorial mentioned above is to create separate rules for mail clients (like the webinterface or Thunderbird) and use them only on the submission port 587. They are very simple

# /etc/postfix/main.cf
# Restrictions for MUAs (used by submission)
mua_relay_restrictions =
  permit_sasl_authenticated,
  reject
mua_sender_restrictions =
  permit_sasl_authenticated,
  reject
mua_client_restrictions =
  permit_sasl_authenticated,
  reject
mua_recipient_restrictions=
  permit_sasl_authenticated,
  reject

but need an extra config snippet in the master.cf

# /etc/postfix/master.cf
submission inet  n       -       n       -       -       smtpd
  -o smtpd_tls_security_level=encrypt
  -o smtpd_sasl_auth_enable=yes
  -o smtpd_sasl_type=dovecot
  -o smtpd_sasl_path=private/auth
  -o smtpd_sasl_security_options=noanonymous
  -o smtpd_sasl_local_domain=$myhostname
  -o smtpd_client_restrictions=$mua_client_restrictions
  -o smtpd_sender_restrictions=$mua_sender_restrictions
  -o smtpd_relay_restrictions=$mua_relay_restrictions
  -o smtpd_recipient_restrictions=$mua_recipient_restrictions
  -o smtpd_helo_required=no
  -o smtpd_helo_restrictions=
  -o cleanup_service_name=submission-header-cleanup
# remove specific headers for privacy reasons
submission-header-cleanup unix n - n    -       0       cleanup
    -o header_checks=regexp:/etc/postfix/submission_header_cleanup

Here we can also see the settings for the authentication. Postfix connects to a socket which dovecot provides, so that we don’t need to configure the authentication in postfix.

The last important feature in the postfix config is the removal of a few client headers that are added by some MUAs. Especially the IP of the client must not be leaked, in my understanding of privacy.

# /etc/postfix/submission_header_cleanup
/^Received:/            IGNORE
/^X-Originating-IP:/    IGNORE
/^X-Mailer:/            IGNORE
/^User-Agent:/          IGNORE
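
To get a feeling for what these four rules do to a message, here is a small Python imitation. It is a sketch, not Postfix code: the function name is made up, matching is case-insensitive like Postfix regexp tables are by default, and folded continuation lines are dropped together with the header they belong to.

```python
import re

# The same four patterns as in /etc/postfix/submission_header_cleanup.
PATTERNS = [
    re.compile(p, re.IGNORECASE)
    for p in (r"^Received:", r"^X-Originating-IP:", r"^X-Mailer:", r"^User-Agent:")
]


def cleanup_headers(header_lines):
    """Return the header lines that survive the IGNORE rules."""
    kept = []
    dropping = False
    for line in header_lines:
        if line[:1] in (" ", "\t"):
            # Continuation of a folded header: shares the fate of its first line.
            if not dropping:
                kept.append(line)
            continue
        dropping = any(p.search(line) for p in PATTERNS)
        if not dropping:
            kept.append(line)
    return kept
```

Everything that says something about the sending client is gone afterwards, while From, To, Subject and friends stay untouched.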

Dovecot

Afterwards we want to hand over the mail to dovecot to store it in the appropriate account and make it available via IMAP for clients. The dovecot config is pretty straightforward and the whole, commented config can be found here.

# /etc/dovecot/dovecot.conf
service auth {
    ### Auth socket for Postfix
    unix_listener /var/spool/postfix/private/auth {
        mode = 0660
        user = postfix
        group = postfix
    }
}

Here is the socket on which postfix can check the user credentials.

# /etc/dovecot/dovecot.conf
userdb {
    driver = passwd-file
    args = /etc/dovecot/userdb
}

And that’s the user account list, very similar to the one which is used by postfix.

One interesting solution I’m slightly proud of is the user password handling. I wanted users to be able to change their password without any administrative action. Back in the days I wrote a roundcube plugin to do this directly in the dbmail MySQL database, but without both dbmail and roundcube a new solution had to be found. I decided to just reuse their Nextcloud account information and wrote a small SQL query for that:

# /etc/dovecot/nextcloud_passwd_sql.conf
driver = mysql
connect = host=localhost dbname=nextcloud user=nextcloud-dovecot password=supersecret port=3306
password_query = SELECT REPLACE( password, '2|', '{ARGON2I}') password FROM oc_users WHERE uid_lower=REGEXP_REPLACE('%u', '@.*', '');

The query replaces the Nextcloud-specific hash prefix with the one dovecot uses and searches for a Nextcloud username which matches the local part of the account email address in dovecot. The corresponding user table from Nextcloud looks like:

# mysql> select * from nextcloud.oc_users;
+---------+-------------+----------------------------------------------------------------------------------------------------+-----------+
| uid     | displayname | password                                                                                           | uid_lower |
+---------+-------------+----------------------------------------------------------------------------------------------------+-----------+
| Example | NULL        | 2|$argon2i$v=19$m=32768,t=4,p=1$oS3RHai8LRX8gqOMCX095w$eFEEHBaVOh56whmB6hJgggHeVydrJpKMOi7T2hj5vfI | example   |
| eg      | NULL        | 2|$argon2i$v=19$m=32768,t=4,p=1$seluAwz5IZJskkOrFlevNw$KEim1FMtlLqIFb20Co1d/bK+7xj23irip9/GLiXPNpY | eg        |
+---------+-------------+----------------------------------------------------------------------------------------------------+-----------+

and the dovecot userdb file which is provided via Ansible looks like:

# /etc/dovecot/userdb
example@domain.tld:::::::
eg@another.tld:::::::
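
The two string operations of the password query can be replayed in Python to see what dovecot gets back. This is only an illustration of the SQL above; the function names are made up and the hash in the usage line is a shortened placeholder, not a real one.

```python
import re


def dovecot_password(nextcloud_password):
    """Mirror REPLACE(password, '2|', '{ARGON2I}') from the password_query."""
    return nextcloud_password.replace("2|", "{ARGON2I}")


def local_part(login):
    """Mirror REGEXP_REPLACE('%u', '@.*', ''): cut everything from the first @ on."""
    return re.sub(r"@.*", "", login)
```

So a login as example@domain.tld is looked up as example in oc_users, and a stored value like 2|$argon2i$... comes back as {ARGON2I}$argon2i$..., which dovecot can verify.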

The final tidbit was to use the Rainloop IMAP webinterface, installed as a Nextcloud plugin, with autologin. The benefits of using Rainloop as a Nextcloud plugin are that no extra vhost configuration is necessary, updates come automatically via the Nextcloud plugin management, all contacts stored in Nextcloud are automatically available in Rainloop and, last but not least, if the user enters the email address in the Nextcloud account, the Rainloop webinterface logs in automatically.
Well, at least it will when this pull request is merged :-P https://github.com/pierre-alain-b/rainloop-nextcloud/pull/111

Update

A friend of mine was forced to use Outlook, which triggered the following error:

Jun 12 22:50:29 dovecot: imap-login: Error: Diffie-Hellman key exchange requested, but no DH parameters provided. Set ssh_dh=</path/to/dh.pem
Jun 12 22:50:29 dovecot: imap-login: Disconnected (no auth attempts in 0 secs): user=<>, rip=92.XX.YY.ZZ, lip=185.XX.YY.ZZ, TLS handshaking: SSL_accept() failed: error:141EC044:SSL routines:tls_construct_server_key_exchange:internal error, session=<EWRFXXXXXXXXXXXXX>

The Diffie-Hellman key file was missing. I created it with openssl dhparam 4096 > /etc/dovecot/dh.pem, which took about half an hour.
Also don’t get confused by the “ssh_dh” in the log message: it’s a typo for ssl_dh and has been fixed in the meantime.

Reference:

https://thomas-leister.de/en/mailserver-debian-stretch/ on archive.org

Hyperconverged Garden

Wed, 29 May 2019 21:31:27 +0200

“I’m a huge fan”. With these words written on it, this totally broken Bladecenter fan had been lying around in our office for something like a year. Well, actually I’m not a fan of Bladecenters at all, so I decided to make something useful out of it.

Dell Latitude 2nd HDD slot

Wed, 29 May 2019 20:08:16 +0200

Once I bought a Lenovo Thinkpad Edge, can’t remember the exact number, but I selected the small CPU and it was so slow, that I smashed it one day so that the display was broken. What a relief! Afterwards my flatmate gave me a Dell Latitude E4300 which I used approximately from 2014 to 2018 and besides it was also very old I was very happy with it. This thing was unbreakable and a good companion!

Cause I didn’t had the money to buy big SSDs at that time I had only 120GB SSD in the only SATA slot. But the thing had a DVD tray which I never used so it was very obviously to remove this and replace it with a HDD for slow and cheap storage. But I could afford that either, all I could by was a 1TB 2,5” HDD (7mm) which fit in the slot. So I decided to build the HDD tray myself.

I pulled out the DVD tray and removed everything inside and just kept the cover and groundplate. The HDD fit in the hight exactly but the connector was a [https://de.wikipedia.org/wiki/Serial_ATA#Slimline_SATA](SATA Slimline connector) which is quite common for CD/DVD trays. Since buying adapters would have been already way to convenient I tried to build one.

Because I don’t wanted to solder SATA cables and connector I cut out the datapin connector from the Slimline connector from DVD tray and used a SATA extension cable which has a male connector to replace it. They didn’t fit really good so I had to cut them down a bit and fixed them with as much hot glue as possible. Then I always had to fumble the whole thing inside the DVD tray to test if the connector fits. Without seeing anything inside this was a very annoying hour.

But finally I had a working Slimline connector. I soldered power cables to it and glued everything together so that I could connect the hard drive. Compared to the Slimline connector this was easy, even though the SATA cable was very rigid.

Then I tried the whole thing and voilà, the SATA disk spun up and was recognized by the system. Well done, I thought. But damn, the SATA cable had to be twisted to get the connector into the right orientation, and it turned out that the width of the cable didn’t fit the height of the tray…

So I pulled it out again and made a cut right down the middle of the cable’s insulation, between the wire pairs. Of course I practised this beforehand on another SATA cable. And finally the whole glue blob fit inside the tray :-)

Homeserver

Thu, 10 Jan 2019 02:27:38 +0100

Storage server with APU2c4 Board, btrfs and systemd-nspawn

Homeserver. Standard. Something with a bit of space for backups, entertainment and, for sure, to be a seedbox. I decided to build one based on an apu2c4 board because they don’t need much power and are reliable. Sadly they have only one SATA port, so an additional controller is needed to get a RAID running. The plan was to get at least 8TB of net capacity with one parity disk (like RAID).

But let’s start from the beginning. I got a small 19” office rack into which I wanted to migrate all my home computers. Sadly it only has around 45cm of usable depth, most probably because it was meant for office networks, so I needed very short cases. After a while of searching for something that fits and has 3,5” HDD trays with a backplane I found this one. The case has been good so far; the backplane and the miniSAS connector still give me a headache, but that’s another story.

However, the case fit into the rack and I was able to fumble a small power supply and the APU board inside with a bit of mechanical adaptation. In German one would call it “hemdsärmelig” (literally “in shirtsleeves”, meaning hands-on and improvised).

Afterwards I got stuck for a few days while trying to get the APU board running without SeaBIOS, using just a self-compiled coreboot with GRUB directly as payload. But that is, again, another story.

So the hardware in use is:

1x Intertech 2U 2404S
1x APU 2c4 Board
1x 16GB mSATA SSD
4x Seagate Barracuda 4TB ST4000DM004

Anyway, since the first goal was a working storage server, I set up the following software on bare metal:

Arch Linux
16GB SSD with btrfs as system disk
2x 4TB with btrfs mirror
smartmontools
samba for file access
avahi for local discovery
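
The two-disk btrfs mirror from this list could be created roughly as follows. This is only a sketch and not the exact commands used here: the device names /dev/sdX and /dev/sdY, the label and the mount point are placeholders, and the helper only prints the commands unless APPLY=1 is set, because mkfs destroys whatever is on the disks.

```shell
#!/usr/bin/env bash
# Sketch: create and mount a two-disk btrfs mirror (RAID1 for data and metadata).
# Device names, label and mount point are placeholders, not taken from the post.

# Print each command; only execute it when APPLY=1 is set explicitly.
run() {
    echo "$*"
    if [ "${APPLY:-0}" = "1" ]; then
        "$@"
    fi
}

make_mirror() {
    local disk_a="$1" disk_b="$2" mountpoint="$3"
    # -d raid1 mirrors file data, -m raid1 mirrors filesystem metadata.
    run mkfs.btrfs -L storage -d raid1 -m raid1 "$disk_a" "$disk_b"
    # Any member device can be named in the mount command.
    run mount "$disk_a" "$mountpoint"
    # Shows the allocation per profile, so RAID1 should be listed here.
    run btrfs filesystem usage "$mountpoint"
}

# Example: make_mirror /dev/sdX /dev/sdY /mnt/storage
```

Note that a two-disk RAID1 offers the capacity of a single disk, so 2x 4TB gives about 4TB of usable space.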

No crypto! I still have my previous homeserver around, which I can no longer access, but that’s a whole different story that I won’t blog about. So for now the thing works.

Update: While deciding which HDDs to use, I read a lot about Shingled Magnetic Recording (SMR) and decided not to buy HDDs that work this way, simply because I mistrust new technology and the write performance isn’t great. I bought the Seagate Barracuda disks because they were surprisingly cheap, and I already wondered at the time why they have so much cache (256MB). According to this blogpost the drives actually use SMR, although it’s not mentioned. -10 for Seagate!

Update: In the meantime Seagate has released a list of their disks and the recording technology each one uses.

Netjail

Mon, 19 Nov 2018 10:11:23 +0000

Introducing netjail

A long time ago I felt the need to force a single application to use a specific network connection. To be more specific, I needed to ensure that my torrent client cannot escape the VPN connection. Today the internet is full of scripts and tools to achieve that on any kind of OS, and many people just use container virtualization like Docker or LXC.
Anyway, I started to write a bash script that cages an application in a separate network namespace in which only an OpenVPN connection is available. At first I wrote a special routing+iptables felony, but then I realized that I can simply move an existing TUN interface into another namespace. The script is called netjail [1].
Here is an example:

ip netns add netjail
ip link set tun0 netns netjail

The tricky thing is that the IPs and routes have to be reconfigured afterwards, because the interface loses all its configuration when it changes network namespace. Normally OpenVPN automatically configures the interfaces it creates, but for special cases like this it can call bash scripts and pass them more or less every configuration parameter. So I added a few lines of bash, inspired by [2], which OpenVPN calls during initialisation. They move the interface into netjail’s network namespace and configure it there [3].
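
What such a hook might look like is sketched below. It is not the actual netjail code: the NETNS default, the DRY_RUN switch and the assumption of a VPN using "topology subnet" (so that OpenVPN provides a netmask) are illustrative. The environment variables dev, ifconfig_local, ifconfig_netmask, route_vpn_gateway and script_type are the ones OpenVPN exports to its scripts.

```shell
#!/usr/bin/env bash
# Sketch of an OpenVPN --up hook that moves the tunnel into a network namespace.
# Not the real netjail script; namespace name and DRY_RUN are assumptions.

NETNS="${NETNS:-netjail}"

# With DRY_RUN=1 the commands are only printed, so no root is needed to try it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

netjail_up() {
    run ip netns add "$NETNS"
    # Moving the interface drops its addresses and routes ...
    run ip link set "$dev" netns "$NETNS"
    # ... so they are configured again inside the namespace.
    run ip netns exec "$NETNS" ip addr add "$ifconfig_local/$ifconfig_netmask" dev "$dev"
    run ip netns exec "$NETNS" ip link set "$dev" up
    run ip netns exec "$NETNS" ip route add default via "$route_vpn_gateway"
}

# OpenVPN sets script_type=up when it runs the --up hook.
if [ "${script_type:-}" = "up" ]; then
    netjail_up
fi
```

To use something like this, OpenVPN would be started with --script-security 2, --up pointing at the script, and --ifconfig-noexec plus --route-noexec so that it does not try to configure the interface itself.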

Last but not least, the script is able to directly start an application inside this new namespace. This application can only use the OpenVPN connection to talk to the outside. If OpenVPN dies, the application is offline.
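
Starting a program inside the namespace by hand boils down to ip netns exec. The helper below is a hypothetical illustration, not part of netjail; the user name and the program are placeholders.

```shell
#!/usr/bin/env bash
# Sketch: build the command line that starts a program inside the namespace.
# Namespace, user and program in the example call are placeholders.

# ip netns exec needs root, so sudo -u drops back to an unprivileged user.
jailed_command() {
    local netns="$1" user="$2"
    shift 2
    echo ip netns exec "$netns" sudo -u "$user" "$@"
}

jailed_command netjail alice transmission-gtk
# prints: ip netns exec netjail sudo -u alice transmission-gtk
```

The printed command has to be run as root. Inside the namespace the program only sees the interfaces that were moved there, which is what takes it offline when the tunnel goes away.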

It’s also possible to use the script to prepare network namespaces for container usage or to push an OpenVPN interface into an already existing namespace.

Links:
[1] https://github.com/benibr/netjail
[2] http://www.naju.se/articles/openvpn-netns.html
[3] https://linux.die.net/man/8/openvpn

HP Micro Flowerpot

Sat, 15 Sep 2018 20:41:13 +0200

A friend of mine gave me the case of an old HP Gen8 Microserver: chassis only, without the board or power supply. Since the HDD cage has no real backplane, just a SAS (SFF-8087) plug attached directly to the HDD slots, and I didn’t have a SAS controller at the time, the case wasn’t useful to me. Because I only want to install 19” rack-mountable hardware at home, I finally decided to do something useful with that crappy HP case and reused it for the first plant in my new apartment :-)

Notifications under i3

Sat, 05 May 2018 22:12:00 +0200

I usually use systemd user services to get some simple tasks done on my workstations, like backups of my home folder or changing my wallpaper every hour. For some of them I want to get notifications when they succeed or fail, and since I don’t use a desktop environment like GNOME I had to choose the components myself. Right now I have glued together X.org, LightDM and the i3 window manager under Arch Linux, and I wanted something that shows me a noticeable popup and not just a line of text somewhere.

First I realized that notify-send is a standard tool to generate notifications from a shell script and pass them to some running daemon which shows the messages. I quickly looked through the usage text and tried the basic functionality:

 notify-send --urgency normal --icon weather-storm "Backup failed" "Could not connect to backup host."

The first text is the title, the second the actual message. Sadly nothing happened, because I didn’t have any notification daemon running yet. So I searched the Arch Linux wiki and tried a few of the listed ones. My requirements were:

useable out-of-the-box
simple to configure
somehow pretty graphical popup

Finally I decided to use mate-notification-daemon, a fork of the notification-daemon from the GNOME Project. It is available in the Arch Linux repositories and comes with a small tool (mate-notification-properties) to configure the /usr/share/mate-notification-daemon/mate-notification-properties.ui XML file. I found a list of the available icons one can use on freedesktop.org.

Links: Mirror of freedesktop.org/icon-naming-spec/icon-naming-spec-latest.html
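
To tie this back to the systemd user services mentioned at the beginning, a small wrapper can run a task and send a notification depending on its exit status. This is a sketch, not the script I actually use: the function name and the NOTIFY override are made up for illustration, and dialog-information and dialog-error are icon names from the freedesktop.org naming spec.

```shell
#!/usr/bin/env bash
# Sketch: run a task and report the result as a desktop notification.
# NOTIFY is a hypothetical override so the function can be tried without a
# running notification daemon; by default it calls notify-send.

notify_result() {
    local title="$1"
    shift
    if "$@"; then
        "${NOTIFY:-notify-send}" --urgency low --icon dialog-information \
            "$title succeeded" "Command finished: $*"
    else
        # $? still holds the exit status of the failed command here.
        local status=$?
        "${NOTIFY:-notify-send}" --urgency critical --icon dialog-error \
            "$title failed" "Command exited with status $status: $*"
        return "$status"
    fi
}

# Example: notify_result "Backup" rsync -a "$HOME/" backuphost:backup/
```

The wrapper passes the exit status through, so a systemd service using it would still be marked as failed when the task fails.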

Simple Wireless Setup

Tue, 24 Apr 2018 21:17:57 +0200

After using NetworkManager and Wicd for a few years each, I craved a less complicated networking setup on my laptop. I did some research and found that nothing out there fit my needs, so I decided to just use a minimal basic setup.

systemd-networkd

I created a simple network file to use systemd to do the DHCP:

[Match]
Name=wlan0

[Network]
DHCP=yes

[DHCP]
RouteMetric=10

Name tells systemd which interfaces the config applies to; wlp* would also be valid. The DHCP parameter should be self-explanatory; ipv4 would also be valid. RouteMetric tells systemd that any route announced by the DHCP server gets this specific metric. I use a lower metric for routes on my wired interfaces to force traffic through the cable whenever possible.
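
The wired counterpart looks the same apart from the interface name and the metric. The generator below is a sketch: the en* pattern, the metric 5 and the file names in the usage note are assumptions, the only given being that the wired metric is lower than the wireless one (10).

```shell
#!/usr/bin/env bash
# Sketch: print a systemd-networkd .network file for one interface pattern.
# Interface patterns and metrics passed in are up to the caller; the post
# itself only specifies wlan0 with metric 10.

network_file() {
    local match="$1" metric="$2"
    # The [DHCP] section name follows the post; newer systemd releases
    # document the same setting under [DHCPv4].
    cat <<EOF
[Match]
Name=$match

[Network]
DHCP=yes

[DHCP]
RouteMetric=$metric
EOF
}

# Example: network_file 'en*' 5
```

The output could be written to a file such as /etc/systemd/network/20-wired.network (name chosen freely), followed by a restart of systemd-networkd. With both interfaces up, ip route should then list two default routes, and the one with the lower metric wins.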

wpa_supplicant

Of course wpa_supplicant is needed on Linux to join wireless networks. I use wpa_gui in lazy hours to create the config for me. Writing or editing the config manually is still needed sometimes:

ctrl_interface=DIR=/var/run/wpa_supplicant GROUP=network
update_config=1
ap_scan=1

network={
    ssid="karlsruhe.freifunk.net"
    key_mgmt=NONE
    auth_alg=OPEN
    #disabled=1
}

This is as simple as possible. ctrl_interface defines the socket through which wpa_gui or something similar can connect to wpa_supplicant. update_config allows wpa_supplicant to overwrite the config file if necessary. ap_scan=1 tells wpa_supplicant to scan and decide itself which network to connect to. It would also be possible to let the driver do that (ap_scan=0).

After a while I found out that wpa_gui creates the network config snippet with the disabled=1 parameter by default. This prevents wpa_supplicant from connecting to the network automatically when it shows up in a scan. For most networks I remove or comment out this parameter manually.
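
Commenting out the parameter can also be done with a one-line sed filter. This is a sketch with a made-up function name, and it touches every network block in the file, so it only fits if all networks should auto-connect.

```shell
#!/usr/bin/env bash
# Sketch: comment out every "disabled=1" line that wpa_gui wrote, so that
# wpa_supplicant may auto-connect to those networks again.
# Reads a wpa_supplicant.conf on stdin and writes the result to stdout.

enable_networks() {
    # Keep the indentation and put a '#' in front of the parameter.
    sed -E 's/^([[:space:]]*)disabled=1[[:space:]]*$/\1#disabled=1/'
}

# Example: enable_networks < wpa_supplicant.conf > wpa_supplicant.conf.new
```

After replacing the config file with the filtered version, wpa_cli reconfigure makes a running wpa_supplicant reload it.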

Useful Links:

wpa_supplicant.conf (5)

Introducing

Mon, 09 Apr 2018 20:56:21 +0200

blog.domainmess.org

This will be a random collection of stuff I do, should not do, or already did. I just felt the need for some form of publication that I can control myself without depending on any kind of company.
This is a static page created with hugo and the nofancy theme, proudly served with nginx.