Load balancing TCP connections in the Linux kernel

While trying to make a Linux application suitable for rolling upgrades, I searched for a way to take over an already bound TCP port. I noticed that some versions of netcat are able to do that, and that it also worked when I was using Docker's port forwarding feature. So I started to investigate how this works and what one could make of it.

SO_REUSEPORT

Apparently there is a kernel feature in Linux since version 3.9 meant to handle exactly that kind of problem: multiple applications or threads can listen on the same address:port combination. If the socket option SO_REUSEPORT is set on every socket (including the first one) before it binds, other processes with the same effective UID can attach to the same port. A lot of software supports this, usually to spread the work over multiple processes. One example is the MPM modules of the Apache web server.
SO_REUSEPORT must not be confused with the SO_REUSEADDR option. An excellent explanation of their differences and usage in other operating systems can be found on Stack Overflow.
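To make the bind behaviour concrete, here is a minimal Python sketch, assuming Linux 3.9 or later. The helper name bind_pair is made up for this illustration. It tries to bind two TCP sockets to the same loopback port, once with SO_REUSEPORT set on both and once without.

```python
import errno
import socket


def bind_pair(reuseport: bool) -> bool:
    """Return True if two sockets could bind and listen on the same port."""
    first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        if reuseport:
            # The option has to be set on every socket, including the first,
            # and before bind() is called.
            first.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
            second.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
        first.bind(("127.0.0.1", 0))  # port 0: let the kernel pick a free port
        first.listen()
        port = first.getsockname()[1]
        try:
            second.bind(("127.0.0.1", port))
            second.listen()
        except OSError as exc:
            if exc.errno == errno.EADDRINUSE:
                return False  # the port is already taken
            raise
        return True
    finally:
        first.close()
        second.close()


if __name__ == "__main__":
    print("with SO_REUSEPORT:   ", bind_pair(True))
    print("without SO_REUSEPORT:", bind_pair(False))
```

Without the option, the second bind() is expected to fail with EADDRINUSE. With it, both sockets end up listening on the same port.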

Load Balancing

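Now there is one problem with reusable ports. All the sockets that bind to the same address:port combination form a group, and the kernel load balances the incoming connections across that group. By default the receiving socket is picked based on a hash of the connection's addresses and ports, so connections are spread roughly evenly rather than in strict round-robin order. See the socket(7) man page: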

For TCP sockets, this option allows accept(2) load distribution in a multi-threaded server to be improved by using a distinct listener socket for each thread. This provides improved load distribution as compared to traditional techniques such using a single accept(2)ing thread that distributes connections, or having multiple threads that compete to accept(2) from the same socket.

Yeah, that makes sense for a high-performance web server, but not when I want to hand over incoming connections to a new process.
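The distribution can be observed without Apache. The following Python sketch (Linux only, helper names made up for illustration) opens two SO_REUSEPORT listeners on one loopback port inside a single process, connects a number of clients and counts which listener each connection arrived at. The exact split depends on the source ports the kernel hands out, so it varies between runs.

```python
import select
import socket
from collections import Counter


def reuseport_listener(port: int = 0) -> socket.socket:
    """Create a listening TCP socket on 127.0.0.1 with SO_REUSEPORT set."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    sock.bind(("127.0.0.1", port))
    sock.listen(128)
    return sock


def count_distribution(connections: int = 100) -> Counter:
    """Connect `connections` clients and count arrivals per listener."""
    first = reuseport_listener()
    port = first.getsockname()[1]
    second = reuseport_listener(port)  # joins the same reuseport group
    names = {first: "first", second: "second"}
    counts = Counter()
    clients = []
    try:
        for _ in range(connections):
            clients.append(socket.create_connection(("127.0.0.1", port), timeout=2))
        # Drain both accept queues until every connection is accounted for.
        while sum(counts.values()) < connections:
            ready, _, _ = select.select([first, second], [], [], 2.0)
            if not ready:
                break  # nothing more to accept
            for listener in ready:
                conn, _ = listener.accept()
                conn.close()
                counts[names[listener]] += 1
    finally:
        for sock in clients + [first, second]:
            sock.close()
    return counts


if __name__ == "__main__":
    print(count_distribution())
```

Each connection lands on exactly one of the two listeners. Neither listener can decline or pass on a connection the kernel has assigned to it.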

Some tests

To test how this behaves I created an Apache config with Listen 80 reuseport and ListenCoresBucketsRatio 2. With this config Apache enables the SO_REUSEPORT option on its socket. Sadly there is no easy way in Linux to show all the options of a socket in human-readable form.
I used knetstat to verify that Apache was working as expected.

$ cat /proc/net/tcpstat 
Recv-Q Send-Q Local Address           Foreign Address         Stat Diag Options
     0      0 0.0.0.0:80              0.0.0.0:*               LSTN      SO_REUSEPORT=1,SO_REUSEADDR=1,SO_KEEPALIVE=0,TCP_NODELAY=0

My idea was to start two of those Apaches, simulate a fault state in one of them and see how the incoming connections would spread. So I ran the following as a basic simulation:

# launch two Apaches, one after the other
# (options like -v have to come before the image name, and a bind mount needs an absolute host path)
docker run -d --name test1 --network host -v "$PWD/apache-test.conf":/etc/apache2/apache2.conf httpd:bookworm
docker run -d --name test2 --network host -v "$PWD/apache-test.conf":/etc/apache2/apache2.conf httpd:bookworm

# then get the PIDs from one of the containers
ps auxf | grep apache

# and pause all of its processes (PIDs 6031 to 6062 in my case) to prevent them from answering on their socket
for i in $(seq 6031 6062); do kill -STOP "$i"; done

# then test the connection distribution with
for i in $(seq 1 100); do timeout 1 curl https://localhost -k -s -o /dev/null && echo worked || echo failed; done | sort | uniq -c

The result looks something like this:

   46 worked
   54 failed

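So in the end I realized this is not really what I was searching for. The kernel keeps assigning roughly half of the connections to the sockets of the paused processes, where they are never answered. I wanted a smooth handover of a bound port from an old process to a new one, whereas SO_REUSEPORT is meant for balancing connections between multiple processes serving the same content, as a performance optimization.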
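The same effect can be reproduced without Docker. This is only an analogy to the experiment above, sketched in Python with made-up helper names: one SO_REUSEPORT listener answers every client, while a second listener on the same port is never accept()ed, which stands in for the processes paused with SIGSTOP. The kernel still completes the TCP handshake for the silent listener, so those clients connect fine and then time out waiting for an answer.

```python
import socket
import threading
from collections import Counter


def reuseport_listener(port: int = 0) -> socket.socket:
    """Create a listening TCP socket on 127.0.0.1 with SO_REUSEPORT set."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    sock.bind(("127.0.0.1", port))
    sock.listen(128)
    return sock


def answer_clients(listener: socket.socket, stop: threading.Event) -> None:
    """Accept connections and reply with b"ok" until `stop` is set."""
    listener.settimeout(0.1)
    while not stop.is_set():
        try:
            conn, _ = listener.accept()
        except socket.timeout:
            continue
        with conn:
            conn.sendall(b"ok")


def simulate_paused_listener(requests: int = 20, timeout: float = 0.2) -> Counter:
    healthy = reuseport_listener()
    port = healthy.getsockname()[1]
    paused = reuseport_listener(port)  # never accept()ed, like a stopped process
    stop = threading.Event()
    worker = threading.Thread(target=answer_clients, args=(healthy, stop))
    worker.start()
    results = Counter()
    try:
        for _ in range(requests):
            try:
                with socket.create_connection(("127.0.0.1", port), timeout=timeout) as client:
                    ok = client.recv(2) == b"ok"
            except OSError:  # includes the timeout while waiting for a reply
                ok = False
            results["worked" if ok else "failed"] += 1
    finally:
        stop.set()
        worker.join()
        healthy.close()
        paused.close()
    return results


if __name__ == "__main__":
    print(simulate_paused_listener())
```

The split between worked and failed varies from run to run, because it depends on which listener the kernel picks for each connection.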

eBPF as usual

However, I realized afterwards that one could write an (e)BPF program and attach it to the group of sockets that are using the same port. That gives full control over how the connections are distributed, although it's not straightforward because the sockets in the group can get reordered. The nice thing, though, is that one can replace that program at runtime without restarting the actual serving application. The socket(7) man page says:

For use with the SO_REUSEPORT option, these options allow the user to set a classic BPF (SO_ATTACH_REUSEPORT_CBPF) or an extended BPF (SO_ATTACH_REUSEPORT_EBPF) program which defines how packets are assigned to the sockets in the reuseport group. … Sockets are numbered in the order in which they are added to the group (that is, the order of bind(2) calls for UDP sockets or the order of listen(2) calls for TCP sockets). New sockets added to a reuseport group will inherit the BPF program. When a socket is removed from a reuseport group (via close(2)), the last socket in the group will be moved into the closed socket’s position.
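As a rough illustration of the classic BPF variant, the following Python sketch attaches a one-instruction program that always returns index 0, so every connection should go to the socket that called listen(2) first. Several things here are assumptions: Linux 4.5 or later, and the numeric value 51 for SO_ATTACH_REUSEPORT_CBPF, taken from asm-generic/socket.h because Python's socket module does not export the constant (a few architectures use a different value). The helper names are made up for this example.

```python
import ctypes
import select
import socket
import struct
from collections import Counter

# Assumption: value from <asm-generic/socket.h>, valid on x86 and arm64.
SO_ATTACH_REUSEPORT_CBPF = 51
BPF_RET_K = 0x06  # BPF_RET | BPF_K: return a constant


def reuseport_listener(port: int = 0) -> socket.socket:
    """Create a listening TCP socket on 127.0.0.1 with SO_REUSEPORT set."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    sock.bind(("127.0.0.1", port))
    sock.listen(128)
    return sock


def attach_constant_index(sock: socket.socket, index: int) -> None:
    """Attach a classic BPF program that always selects socket `index`."""
    # struct sock_filter { u16 code; u8 jt; u8 jf; u32 k; }
    program = struct.pack("HBBI", BPF_RET_K, 0, 0, index)
    buf = ctypes.create_string_buffer(program)
    # struct sock_fprog { unsigned short len; struct sock_filter *filter; }
    fprog = struct.pack("HP", 1, ctypes.addressof(buf))
    sock.setsockopt(socket.SOL_SOCKET, SO_ATTACH_REUSEPORT_CBPF, fprog)


def hot_standby_counts(connections: int = 20) -> Counter:
    active = reuseport_listener()        # index 0: first socket to listen
    port = active.getsockname()[1]
    attach_constant_index(active, 0)
    standby = reuseport_listener(port)   # index 1: inherits the program
    names = {active: "active", standby: "standby"}
    counts = Counter()
    clients = []
    try:
        for _ in range(connections):
            clients.append(socket.create_connection(("127.0.0.1", port), timeout=2))
        while sum(counts.values()) < connections:
            ready, _, _ = select.select([active, standby], [], [], 2.0)
            if not ready:
                break
            for listener in ready:
                conn, _ = listener.accept()
                conn.close()
                counts[names[listener]] += 1
    finally:
        for sock in clients + [active, standby]:
            sock.close()
    return counts


if __name__ == "__main__":
    print(hot_standby_counts())
```

Because the man page says the last socket moves into a closed socket's position, closing the active listener would turn the standby into index 0. That reordering is what makes anything beyond a constant index tricky.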

This is good to know but still doesn’t solve my problem %) I’ll probably write about an alternative solution soon.

EDIT: There is also an example available of how to create hot standby load balancing with SO_REUSEPORT and eBPF, by Hemanth Malla.