“Use less NAT” is a sentence I really like to hear. For a customer project we were supposed to build a high-performance server for a web application. One of the requirements was that ingress connections must not go through a load balancer or a NAT. Throughput probably would not have been throttled by those techniques, and the application itself was crappy, old, non-performant Python software, but I liked the fact that the setup requirements were different than usual.

However, the concept was that for an update all application containers would be stopped, updated, and then started again, which could mean several minutes of downtime. So I tried to find a new concept for a rolling update.

Rerouting new TCP connections

Without a classic load balancer, an application usually binds directly to an IP:PORT combination and receives all connections to it. If a newer version of the application is to take over, there must be some other mechanism to reroute new incoming connections.

First I tried to teach the application to use the SO_REUSEPORT socket option, but I soon figured out that it would be too complicated to use as a rollover mechanism. The details are described in the blog post Loadbalancing TCP connections in the Linux kernel.

My second attempt was to use iptables to hijack all incoming connections and use a DNAT rule to route them to another application container. The nice thing about iptables and conntrack is that established connections keep using the original routing as long as they stay alive, while the DNAT rule is applied only to new connections. This breaks the requirement for NAT-less handling of connections, but the rule would only be used during the upgrade. Additionally, iptables can also do weighted load balancing of connections. You can read a good explanation in this blog post.
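The weighted variant can be sketched with the iptables statistic match. The helper below is hypothetical (the name print_weighted_rule is not part of the original setup) and only prints the rule instead of applying it. It assumes the ports used later in this post: 8000 for the old version and 8081 for the new one. New connections that are not picked by the random draw fall through and reach the application on port 8000 unchanged.

```shell
# Hypothetical helper: print (do not apply) a DNAT rule that sends roughly
# <percent> percent of new connections on port 8000 to port 8081.
print_weighted_rule() {
    percent="$1"
    # accept only plain integers without leading zeros
    case "$percent" in
        ''|0*|*[!0-9]*) echo "percent must be an integer from 1 to 99" >&2; return 1 ;;
    esac
    if [ "$percent" -gt 99 ]; then
        echo "percent must be an integer from 1 to 99" >&2
        return 1
    fi
    # turn e.g. 25 into 0.25 and 5 into 0.05
    prob="$(printf '0.%02d' "$percent")"
    echo "iptables -t nat -I PREROUTING -p tcp -m tcp --dport 8000 -m statistic --mode random --probability $prob -j DNAT --to-destination :8081"
}

# example: route about a quarter of the new connections to the new version
print_weighted_rule 25
```

Printing the rule first makes it easy to review it before running it as root.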

Simulation and testing

# create a container to simulate the existing application
docker run -d --rm --network host --name app-v0.1 hashicorp/http-echo -listen=:8000 -text="app-v0.1"
# create an intermediate container with a newer version
# this container will listen on a different port
docker run -d --rm --network host --name app-v0.2-tmp hashicorp/http-echo -listen=:8081 -text="app-v0.2-tmp"

# create a temporary iptables rule to reroute the traffic
# to the intermediate container
iptables -t nat -I PREROUTING -p tcp -m tcp --dport 8000 -j DNAT --to-destination :8081

Now one can wait until all connections to app-v0.1 are finished.

conntrack -L -p tcp --dport 8000
tcp      6 431975 ESTABLISHED src=10.82.3.224 dst=10.82.3.224 sport=43014 dport=8000 src=10.82.3.224 dst=10.82.3.224 sport=8000 dport=43014 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 use=1
tcp      6 84 TIME_WAIT src=10.82.3.224 dst=10.82.3.224 sport=60244 dport=8000 src=10.82.3.224 dst=10.82.3.224 sport=8000 dport=60244 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 use=1
tcp      6 71 TIME_WAIT src=10.82.3.224 dst=10.82.3.224 sport=47082 dport=8000 src=127.0.0.1 dst=10.82.3.224 sport=8081 dport=47082 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 use=1
conntrack v1.4.7 (conntrack-tools): 2 flow entries have been shown.
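Note that rerouted connections still show dport=8000 in their original tuple, so the filter above lists them as well. What tells them apart is the reply tuple: sport=8000 means the old application answers, sport=8081 means the intermediate one does. A minimal sketch of a wait loop built on that observation follows. The function names are made up for illustration, and counting only ESTABLISHED entries is an assumption that may be too lenient for connections in other half-closed states.

```shell
# Hypothetical helper: read `conntrack -L` output on stdin and count the
# ESTABLISHED entries whose reply tuple comes from the given port.
count_old_connections() {
    old_port="${1:-8000}"
    awk -v port="$old_port" '
        $4 == "ESTABLISHED" {
            n = 0
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^sport=/) {
                    n++
                    # the second sport= field belongs to the reply tuple
                    if (n == 2 && $i == ("sport=" port)) count++
                }
            }
        }
        END { print count + 0 }
    '
}

# Hypothetical helper: block until no connection is served by the old port.
# Needs root and the conntrack tool; it is only defined here, not called.
wait_for_drain() {
    while [ "$(conntrack -L -p tcp --dport 8000 2>/dev/null | count_old_connections 8000)" -gt 0 ]; do
        sleep 5
    done
}
```

Running wait_for_drain as root after adding the DNAT rule returns once the old application has no established connections left.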

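Be careful about the value of net.netfilter.nf_conntrack_tcp_timeout_established, which tells conntrack when to forget about idle established connections. The default is 432000 seconds (5 days), which matches the 431975 seconds remaining on the ESTABLISHED entry above.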
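The timeout is reported in seconds, which is easy to misread. As a small sketch, the hypothetical helper below converts such a value into days and hours. It is fed the 431975 seconds remaining from the ESTABLISHED entry in the output above; the commented lines show how the live sysctl value could be read on a host where the nf_conntrack module is loaded.

```shell
# Hypothetical helper: convert a timeout in seconds into whole days and hours.
seconds_to_days_hours() {
    secs="$1"
    echo "$((secs / 86400))d $(((secs % 86400) / 3600))h"
}

# On a host with conntrack loaded, the configured timeout could be read like this:
# timeout="$(sysctl -n net.netfilter.nf_conntrack_tcp_timeout_established)"
# seconds_to_days_hours "$timeout"

# remaining lifetime of the ESTABLISHED entry shown above
seconds_to_days_hours 431975   # prints "4d 23h"
```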

When all connections to the old app-v0.1 are gone, we stop it, create a new container for permanent usage on the original port, and remove the DNAT rule. Then we wait again until the connections to app-v0.2-tmp are finished, shut it down gracefully, and tear down the intermediate container.

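# stop the old container so that port 8000 becomes free
docker stop app-v0.1

# create a new permanent container with the updated version
docker run -d --rm --network host --name app-v0.2 hashicorp/http-echo -listen=:8000 -text="app-v0.2"

# remove the temporary DNAT rule
iptables -t nat -D PREROUTING -p tcp -m tcp --dport 8000 -j DNAT --to-destination :8081

# once the connections to port 8081 have drained, stop the intermediate container
docker stop app-v0.2-tmp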

Et voilà, we upgraded to a new software version without a load balancer.

Footnotes

If you want to test this on localhost, you have to use the following iptables rule instead, since locally generated traffic goes through the OUTPUT chain and not through the PREROUTING chain:

iptables -t nat -A OUTPUT -p tcp -o lo --dport 8000 -j REDIRECT --to-ports 8081