Linux Network Performance Ultimate Guide
196 points | 1 month ago | 7 comments | ntk148v.github.io
c0l0
1 month ago
[-]
This would have been such a great resource for me just a few weeks ago!

We wanted to finally encrypt the L2 links between our DCs and got quotes from a number of providers for hardware appliances, and I was like, "no WAY this ought to cost that much!", and went off to try to build something myself that hauled Ethernet frames over a WireGuard overlay network at 10 Gbps using COTS hardware. I did pull it off after ten days of work or so, undercutting the cheapest offer by about 70% (and the most expensive one by about 95% or so...), but there was a lot of intricate reading and experimentation involved.

I am looking forward to validating my understanding against the content of this article - it looks very promising and comprehensive at first and second glance! Thanks for creating and posting it.

reply
pgraf
1 month ago
[-]
If I may ask, what is your use case such that an L3 tunnel does not suffice?
reply
c0l0
1 month ago
[-]
We have a number of proprietary network appliances present in all connected locations that require unhampered L2 communication (for mostly dumb reasons I think, but what can you do...), unfortunately.
reply
freedomben
1 month ago
[-]
Are you able to share your code? I'd be fascinated to see how you would do that.
reply
jasonjayr
1 month ago
[-]
I just shared this a moment ago in another comment, but:

https://github.com/m13253/VxWireguard-Generator

https://gitlab.com/NickCao/RAIT

Both build a set of WireGuard configurations so you can set up an L2 mesh and then run whatever routing protocol you want on top of them (Babel, BGP, etc.).

(Not the OP, but I use the first one in my own multi-site network mesh between DO, AWS, 2x physical DCs, and our office.)
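For a rough idea of what such a generated per-node setup boils down to, here's a hand-rolled sketch with plain wg/iproute2 - purely illustrative, not the actual output of either tool, and all keys, names and addresses are made up:

    # WireGuard underlay for one mesh node
    ip link add wg0 type wireguard
    wg set wg0 listen-port 51820 private-key /etc/wireguard/privatekey
    # one peer stanza per other mesh member; allowed-ips left wide open so the
    # routing protocol (Babel, BGP, ...) decides what actually goes where
    wg set wg0 peer <peer-public-key> endpoint peer1.example.net:51820 allowed-ips 0.0.0.0/0
    ip addr add 10.100.0.1/24 dev wg0
    ip link set wg0 up

    # the L2 part: a VXLAN netdev riding on the WireGuard underlay
    ip link add vx0 type vxlan id 100 dstport 4789 local 10.100.0.1 dev wg0
    bridge fdb append 00:00:00:00:00:00 dev vx0 dst 10.100.0.2   # one flood entry per peer
    ip link set vx0 up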

reply
c0l0
1 month ago
[-]
It's not really (my) code; it's just some clever configuration, mostly done via systemd-networkd.

At the "outside", there are two NICs with SFP+ ports that are connected via single-mode optical fiber that runs through the city - let's call these NICs eth0 on each of their nodes. The eth0 interfaces have RFC1918 IP addresses assigned and can talk IP with each other. Between those nodes, a WireGuard instance encrypts traffic in an inner RFC1918 network of its own - that is wg0 on each node. (Initially, I had IPv6 ULA networks prepared for these two purposes, but AFAICT there's still some important offload support missing for IPv6 in Linux, and performance was quite severely hampered by that.) Then, each of the nodes defines a GRETAP netdev that has, as its endpoint, the peer's WireGuard interface address - that interface is grt0.

Finally, on each side, another SFP+ NIC port (let's assume eth1) plugs into the local switch uplink port using a DAC. eth1 is configured in promiscuous mode, and some `tc-mirred(8)` magic makes sure every frame it receives gets replayed over grt0, and every frame received via grt0 gets replayed over eth1.

So it kinda looks like this in a (badly "designed") ad-hoc ASCII graph:

    [SWITCH]-<dac>-[ETH1]-<tc>-[GRT0]-[WG0]-[ETH0]-<fiber>-...
... with the whole shebang replicated once more, but in reverse, on the right-hand-side of the <fiber> cable/element.

An earlier iteration I (briefly ;)) had in operation featured a Linux bridge instead of tc, but it quickly turned out that it wouldn't work with a few L2 protocols that we unfortunately need in operation across these links (and group_fwd_mask won't cut it for them either, so patching the kernel would have been necessary), while tc-mirred can replay L2 traffic without any restrictions.
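For anyone who wants to reproduce the general shape of this, below is a rough iproute2/tc sketch of the grt0/tc-mirred plumbing on one side (the real setup is systemd-networkd configuration; all interface names and addresses here are placeholders, and MTU handling is left out):

    # GRETAP tunnel between the two WireGuard interface addresses
    ip link add grt0 type gretap local 10.0.0.1 remote 10.0.0.2
    ip link set grt0 up

    # eth1 faces the local switch: pick up every frame it receives...
    ip link set eth1 promisc on
    tc qdisc add dev eth1 handle ffff: ingress
    tc filter add dev eth1 parent ffff: protocol all matchall \
        action mirred egress redirect dev grt0

    # ...and replay everything arriving from the tunnel back onto eth1
    tc qdisc add dev grt0 handle ffff: ingress
    tc filter add dev grt0 parent ffff: protocol all matchall \
        action mirred egress redirect dev eth1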

reply
hyperman1
1 month ago
[-]
I wonder if it's worth it, with this number of tunables, to write software to tune them automatically, gradient-descent style: choose a parameter from a whitelist at random and slightly increase or decrease it, within a permitted range. Measure performance for a while, then undo the change if things got worse, or push further if things got better.
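A minimal bash sketch of that idea, assuming an iperf3 server is reachable to act as the benchmark and restricting itself to a few single-valued sysctls (the parameter list, target address and step size are just placeholders):

    #!/usr/bin/env bash
    # Random-perturbation tuner sketch: nudge one sysctl at a time, keep the change
    # if throughput improves, revert it if not. Needs root, iperf3, jq and bc.
    set -euo pipefail

    PARAMS=(net.core.rmem_max net.core.wmem_max net.core.netdev_max_backlog)
    TARGET=10.0.0.2   # iperf3 server used as the benchmark endpoint (placeholder)

    measure() {
        iperf3 -c "$TARGET" -t 10 -J | jq '.end.sum_received.bits_per_second'
    }

    best=$(measure)
    for _ in $(seq 1 50); do
        p=${PARAMS[RANDOM % ${#PARAMS[@]}]}
        old=$(sysctl -n "$p")
        # nudge the current value up or down by ~10%
        if (( RANDOM % 2 )); then new=$(( old + old / 10 )); else new=$(( old - old / 10 )); fi
        sysctl -qw "$p=$new"
        cur=$(measure)
        if (( $(echo "$cur > $best" | bc -l) )); then
            best=$cur                # keep the improvement
        else
            sysctl -qw "$p=$old"     # things got worse (or no better): undo
        fi
    done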
reply
samgaw
1 month ago
[-]
You might appreciate https://github.com/oracle/bpftune which does just that.
reply
dakiol
1 month ago
[-]
I find this cool, but as a software engineer I rarely get the chance to run any of the commands mentioned in the article. The reason: our systems run in containers that are stripped-down versions of some Linux, and I don't have shell access to production systems (and usually reproducing a bug in a dev or QA environment is useless because they are very different from prod in terms of load and the like).

So my only chance of running any of the commands in the article is when playing around with my own systems. I guess they would also be useful if I were working as a platform engineer.

reply
znpy
1 month ago
[-]
Most of the low-level stuff wouldn't work or would be useless anyway, as most container network interface (CNI) implementations make you work with veth pairs and do many userspace monstrosities.

This is one of the things I don't like much about Kubernetes: the networking model assumes you only have one NIC (like 99.99999% of cloud instances from cloud providers) and that your application is dumb enough not to need knowledge of anything beneath it.

The whole networking model could really get a 2020-era overhaul for simplification and improvement.
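For illustration, the veth plumbing is at least easy to see from the host; a rough sketch (the PID and ifindex are placeholders you'd fill in):

    # From inside the container's network namespace, eth0's iflink is the
    # ifindex of its veth peer in the host namespace...
    nsenter -t <container-pid> -n cat /sys/class/net/eth0/iflink
    # ...which you can then match against the host's interface list
    ip -o link | grep '^<that-ifindex>:'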

reply
Emigre_
1 month ago
[-]
If you have a staging environment that's as similar as possible to production, you can experiment and analyze things in a production-like environment where you do have access. That could help, depending on the situation.
reply
betaby
1 month ago
[-]
"net.core.wmem_max: the upper limit of the TCP send buffer size. Similar to net.core.rmem_max (but for transimission)."

and then we have `net.ipv4.tcp_wmem`, which brings up two questions: 1. why is there no IPv6 equivalent, and 2. what's the difference from `net.core.wmem_max`?

reply
adrian_b
1 month ago
[-]
net.core.wmem_max is a maximum value, as its name says.

net.ipv4.tcp_wmem is a triple, with minimum, default and maximum values. The maximum given here cannot exceed the previous value (net.core.wmem_max).

TCP is a protocol that should behave the same regardless of whether it is transported over IPv4 or IPv6.

See e.g.

https://docs.redhat.com/en/documentation/red_hat_data_grid/7...
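In practice the usual advice is to raise the two together, e.g. (values below are just an example, not a recommendation):

    sysctl net.core.wmem_max      # single value: cap for explicit setsockopt(SO_SNDBUF)
    sysctl net.ipv4.tcp_wmem      # triple: min, default, max used by TCP autotuning

    # raise both so the higher TCP maximum is actually usable
    sysctl -w net.core.wmem_max=16777216
    sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"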

reply
betaby
1 month ago
[-]
So `net.ipv4.tcp_wmem` applies to both IPv4 and IPv6? If so, that's absolutely not obvious.
reply
woleium
1 month ago
[-]
The three problems of computing:

0. Cache invalidation

1. Naming things

2. Off by one errors

reply
jagged-chisel
1 month ago
[-]
Two problems
reply
totallyunknown
1 month ago
[-]
What's missing a bit here is debugging and tuning for >100 Gbps throughput. Serving HTTP at that scale often requires kTLS because the first bottleneck that appears is memory bandwidth. Tools like AMD μProf are very helpful for debugging this. eBPF-based continuous profiling is also helpful to understand exactly what's happening in the kernel and user-space. But overall, a good read!
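For anyone curious, a quick way to check whether kTLS is available and actually in use on a box (interface name is a placeholder; the web server / TLS library still has to opt in):

    modprobe tls                  # kernel TLS module (CONFIG_TLS)
    ethtool -k eth0 | grep tls    # tls-hw-tx-offload / tls-hw-rx-offload => NIC offload support
    cat /proc/net/tls_stat        # TlsTxSw / TlsTxDevice counters grow when kTLS is in use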
reply
rjgonza
1 month ago
[-]
This seems pretty cool, thanks for sharing. So far, at least in my career, whenever we need "performance" we start with kernel bypass.
reply
hnaccountme
1 month ago
[-]
Thank you
reply