FilterHN

A cache-friendly IPv6 LPM with AVX-512 (linearized B+-tree, real BGP benchmarks)

41 points

by debugga

8 hours ago

| past

| 5 comments

| github.com

| HN

▲

debugga

8 hours ago

[-]

Clean-room, portable C++17 implementation of the PlanB IPv6 LPM algorithm.

Includes: - AVX-512 SIMD path + scalar fallback - Wait-free lookups with rebuild-and-swap dynamic FIB - Benchmarks on synthetic data and real RIPE RIS BGP (~254K prefixes)

Interesting result: on real BGP + uniform random lookups, a plain Patricia trie can sometimes match or beat the SIMD tree due to cache locality and early exits.

Would love feedback, especially comparisons with PopTrie / CP-Trie.

▲

talsania

3 hours ago

[-]

254K prefixes with skewed distribution means early exits dominate, and no SIMD throughput advantage survives a branch that terminates at depth 3. The interesting edge case is deaggregation events where prefix counts spike transiently and the rebuild-and-swap FIB has to absorb a table that's temporarily 2x normal size

▲

Sesse__

5 hours ago

[-]

The obvious question, I guess: How much faster are you than whatever is in the Linux kernel's FIB? (Although I assume they need RCU overhead and such. I have no idea what it all looks like internally.)

▲

zx2c4

5 hours ago

[-]

I likewise wonder from time to time whether I should replace WireGuard's allowedips.c trie with something better: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...

▲

Sesse__

4 hours ago

[-]

I use Wireguard rarely enough that the AllowedIPs concept gets me every time. It gets easier when I replace it mentally with “Route=” :-)

▲

zx2c4

3 hours ago

[-]

It's like a routing table on the way out and an ACL on the way in. Maybe an easier way to think of it.

▲

Sesse__

2 hours ago

[-]

Sure, but how does this differ from a routing table with RPF (which is default in Linux already)?

▲

zx2c4

2 hours ago

[-]

It's associated per-peer, so it assures a cryptographic mapping between src ip and public key.

▲

newman314

5 hours ago

[-]

I wonder if this would port nicely over to rustybgp.

▲

throwaway81523

4 hours ago

[-]

IPv6 longest-prefix-match (LPM).

▲

ozgrakkurt

6 hours ago

[-]

Why detect avx512 in build system instead of using #ifdef ?

▲

ozgrakkurt

2 hours ago

[-]

It actually does detect it using ifdef [0] but uses cmake stuff to avoid passing "-mavx512" kind of flags to the compiler [1].

[0] https://github.com/esutcu/planb-lpm/blob/748d19d5fbd945cefa3...

[1] https://github.com/esutcu/planb-lpm/blob/748d19d5fbd945cefa3...

▲

NooneAtAll3

5 hours ago

[-]

I wonder how this would look like in risc-v vector instructions

▲

camel-cdr

1 hour ago

[-]

The lines

    __m512i vx  = _mm512_set1_epi64(static_cast<long long>(x));
    __m512i vk  = _mm512_load_si512(reinterpret_cast<const __m512i*>(base));
    __mmask8 m  = _mm512_cmp_epu64_mask(vx, vk, _MM_CMPINT_GE);
    return static_cast<std::uint32_t>(__builtin_popcount(m));

would be replaced with:

    return __riscv_vcpop(__riscv_vmsgeu(__riscv_vle64_v_u64m1(base, FANOUT), x, FANOUT), FANOUT);

and you set FANOUT to __riscv_vsetvlmax_e32m1() at runtime.

Alternatively, if you don't want a dynamic FANOUT you keep the FANOUT=8 (or another constant) and do a stripmining loop

    size_t cnt = 0;
    for (size_t vl, n = 8; n > 0; n -= vl, base += vl) {
     vl = __riscv_vsetvl_e64m1(n);
     cnt += __riscv_vcpop(__riscv_vmsgeu(__riscv_vle64_v_u64m1(base, vl), x, vl), vl);
    }
    return cnt;

This will take FANOUT/VLEN iterations and the branches will be essentially almost predicted.

If you know FANOUT is always 8 and you'll never want to changes it, you could alternatively use select the optimal LMUL:

    size_t vl = __riscv_vsetvlmax_e32m1();
    if (vl == 2) return __riscv_vcpop(__riscv_vmsgeu(__riscv_vle64_v_u64m4(base, 8), x, 8), 8);
    if (vl == 4) return __riscv_vcpop(__riscv_vmsge(u__riscv_vle64_v_u64m2(base, 8), x, 8), 8);
    return __riscv_vcpop(__riscv_vmsgeu(__riscv_vle64_v_u64m1(base, 8), x, 8), 8);

▲

sylware

3 hours ago

[-]

Sad it is c++.

▲

ozgrakkurt

2 hours ago

[-]

Why? It is 500 lines of pretty basic code. You can port it if you don't like C++ to any language, assuming you understand what it is.

It does look a bit AI generated though

▲

simoncion

1 hour ago

[-]

> It does look a bit AI generated though

These days, when I hear a project owner/manager describe the project as a "clean room reimplementation", I expect that they got an LLM [0] to extrude it. This expectation will not always be correct, but it'll be correct more likely than not.

[0] ...whose "training" data almost certainly contains at least one implementation of whatever it is that it's being instructed to extrude...