The article says this is about "sustained output token generation". For sustained usage, power is a huge factor in real-world performance. The H100 has a peak power draw of 700 W, while each RTX 5090 has a peak power draw of 575 W, for a total of 1150 W.
According to the article it's 78 tokens per second for the H100 and 80 tokens per second for the dual RTX 5090s. So you go up 450 W of power in exchange for only two extra tokens per second.
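To put that in energy-per-token terms (a back-of-the-envelope sketch using the peak-draw and throughput figures above; actual draw during inference will be lower, but the ratio is the point):

    # rough energy per generated token: peak watts / tokens-per-second
    h100 = 700 / 78          # ~9.0 J per token
    dual_5090 = 1150 / 80    # ~14.4 J per token, i.e. ~60% more energy per token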
Long story short, there is a reason data centers aren't using dual RTX 5090s over an H100. For sustained usage you will pay for it in electricity, in the extra infrastructure to support that increased draw, and in the extra heat generation and cooling.
Might make sense for a local personal hobby setup though.
Individual or small-team use, personal or professional, when 32 GB of VRAM (or 32+32) is sufficient, at a cost of $3.5k (or $3.5k + $3.5k) instead of $25k.
The RTX 5090 works out to about $110/GB of VRAM; the H100 to about $310/GB.
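The per-GB figures fall out of simple division (assuming the 80 GB H100 variant; prices as quoted above):

    # price per GB of VRAM
    rtx5090 = 3500 / 32    # ~109 $/GB
    h100 = 25000 / 80      # ~313 $/GB (assuming the 80 GB H100)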
(And even professional use in a small team will probably not keep the card at full throttle and peak power draw all day, outside of NN training projects.)
Which, from what I can gather, is quite an unrealistic setup for someone seriously considering buying an H100 for the home lab.
That said, the tensor parallelism[2] of vLLM is interesting.
[1]: https://github.com/DeutscheKI/llm-performance-tests#vllm-per...
[2]: https://rocm.blogs.amd.com/artificial-intelligence/tensor-pa...
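For what it's worth, enabling tensor parallelism across the two 5090s in vLLM is a single argument. A minimal sketch (the model name is just a placeholder, not from the article, and assumes the weights fit in 2x32 GB):

    # shard one model's weights across two GPUs with vLLM tensor parallelism
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
        tensor_parallel_size=2,                    # split across both RTX 5090s
    )
    out = llm.generate(["Hello"], SamplingParams(max_tokens=64))
    print(out[0].outputs[0].text)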
But yes, this is purely for a single-user setup. And since H100s are optimized for cost-efficient hosting of multiple concurrent users, it's kind of obvious that they are not a good choice here.
Not that I'm an expert.
That's basically "trust me, bro" and certainly not something I'd stake NDA compliance on.