wild
"The i7-4770K can perform 20k more FLOPs than C++" is an equally sensible statement (i.e. not)
Okay, but surely you know what they actually mean, right? Or are you being willfully obtuse? They are comparing CPython (the main Python implementation) running on the CPU with a kernel running on the GPU.
You are taking the statement too literally and forgetting it's a figure of speech, specifically metonymy.
When the author says it's millions of FLOPs faster on a GPU than in an interpreted programming language, they are not comparing the two directly but the algorithms that run on them, so the tools used to implement/run the algorithms stand in for the algorithms themselves.
It makes sense if you say "running similar logic -- like multiplying vectors and matrices -- on the CPU is millions of FLOPs slower than on the GPU". There is no category error there.
but I find it illuminating to compare what a certain piece of hardware can do in principle (what is possible) vs what I can "reach" as a programmer within a certain system/setup
in this case an NVIDIA A100 vs "Python" that does not reach an A100 (without the help of CUDA and PyTorch)
another analogy:
I find it useful to be able to compare what the fastest known way is to move a container from A to B using a certain vehicle (e.g. truck) and how that compares to how fast a person that can not drive that truck can do it + variants of it (on foot, using a cargo bike, using a boat via waterway, …)
I'm also interested in how much energy is needed, how much the hw costs and so on
Often there are many ways to do things, comparing is a great starting point for learning more
that said: Python can get to more FLOPs by changing the representation: https://docs.python.org/3/library/array.html
yes of course this is apples to oranges but that's kind of the point
it shows the vast span between specialized hardware throughput IFF you can use an A100 at its limit vs overhead of one of the most popular programming languages in use today that eventually does the "same thing" on a CPU
the interesting thing is why that is so
CPU vs GPU (latency vs throughput), boxing vs dense representation, interpreter overhead, scalar execution, layers upon layers, …
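To make the "boxing vs dense representation" point concrete, here is a rough sketch using only the standard library. It assumes CPython on a 64-bit platform, and the names (`values`, `dense`, `boxed_bytes`, `dense_bytes`) are just illustrative. It only shows the memory side of the story: arithmetic on an `array` still goes through the interpreter element by element, so this alone does not buy you FLOPs.

```python
import sys
from array import array

n = 100_000

# A list stores pointers; every element is a separate heap-allocated float object.
values = [float(i) for i in range(n)]

# array("d") stores raw C doubles back to back in one buffer.
dense = array("d", values)

# Boxed cost: the list's pointer table plus each float object it points to.
boxed_bytes = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)

# Dense cost: the array object including its buffer.
dense_bytes = sys.getsizeof(dense)

print(f"boxed list : {boxed_bytes / n:.1f} bytes per element")
print(f"dense array: {dense_bytes / n:.1f} bytes per element")
```

The exact numbers depend on the interpreter build, but the boxed version should come out several times larger, which also means several times more memory traffic per useful number.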
AMD EPYC 9965 FP32 throughput “at its limit”: 41.2 TFLOP/s (192 cores x 64 FP32 FLOP/cycle/core x 3.35GHz).
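The arithmetic behind that figure, spelled out. The three inputs are taken straight from the comment above and are not checked against AMD's spec sheet; the variable names are mine.

```python
cores = 192                    # core count quoted above
flop_per_cycle_per_core = 64   # FP32 FLOP per cycle per core, as quoted above
clock_ghz = 3.35               # clock in GHz, as quoted above

# cores * FLOP/cycle/core * Gcycles/s gives GFLOP/s; divide by 1000 for TFLOP/s.
peak_tflops = cores * flop_per_cycle_per_core * clock_ghz / 1000

print(f"{peak_tflops:.1f} TFLOP/s")  # prints "41.2 TFLOP/s"
```

This is a theoretical peak: it assumes every core issues its full FLOP-per-cycle rate at that clock the whole time.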
but it is very impressive how far modern CPUs get as well (also in smart phones!)
Python is 9.75 million times faster than Python.
Tool calling, searches, cache movement if used, and even debug steps all stall the GPU waiting for the CPU.
There was a test of turning one of the under-1B Qwen3+ models into a kernel that ran as one GPU pass without stalling on the CPU, and it saw quite a bit of perf lift over vLLM, I believe, showing this is still an issue.
It's been a month, so I don't remember more details than this.
The rest will be from "python float" (i.e. not from numpy) to C, which already gives you 2 to 3 orders of magnitude difference, and then another 2 to 3 from plain C to optimized SIMD.
See e.g. https://github.com/Avafly/optimize-gemm for how you can get 2 to 3 orders of magnitude just from C.
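If you want to see where the "python float" baseline sits on your own machine, a naive triple-loop matmul is enough. This is a minimal sketch, not taken from the linked repo; `matmul` and `measure_flops` are hypothetical helper names, and the rate it reports will vary a lot between machines and Python versions.

```python
import time


def matmul(a, b):
    """Naive matrix multiply on lists of lists of Python floats."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            s = 0.0
            for p in range(k):
                s += a[i][p] * b[p][j]  # one multiply and one add per iteration
            c[i][j] = s
    return c


def measure_flops(n=64):
    """Time one n x n matmul and return (result, estimated FLOP/s)."""
    a = [[float(i + j) for j in range(n)] for i in range(n)]
    b = [[float(i - j) for j in range(n)] for i in range(n)]
    start = time.perf_counter()
    c = matmul(a, b)
    elapsed = time.perf_counter() - start
    # 2 * n^3 floating point operations for a square matmul.
    return c, 2 * n**3 / elapsed


if __name__ == "__main__":
    _, rate = measure_flops()
    print(f"pure Python: {rate / 1e6:.1f} MFLOP/s")
```

Comparing that number with a plain C version of the same loop, and then with a tuned SIMD kernel, is one way to check the orders-of-magnitude claims above for yourself.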