CPython 3.13 went further with an experimental copy-and-patch JIT compiler -- a lightweight JIT that stitches together pre-compiled machine code templates instead of generating code from scratch. It's not a full optimizing JIT like V8's TurboFan or a tracing JIT like PyPy's;
Good news: Python 3.15 adapts PyPy's tracing approach to the JIT, and there are real performance gains now:

I can't speak for everyone on the team, but I did try lazy basic block versioning (as in YJIT) in a fork of CPython. The main problem is that the copy-and-patch backend we currently have in CPython is not very amenable to self-modifying machine code. This makes inter-block jumps/fallthroughs very inefficient. It can be done, it's just a little strange. Also, for security reasons, we tried not to have self-modifying code in the original JIT, and we're hoping to stick to that. Everything has its tradeoffs -- design is hard! It's not too difficult to go from tracing to lazy basic blocks. Conceptually they're somewhat similar, as the original paper points out. The main thing we lack is the compact per-block type information that something like YJIT/Higgs has.
I guess while I'm here I might as well make the distinction:
- Tracing is the JIT frontend (region selection).
- Copy and Patch is the JIT backend (code generation).
We currently use both. PyPy uses meta-tracing: it traces the runtime itself, rather than the user's code as in CPython's tracing. I did take a look at PyPy's code, and a lot of the ideas in the improved JIT are actually imported directly from PyPy. So I have to thank them for their great ideas. I also talk to some of the PyPy devs.
Ending off: the team is extremely lean right now. Only 2 people were generously employed by ARM to work on this full time (thanks a lot to ARM too!). The rest of us are mostly volunteers, or have some bosses that like open source contributions and allow some free time. As for me, I'm unemployed at the moment and this is basically my passion project. I'm just happy the JIT is finally working now after spending 2-3 years of my life on it :). If you go to Savannah's website [1], the JIT is around 100% faster for toy programs like Richards, and even for big programs like tomli parsing, it's 28% faster on macOS AArch64. The JIT is very much a community effort right now.
[1]: https://doesjitgobrrr.com/?goals=5,10
PS: If you want to see how the work has progressed, click "all time" in that website, it's pretty cool to see (lower is faster). I have a blog explaining how we made the JIT faster here https://fidget-spinner.github.io/posts/faster-jit-plan.html.
I've been in the pandas (and now polars) world for the past 15 years. Staying in the sandbox gets most folks good-enough performance. (That's why Python is the language of data science and ML.)
I generally teach my clients to reach for numba first. Potentially lots of bang for little buck.
One area the article overlooks is running on GPUs. Some numpy and pandas (and polars) code can get a big speedup by using GPUs (same code, with just an import change).
The JIT work kenjin4096 describes is really promising though. If the tracing JIT in 3.15 actually sticks, a lot of this ladder just goes away for common workloads.
The daxpby example is a good one. Every time BLAS adds another special-case routine it's basically admitting the interface wasn't general enough. At some point you're just writing C with extra steps.
> 4 bytes of number, 24 bytes of machinery to support dynamism. a + b means: dereference two heap pointers, look up type slots, dispatch to int.__add__, allocate a new PyObject for the result (unless it hits the small-integer cache), update reference counts.
Would Python be a lot less useful without being maximally dynamic everywhere? Are there domains/frameworks/packages that benefit from this where this is a good trade-off?
I can't think of cases in strongly, statically typed languages where I've wanted something like monkey patching, and when I see monkey patching elsewhere there's often some reasonable alternative, or it only needs to be used very rarely.
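The per-object overhead quoted above is easy to observe from Python itself (the sizes shown are for a 64-bit CPython build and may vary by version):

```python
import sys

# A C int is 4 bytes; a Python int is a full heap object carrying a type
# pointer, a reference count, and digit storage.
print(sys.getsizeof(1))       # typically 28 bytes on 64-bit CPython
print(sys.getsizeof(2**100))  # larger ints allocate more digit storage
```

Every one of those extra bytes also has to be touched through a pointer dereference, which is where the "machinery to support dynamism" cost actually lands.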
Strip the object model. Keep Python.
You get most of the speed back without touching a compiler, and your code gets easier to read as a side effect.
I built a demo: dishonest code mutates state behind your back; honest code takes data in and returns data out. Classes vs pure functions in 11 languages, same calculation. Honest Python beats compiled C++ and Swift on the same problem. Not because Python is fast, but because the object model's pointer-chasing costs more than the Python VM overhead.
Don't take my word for it. It's dockerized and on GitHub. Run it yourself: honestcode.software, hit the Surprise! button.
But it does beat the pants off of JS/TS on V8 which is quite the surprise.
Also in the surprise category: honest Java is more than 2x faster than dishonest C++.
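For a flavour of the distinction (the names here are mine, not from the linked demo): "dishonest" code hides mutation inside an object, while "honest" code is a pure data-in/data-out function.

```python
# "Dishonest": a class that mutates its own state behind the caller's back.
class Accumulator:
    def __init__(self):
        self.total = 0

    def add(self, x):
        self.total += x  # hidden mutation


# "Honest": data in, data out, no hidden state.
def total(xs):
    return sum(xs)


acc = Accumulator()
for x in range(5):
    acc.add(x)

print(acc.total, total(range(5)))  # → 10 10
```

Both compute the same value; the claim in the demo is that the second style also avoids per-object allocation and pointer-chasing in the hot path.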
In Python 3.14 the support is there, but two years ago you could just import this library and it would just work normally.
Believe it or not, when you write a blog post in a different language, it really helps to use an LLM, even just to fix your grammar mistakes etc.
I assume that’s most likely what happened here too.
I have no problem with people using AI, especially to close a language gap.
If you disclose your usage I have a _lot_ more trust that effort has been put into the writing despite the usage
If the author is willing and able to write understandable English, I'd prefer to read their version (even if it's very imperfect) than the LLM-polished version.
Alternatively, I'll happily read an article that was written in the author's native language and then translated directly to English.
This one bothered me because it's pretty clearly neither of those things, and so it reads just like any other LLM-written/LLM-polished piece.
[edit: just realised 'willing and able' might sound snarky in some way! All I meant was to acknowledge that even if you can write in a second (or third, etc.) language, you might not want to]
> The remaining difference is noise, not a fundamental language gap. The real Rust advantage isn't raw speed -- it's pipeline ownership.
My own AI usage has scarred me into detecting these things.
That said, I think this article demonstrates that focusing on whether or not an article used AI might be focusing on the wrong “problem.” I appreciate being sensitive to the "smell" (the number of low-effort, AI posts flying around these days has made me sensitive too), but personally, I found this article both (1) easy to read and (2) insightful. I think the number of AI-written content lacking (2) is the problem.
I think almost everyone here agrees they don't want to read AI slop, but this submission clearly wasn't that as you admit yourself.
I'm not one of these rewrite-in-Rust types, but some isolated jobs are just so well suited to full-control systems programming that delegating to Rust is worth the investment, imo.
Another area worth investigating for IO-bound pipelines is different multiprocessing techniques. We recently got a boost from using ThreadPoolExecutor over standard multiprocessing, plus careful profiling to identify which tasks are left hanging and are best allocated their own worker. The price you pay, though, is shared memory (and therefore no built-in thread safety), which only works if your pipeline can be staggered.
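A minimal sketch of the ThreadPoolExecutor approach described above, where `fetch()` is a stand-in for a real IO-bound task:

```python
from concurrent.futures import ThreadPoolExecutor


def fetch(n):
    # placeholder for an IO-bound call (network request, disk read, ...)
    return n * 2


# Threads share memory, so this avoids multiprocessing's pickling and
# process-startup costs, at the price of managing thread safety yourself.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(fetch, range(8)))

print(results)  # → [0, 2, 4, 6, 8, 10, 12, 14]
```

For CPU-bound stages the GIL still serializes pure-Python work in threads, which is why the choice between threads and processes has to follow the profile.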
This is the "two language problem". (I would like to hear from people who have used Julia extensively, by the way; it claims to solve this problem. Does it really?)
It then gives you a bunch of new problems. First and foremost that you now work in a niche language with fewer packages and fewer people who can maintain the code. Then you get the huge JIT latency. And deployment issues. And lack of static tooling which Rust and Python have.
For me, as a research software engineer writing performance sensitive code, those tradeoffs are worth it. For most people, it probably isn’t. But if you’re the kind of person who cares about the Python optimization ladder, you should look into Julia. It’s how I got hooked.
If you then want fully trimmed, small executables, you have to start writing Julia similarly to how you write Rust.
To me the fact that this is even possible blows my mind and I have tons of fun coding in it. Except when precompiling things. That is something that really needs to be addressed.
They're only "fast" compared to slow interpreted languages like Python.
           nbody    spectral-norm
  C       2100ms     400ms
  Graal    211ms     212ms
  PyPy      98ms    1065ms
Seeing Graal and PyPy beat the gcc C versions suggests to me there's something wrong with the C version. Perhaps it needs -march=native, or something else is off. The C version would be a different implementation in the benchmark game, but usually those are highly optimised.

Edit: looking at [1], the top C version uses x86 intrinsics; perhaps the article's writer had to find a slower implementation to have it running natively on his M4 Pro? It would be good to know which C version he used; there are a few at [1]. The n-body benchmark is one where they specify that the same algorithm must be used for all implementations.
[1] https://benchmarksgame-team.pages.debian.net/benchmarksgame/...
Obviously the main point of the article is to compare different Python optimisations; however, "rewrite it in C/C++/Rust/Go" is an option that should be considered, and none of his optimisations on his M4 Pro beat the C option on my 6-year-old M1 Mac mini.
The rest of the numbers in the blog post use 500k iterations for the nbody simulation. Here are my numbers on the M1 Mac, using the default Clang installed with Xcode:
  Clang 17.0.0     0.06s
  Python 3.12.11   1.59s
  PyPy 3.11.13     0.23s
I used the fastest C code that doesn't use intrinsics at [1] and compiled with:

  clang -O3 -march=native nbody-gcc-6.c -o nbody.clang6
Used the python version from [2].
[1] http://benchmarksgame-team.pages.debian.net/benchmarksgame/p...
[2] https://github.com/cemrehancavdar/faster-python-bench/tree/m...
https://benchmarksgame-team.pages.debian.net/benchmarksgame/...
However if Rust with PyO3 is part of the alternatives, then Boost.Python, cppyy, and pybind11 should also be accounted for, given their use in HPC and HFT integrations.
JAX is basically a frontend for the XLA compiler, as you note. The secret sauce is two insights: (1) if you have enough control, you can modify the layout of tensor computations and permute them so they don't have to match the input program's layout, giving a more favorable memory access pattern; (2) most things are memory-bound, so XLA creates fusion kernels that combine many computations between memory accesses. I don't know if the Apple BLAS library has fused kernels with GEMM + some output layer, but XLA is capable of writing GEMM fusions and might pick them if they autotune faster on the given input/output shapes.
> But I haven't verified that in detail. Might be time to learn.
If you set the environment variable XLA_FLAGS=--xla_dump_to=$DIRECTORY then you'll find out! There will be a "custom-call" op if it's dispatching to BLAS; otherwise there will be a "dot" op in the post-optimization XLA HLO for the module. See the docs:
It's just somewhat unfortunate that I have to question every number and fact presented since the writing was clearly at least somewhat AI-assisted with the author seemingly not being upfront about that at all.
[0]: https://www.muna.ai/ [1]: https://docs.muna.ai/predictors/create
Anyone have an opinion on how TS would fare in this comparison?
The benefit is that JavaScript JIT compilers have a few decades of research behind them, all the way back to Smalltalk and SELF.
So in many cases you can still stay with V8 or JavaScript Core, instead of rewriting into something else, regardless of the whole rewriting into Rust that is now fashionable.
Just a little more to parse with my eyes and a little more to type with TypeScript.
But hey, with all these cool kids with their AI coding agents, reading and handwriting code may soon be obsolete!
iirc reverse-complement reads and writes a GB, fasta and mandelbrot write, regex-redux reads, k-nucleotide reads and uses a hash table.
https://benchmarksgame-team.pages.debian.net/benchmarksgame/...
> Missing @cython.cdivision(True) inserts a zero-division check before every floating-point divide in the inner loop. Millions of branches that are never taken.
I thought never-taken branches were essentially free. Does this mean something in the loop is messing with the branch predictor?
Some nuance: try transpiling to a garbage-collected, Rust-like language with fast compilation until you have millions of users.
Also use a combination of neural and deterministic methods to transpile depending on the complexity.
I don't know what languages you might have in mind. "Rust-like" in what sense?
If you're going to complain about some of those being slow, remember that they have various options spanning interpreter, bytecode, REPL, JIT, and AOT.
V-lang is the one I'm tinkering with. It's like Rust in terms of pattern matching as an expression, sum types, and ?T instead of exceptions.
Like Go, it has short compile times.
I try to keep my argument abstract (that you need to lower Python to something intermediate before Rust) for that reason.
I don't have a whole lot of experience hand-writing V-lang; it's mostly machine-generated from static Python.
But I find it convenient for what it does: Go that is less verbose, a single binary for distribution, and fewer tokens if you're using an agent.
GitHub.com/LadybugDB/ladybug-vlang has a CLI I recently wrote with an agent for a database I maintain.
Static Python with design by contract can be a stronger specification than natural language. @antirez was discussing this on his social media today.
Nonetheless, I would not bet anything too serious on such a small language.
Arguably, only the explosion of AI slop (maybe it's more profitable) has slowed the outrageous bombardment of Zig and Rust peddling. Neither language is even in the top 10, but one would not have known that from the number of previous HN headlines.
https://x.com/arundsharma/status/2033635906532638776
You should judge the language based on its license and what it does, not on the morality of people who wrote it and sponsored it.
Either produce an alternative language which does the same thing written by more "unshady" people and supported by a large corporation (talking about the "small language" comment here).
These human emotions won't matter in 6 months as code will be increasingly written by machines which have no such considerations.
The "AI" makes a few changes of the free content it sucked up, then gives it back based on correct prompts, for a fee. They got it for free, but you will pay a fee. If it is not something simple, it's often riddled with errors or takes a huge numbers of prompts to get it right. So some time was saved in typing, but then lost in code review and fixing errors. Humans that can understand what they are doing, troubleshoot, and architect are still needed.
People got interested in the language over wild claims that are simply impossible, CS-wise. Later, these were removed from the site, and you are left with a not-too-interesting language. I'm judging it in this late phase, with a bad taste in my mouth from the former shenanigans.
> in 6 months as code will be increasingly written by machines
Come on..
So what I do now (since Claude Code) is write a really bare-bones (and slow) pure-Python implementation (like I used to do for numba-, PyPy- or Cython-ready code), with minimal dependencies. Then I use the REPL, notebooks, and nice plotting tools to get a real understanding of the problem space and the intricacies of my algorithm/problem at hand. When done, I let Claude add tests and ask it to transpile to equivalent Rust, and boom! A flawless 1000x speed upgrade in minutes.
The great thing is I don't need to do the mental gymnastics to vectorize code in a write only mode like I've had to do since my Matlab days. Instead I can write simple to read for loops that follow my intent much better, and result in much more legible code. So refreshing!
And with PyO3 I can still expose the Rust lib to Python, and continue to use Python for glue and plotting.
I wish someone would write a stdlib without using it. My attempt from a few months ago is in a repo under the py2many org.
Quite hard to lose the #1 reason people use the language.
Most people don't even know what the C API is or why it slows things down.
Compatibility is made into a bigger deal than it is. That's the COBOL argument.
I wish the Python community focused more on why openclaw and opencode are getting written in TypeScript, not Python.
Why aren't agents more efficient at translating Python code into shippable end-user binaries, using fast interpreted-to-compiled agentic loops and attempting memory safety only for binaries/libraries with a large distribution?
The culture of calling them "Python" is one reason why JITs have such a hard time gaining adoption in Python. The problem isn't the dynamism (see Smalltalk, SELF, Ruby, ...), but rather the culture of rewriting code in C, C++, and Fortran and still calling it Python.
Python just has too-strong network effects. In the early days it was between Python and Lua (anyone remember Torch on Lua?). Go was still very much getting traction and in development.
There's also the strong association of Go with Google, C# with Microsoft, and Java with Oracle...
Go is criminally underrated in my opinion. It ticks so many boxes I'm surprised it hasn't seen more adoption.
Rust really ticks the "it got all the design choices right" boxes, but fighting the borrow checker and understanding smart pointers, lifetimes, and dispatch can be a serious cognitive burden for me.
Go and Rust try to solve very different problems. Rust takes on a lot of complexity to provide memory safety without a garbage collector, which is fine, but also unnecessary for a lot of problems.
https://panel-panic.com https://larsdu.github.io/Dippy6502/
The issue is... I barely know the language. There are vast gaps in my knowledge (e.g. lifetimes, lifetime elision, dynamic dispatch, Box, Ref, Arc, all that).
Nor do I know much of anything about the 6502 microprocessor. I know even less from having the LLM one shot most of the project rather than grinding through it myself.
So in the age of AI, the question of how easy it is to write with a language may be less important than the question of how we should go about reading code.
Quite honestly, I don't really know why I wouldn't use Rust for a wide variety of projects with AI, other than the compile/iteration time for larger projects. The compiler and static typing provide tremendously useful feedback and guardrails for LLM-based programming. The use of structs and traits rather than OOP prevents a lot of long-term architectural disasters by design. Deployment on the web can be done with WASM. And performance is good out of the box.
Writing Rust by hand though? I'm still terrible at it.
In my experience it's no faster than other better languages like Go, Rust or Kotlin.
> And for the 1% that aren't, you have 50 different flavors of making it faster.
Only for numerical code. You can't use something like Numpy to make Django or Mercurial faster.
And even when you could feasibly do the thing that everyone says to do - move part of your code to a faster language - the FFI is so painful (it always is) that you are much better just doing everything in that faster language from the start.
All of the effort you have to go through to make Python not slow is far less work than just "don't use Python". You can write Rust without thinking about performance and it will automatically be 20-200x faster than Python.
I actually did rewrite a Python project 1:1 in Rust once and it was approximately 50x faster. I put no effort into optimising the Rust code.
OK, I guess the harder question is: why isn't Python as fast as JavaScript?
This pretty much makes it impossible to change many of the internal details, and to significantly optimize it.
If we remove this requirement, we get the alternative runtimes, and if you check e.g. GraalPy, it has the same order of performance as JS, so your intuition is right. It's just that you have to drop support for a good chunk of what people use Python for, which is obviously a no-go for most applications. (Note: GraalPy can actually also run some C libraries, and in that case can cross-optimize across Python and C!)
Actually there is a pretty easy answer: worldwide, the amount of javascript being evaluated every day is many orders of magnitude higher than the amount of python. The amount of money available for optimizing it has thus been many orders of magnitude higher as well.
I'm not just saying this to vent. I honestly wonder if we could eventually move to a norm where people publish two versions of their writing and allow the reader to choose between them. Even when the original is just a set of notes, I would personally choose to make my own way through them.
Edit: it's strange to get downvoted while also getting replies that agree with me and don't seem to object.
(Also, I thought it wasn't supposed to be possible to edit after getting a reply?)
How can you suppose that this is not a good reason to object, especially days after https://news.ycombinator.com/item?id=47340079 ?
I find the style so reflexively grating that it's honestly hard for me to imagine others not being bothered by it, let alone being bothered by others being bothered.
Especially since I looked at previous posts on the blog and they didn't have the same problem.
If the author wrote a detailed rough draft, had AI edit, reviewed the output thoroughly, and has the domain knowledge to know if the AI is correct, then this could be a useful piece.
I suspect most authors _don’t_ fall in that bucket.
It's partly just a matter of taste; we can disagree on whether that's a good reason, but I'd be surprised if there were no writing styles that you personally find offputting.
The LLM smell is also a signal of low effort, and a signal that we as readers can't rely on our usual heuristics for judging credibility. The whole thing with LLMs is that they're great at producing polished, plausible-looking outputs, but they're still prone to bullshitting and making errors that don't match the usual human patterns. (And of course they're a great tool for churning out human-initiated disinformation.) If you don't have any kind of immune response against the LLM smell, I reckon you're probably absorbing more bs than you realise.
Is low effort really a valuable signal though? Or is it what's actually in the content that's valuable? Like here readers are literally saying that they found the content valuable "but AI smell". Why is there a "but"? Would there be a similar issue if the author had contracted a human assistant to do X? Definitely not, and I see no reason why the treatment should be different for AI.
There, FTFY :D
> looks inside
> the reference implementation of language is slow
Despite its content, this blogpost also pushes this exact "language slow" thinking in its preamble. I don't think nearly enough people read past introductions for that to be a responsible choice or a good idea.
The only thing worse than this is when Python specifically is outright taught (!) as an "interpreted language", as if an implementation detail like that were somehow a language property. So grating.
But yes, the very terminology "interpreted language" was designed for a different era and is somewhere between misleading and incomprehensible in context. (Not unlike "pass by value".)
If switching runtimes yields, say, 10x perf, and switching languages yields, say, 100x, then the language on its own was "just" a 10x penalty. Yet the presentation is "language is 100x slower". That's my gripe. And these are apparently conservative estimates as per the tables in the OP.
Not that metering "language performance" with numbers would be a super meaningful exercise to begin with, but still. The fact that most people just go with CPython does not escape me either. I do wonder though if people would shop for alternative runtimes more if the common culture was more explicitly and dominantly concerned with the performance of implementations, rather than of languages.