Hacker News | kenjin4096's comments

> I've never seen where the high level discussions were happening

Thanks for your interest. This is something we could improve on. We were supposed to document the JIT better in 3.15, but right now we're crunching for the 3.15 release. I'll try to get to updating the docs soon if there's enough interest. PEP 744 does not document the new frontend.

I wrote a somewhat high-level overview here in a previous blog post https://fidget-spinner.github.io/posts/faster-jit-plan.html#...

> does this mean each opcode is possibly split into two (or more?) stencils, with and without removed increfs/decrefs?

This is a great question; the answer is: not exactly! The key is to expose the refcount ops in the intermediate representation (IR) as a single op. For example, BINARY_OP becomes BINARY_OP, POP_TOP (DECREF), POP_TOP (DECREF). That way, instead of optimizing n different operations, we only need to expose the refcounting of those n operations and optimize a single op (POP_TOP). Thus, we just need to refactor the IR to expose refcounting (which was the work I divided up among the community).
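As a purely illustrative sketch (the uop names and data structures here are simplified stand-ins, not the actual CPython tier-2 IR), the idea might look like this:

```python
# Illustrative sketch only: not the real CPython IR. An op that
# consumes two stack inputs is rewritten so its two DECREFs become
# explicit POP_TOP uops that a single optimization pass can target.
SPLITS = {
    "BINARY_OP": ["BINARY_OP", "POP_TOP", "POP_TOP"],
}

def expose_refcounts(trace):
    """Expand each op into its refcount-exposed uop sequence."""
    out = []
    for op in trace:
        out.extend(SPLITS.get(op, [op]))
    return out

def eliminate_refcounts(uops):
    """The single optimization: turn provably-redundant POP_TOPs into
    no-ops.  (Here we pretend every POP_TOP is provably safe.)"""
    return ["POP_TOP_NOP" if u == "POP_TOP" else u for u in uops]
```

The point of the refactor is visible in the shape of the code: however many ops expose their refcounting via SPLITS, the optimizer only ever has to reason about POP_TOP.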

If you have any more questions, I'm happy to answer them either in public or email.


I saw your documentation PR, thank you!

I also did some reading and experiments, so here's a quick summary of what I've found out re: refcount elimination:

Previously, given an expression `c = a + b`, the compiler generated a sequence of two LOADs (which increment the inputs' refcounts), then a BINARY_OP that adds the inputs and decrements their refcounts afterwards (possibly deallocating the inputs).

But if the optimizer can prove that the inputs will definitely still have live references after the addition finishes (e.g. when `a` and `b` are local variables, or when they are immortals, as in `a+5`), then the entire incref/decref pair can be elided. So in the new version, the DECREF part of BINARY_OP was split out into separate uops, which the optimizer can then transform into POP_TOP_NOP.
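For reference, the tier-1 bytecode that feeds all of this can be inspected with the `dis` module (exact opcode names vary across CPython versions; on 3.11+ the two LOADs are followed by a single BINARY_OP):

```python
import dis

# Inspect the tier-1 bytecode for `c = a + b`: two LOADs for the
# inputs, one BINARY_OP, and a STORE for the result.
listing = dis.Bytecode("c = a + b").dis()
print(listing)
```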

And I'm assuming that although splitting an op this finely would normally cost some performance (since the compiler can no longer optimize across the split as well), here it's usually worth it: the optimization almost always succeeds, and even when it doesn't, the uops are generated in several variants for the various TOS-cache states (which are basically registers), so they still often codegen to just 1-2 instructions on x86.

One thing I don't entirely understand, though it's very specific to my experiment and I'm not sure if it's a bug or a special case: I looked at tier-2 traces for `for i in lst: (-i) + (-i)`, where `i` is an instance of a custom int-like class with overloaded methods (to control which optimizations happen). When its __neg__ returns a number, I see a nice sequence of

_POP_TOP_INT_r32, _r21, _r10.

But when __neg__ returns a new instance of the int-like class, then it emits

_SPILL_OR_RELOAD_r31, _POP_TOP_r10, _SPILL_OR_RELOAD_r01, _POP_TOP_r10, etc.

Is there some specific reason why the "basic" pop is not specialized for the TOS cache? Is it because it's the same opcode as in tier 1 and just not worth specializing, since it's optimized into specialized uops most of the time; or is it that it can't be optimized the same way because the decref may call user code?
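A hypothetical reconstruction of that kind of probe (the class name and details here are my own, not the original experiment; the trace itself would only be visible on a JIT-enabled build):

```python
# Hypothetical reconstruction of the probe described above.
class Num:
    def __init__(self, v):
        self.v = v

    def __neg__(self):
        # Variant 1: return a plain int.  The discarded result should
        # then permit a specialized _POP_TOP_INT uop in the trace.
        # Variant 2 (commented out) returns a new Num, a generic
        # object whose dealloc may run arbitrary user code.
        return -self.v
        # return Num(-self.v)

    def __add__(self, other):
        return self.v + getattr(other, "v", other)

lst = [Num(i) for i in range(10_000)]
for i in lst:
    (-i) + (-i)   # result discarded -> POP_TOP uops in the tier-2 trace
```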


Update: I put up a PR to document the trace recording interpreter https://github.com/python/cpython/pull/146110

I implemented most of the tracing JIT frontend in Python 3.15, with help from Mark to clean up and fix my code. I also coordinated some of the community JIT optimizer effort in Python 3.15 (note: NOT the code generator/DSL/infra, that's Mark, Diego, Brandt and Savannah). So I think I'm able to answer this.

I can't speak for everyone on the team, but I did try YJIT-style lazy basic block versioning in a fork of CPython. The main problem is that the copy-and-patch backend we currently have in CPython is not very amenable to self-modifying machine code, which makes inter-block jumps/fallthroughs very inefficient. It can be done, it's just a little strange. Also, for security reasons, we tried not to have self-modifying code in the original JIT, and we're hoping to stick to that. Everything has its tradeoffs: design is hard! It's not too difficult to go from tracing to lazy basic blocks; conceptually they're somewhat similar, as the original paper points out. The main thing we lack is the compact per-block type information that something like YJIT/Higgs has.

I guess while I'm here I might as well make the distinction:

- Tracing is the JIT frontend (region selection).

- Copy and Patch is the JIT backend (code generation).

We currently use both. PyPy uses meta-tracing: it traces the runtime itself, whereas in CPython's case we trace the user's code. I did take a look at PyPy's code, and a lot of ideas in the improved JIT are actually imported directly from PyPy, so I have to thank them for their great ideas. I also talk to some of the PyPy devs.
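To make the frontend half of that split concrete, here is a toy, entirely illustrative hot-loop trace recorder for a made-up bytecode, just the "region selection" part with no code generation, and with no relation to the real CPython implementation:

```python
# Toy illustration of "tracing = region selection": count backward
# jumps per loop head and, once a head is hot, record the opcode
# sequence of one full iteration as its trace.
HOT = 3

def interpret(program):
    counters, traces = {}, {}
    recording, trace = None, []   # loop-head pc being recorded
    x, pc = 0, 0
    while pc < len(program):
        opcode, *args = program[pc]
        if pc == recording and trace:
            # Back at the loop head: one full iteration recorded.
            traces[recording] = trace
            recording, trace = None, []
        if recording is not None:
            trace.append(opcode)
        if opcode == "INC":
            x += 1
            pc += 1
        elif opcode == "JUMP_BACK_IF_LT":
            limit, target = args
            if x < limit:
                counters[target] = counters.get(target, 0) + 1
                if counters[target] == HOT and target not in traces:
                    recording = target   # record the next iteration
                pc = target
            else:
                pc += 1
        else:  # "DONE"
            break
    return x, traces

program = [
    ("INC",),                    # pc 0: loop body
    ("JUMP_BACK_IF_LT", 10, 0),  # pc 1: loop while x < 10
    ("DONE",),
]
x, traces = interpret(program)
print(x, traces)   # 10 {0: ['INC', 'JUMP_BACK_IF_LT']}
```

A backend (copy-and-patch or anything else) would then take `traces[0]` and turn it into machine code; the frontend's only job was picking that region.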

Ending off: the team is extremely lean right now. Only two people are generously employed by ARM to work on this full time (thanks a lot to ARM too!). The rest of us are mostly volunteers, or have bosses who like open-source contributions and allow some free time for them. As for me, I'm unemployed at the moment and this is basically my passion project. I'm just happy the JIT is finally working after I've spent 2-3 years of my life on it :). If you go to Savannah's website [1], the JIT is around 100% faster for toy programs like Richards, and even for big programs like tomli parsing it's 28% faster on macOS AArch64. The JIT is very much a community effort right now.

[1]: https://doesjitgobrrr.com/?goals=5,10

PS: If you want to see how the work has progressed, click "all time" in that website, it's pretty cool to see (lower is faster). I have a blog explaining how we made the JIT faster here https://fidget-spinner.github.io/posts/faster-jit-plan.html.


Thank you for your contributions to the Python ecosystem. It definitely is inspiring to see Python, the language to which I owe my career and interest in tech, grow into a performant language year by year. This would not have been possible without individuals like you.

Tier-ups for trace-based JITs have been explored before. You can find an example here https://dl.acm.org/doi/abs/10.1145/2398857.2384630 I know LBBV isn't technically tracing, but it's quite similar, so I think similar concepts apply.


> Generally not that much has happened in 5 years, sometimes 10-15% improvements are posted that are later offset by bloat.

Sorry but unless your workload is some C API numpy number cruncher that just does matmuls on the CPU, that's probably false.

In 3.11 alone, CPython sped up by around 25% over 3.10 on pyperformance for x86-64 Ubuntu. https://docs.python.org/3/whatsnew/3.11.html#whatsnew311-fas...

3.14 is 35-45% faster than CPython 3.10 for pyperformance x86-64 Ubuntu https://github.com/faster-cpython/benchmarking-public

These speedups have been verified by external projects. For example, a Python MLIR compiler that I follow found a geometric-mean speedup of 36% moving from CPython 3.10 to 3.11 (page 49 of https://github.com/EdmundGoodman/masters-project-report)

Another academic benchmark here observed an around 1.8x speedup on their benchmark suite for 3.13 vs 3.10 https://youtu.be/03DswsNUBdQ?t=145

CPython 3.11 sped up enough that PyPy looks slightly slower in comparison. I don't know if anyone still remembers this, but back in the CPython 3.9 days PyPy had over a 4x speedup over CPython on the PyPy benchmark suite; now it's 2.8x for 3.11 on their website https://speed.pypy.org/

Yes CPython is still slow, but it's getting faster :).

Disclaimer: I'm just a volunteer, not an employee of Microsoft, so I don't have a perf report to answer to. This is just my biased opinion.


As a data point, here are timings from a Python program I've been working on lately (a prototype for some code I'll ultimately be writing in a lower-level language); it's near enough entirely Python code, with a bit of I/O:

(macOS Ventura, x64)

- System python 3.9.6: 26.80s user 0.27s system 99% cpu 27.285 total

- MacPorts python 3.9.25: 23.83s user 0.32s system 98% cpu 24.396 total

- MacPorts python 3.13.11: 15.17s user 0.28s system 98% cpu 15.675 total

- MacPorts python 3.14.2: 15.31s user 0.32s system 98% cpu 15.893 total

Wish I'd thought to try this test sooner now. (I generally haven't bothered with Python upgrades much, on the basis that the best version will be the one that's easiest to install, or, better yet, is there already. I'm quite used to the language and stdlib as they are, and I've just assumed the performance will still be as limited as it always has been...!)


I have a benchmark program I use, a solution to day 5 of the 2017 advent of code, which is all python and negligible I/O. It still runs 8.8x faster on pypy than on python 3.14:

    $ hyperfine "mise exec python@pypy3.11 -- python e.py" "mise exec python@3.9 -- python e.py" "mise exec python@3.11 -- python e.py" "mise exec python@3.14 -- python e.py"
    Benchmark 1: mise exec python@pypy3.11 -- python e.py
      Time (mean ± σ):     148.1 ms ±   1.8 ms    [User: 132.3 ms, System: 17.5 ms]
      Range (min … max):   146.7 ms … 154.7 ms    19 runs

    Benchmark 2: mise exec python@3.9 -- python e.py
      Time (mean ± σ):      1.933 s ±  0.007 s    [User: 1.913 s, System: 0.023 s]
      Range (min … max):    1.925 s …  1.948 s    10 runs
     
    Benchmark 3: mise exec python@3.11 -- python e.py
      Time (mean ± σ):      1.375 s ±  0.011 s    [User: 1.356 s, System: 0.022 s]
      Range (min … max):    1.366 s …  1.403 s    10 runs
     
    Benchmark 4: mise exec python@3.14 -- python e.py
      Time (mean ± σ):      1.302 s ±  0.003 s    [User: 1.284 s, System: 0.022 s]
      Range (min … max):    1.298 s …  1.307 s    10 runs
     
    Summary
      mise exec python@pypy3.11 -- python e.py ran
        8.79 ± 0.11 times faster than mise exec python@3.14 -- python e.py
        9.28 ± 0.13 times faster than mise exec python@3.11 -- python e.py
       13.05 ± 0.16 times faster than mise exec python@3.9 -- python e.py
https://gist.github.com/llimllib/0eda0b96f345932dc0abc2432ab...


> [...] and I've just assumed the performance will still be as limited as it always has been...!)

Historically, CPython performance has been so bad that massive speedups were quite possible once someone seriously got into it.


And indeed that has proven the case. But my assumption was that Python had been so obviously designed with performance so very much not in mind, that it had ended up in some local minimum from which meaningful escape would be impossible. But I didn't overthink this opinion, and I've always liked Python well enough for small programs anyway, so I don't mind having it proven wrong.


Whoops, thanks for noticing, fixed!


Yeah, I believe that statement and it seems to hold true for MSVC as well. Thanks for your work inspiring all of this btw!


So it seems I was wrong, [[msvc::musttail]] is documented! I will update the blog post to reflect that.

https://news.ycombinator.com/item?id=46385526


Thanks :), that was indeed my intention. I think the previous 3.14 mistake was actually a good one in hindsight: if I hadn't publicized our work early, I wouldn't have caught Nelson's attention, and Nelson probably wouldn't have spent a month digging into the Clang 19 bug. That would have meant the bug wasn't caught in the betas and might have shipped with the actual release, which would have been way worse. So in hindsight this was all a happy accident that I'm grateful for, as it means CPython still benefited overall!

Also, this time I'm pretty confident because there are two perf improvements here: the dispatch logic, and the inlining. MSVC can actually convert switch-case interpreters to threaded code automatically if some conditions are met [1]. However, it does not seem to do that for the current CPython interpreter; I suspect the interpreter loop is just too complicated to meet those conditions. The key point is also that we would again be relying on MSVC to do its magic, whereas this tail-calling approach gives more control to the writers of the C code. The inlining is pretty much impossible to convince MSVC to do except with `__forceinline` or by changing things to use macros [2]. However, we can't just mark every function as forceinline in CPython, as that might negatively affect other compilers.

[1]: https://github.com/faster-cpython/ideas/issues/183

[2]: https://github.com/python/cpython/issues/121263


I wish all self-promoting scientists and sensationalizing journalists had a fraction of the honesty and dedication to actual truth and proper communication of truths as you do. You seem to feel that it’s more important to be transparent about these kinds of technical details than other people are about their claims in clinical medical research. Thank you so much for all you do and the way you communicate about it.

Also, I’m not that familiar with the whole process, but I just wanted to say that I think you were too hard on yourself during the last performance drama. So thank you again and remember not to hold yourself to an impossible standard no one else does.


+1, reading through the post, the PR updating the documentation... thanks for being transparent, but also don't be so hard on yourself!

That was a very niche error, that you promptly corrected, no need to be so apologetic about it! And thanks for all the hard work making Python faster!


Thank you very much for the kind words, that means a lot to me!


Thanks for reading! For now, we maintain all 3 of the interpreters in CPython. We don't plan to remove the other interpreters anytime soon, probably never. If MSVC breaks the tail-calling interpreter, we'll just go back to building and distributing the switch-case interpreter. Windows binaries will be slower again, but such is life :(.

Also the interpreter loop's dispatch is autogenerated and can be selected via configure flags. So there's almost no additional maintenance overhead. The main burden is the MSVC-specific changes we needed to get this working (amounting to a few hundred lines of code).

> Impact on debugging/profiling

I don't think there should be any, at least for Windows. Though I can't say for certain.


That makes sense, thanks for the detailed clarification. Having the switch-case interpreter as a fallback and keeping the dispatch autogenerated definitely reduces the long-term risk.


Got it. I'll try to set one up this weekend.


Thank you so much!!

