Tuesday, June 25, 2013

How fast can we make interpreted Python?

Why is Python slow? A lot of blame lies with the interpreter's ponderous data representation. Whenever you create anything in Python, be it a lowly number, a dictionary, or some user-defined object (e.g. a CatSeekingMissileFactoryFactory), under the hood Python represents these things as a big fat structure.

Why can't an integer just be represented as, well...an integer? Being a dynamic language, Python lacks any external information about the types of the objects it encounters: how is it to know whether one blob of 64 bits represents an integer, a float, or a heap-allocated object? At the very minimum, a type tag must be tacked onto each value. Furthermore, since the Python interpreter uses reference counting for garbage collection, each object is tasked with remembering how many other objects refer to it. Throw in a few other pressing exigencies and you end up with objects in Python that are significantly larger than an equivalent piece of data from a compiled language. To actually get useful work done the Python interpreter has to perform a delicate dance of wasteful indirection: checking tags here, calling some unknown function pointer there and (finally!) pulling data from the heap into registers.
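The overhead is easy to observe from inside the interpreter itself: sys.getsizeof reports how many bytes CPython devotes to an object (the exact figure varies with version and platform, so take the numbers below as illustrative):

```python
import sys

# A C int64 occupies 8 bytes; a CPython int also pays for a
# reference count, a type pointer, and a size field.
c_int_size = 8
py_int_size = sys.getsizeof(1)

print(py_int_size)                # typically 28 on 64-bit CPython 3.x
print(py_int_size > c_int_size)   # True: several times the C footprint
```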

The problem is not that Guido van Rossum doesn't care about performance or that Python is badly written! The internals of the interpreter are thoughtfully designed and optimized to eke out speed gains wherever the opportunity presents itself. However, Python's unholy marriage to its object representation seems to make the significant and enduring performance gap between Python and many other languages inescapable.

Didn't PyPy already solve this problem?

So, why not cast off the yoke of PyObjects? PyPy has already shown that, if you give a JIT compiler the freedom to lay things out however it wants, Python can be made a lot faster. However, all that speed comes at a terrible cost: the loss of compatibility with extension modules that rely on the old Python C API. PyPy is allowed to think of an int as just some bits in a register but your library still expects a big struct, with a type tag and refcount and all. Despite many efforts by the PyPy team to help the rest of us slowpokes transition to their otherwise very impressive system, all the libraries which expect PyObjects are too important to be abandoned (and often too large to be rewritten).

Can we do anything to make Python faster without having to give up libraries like SciPy?

How about a faster interpreter?

Earlier this year, my officemate Russell and I got it into our heads that CPython hadn't reached quite far enough into the bag of interpreter implementation tricks.

  • Why does Python use a stack-based virtual machine when a register-based machine might have lower dispatch overhead?
  • Why does CPython only perform peephole optimizations? Why not use simple dataflow analyses?
  • Attribute lookups are very repetitive; would some sort of runtime hints/feedback be useful for cutting down on hash function evaluations?
  • Why not use bit-tagging to store integers directly inside PyObject pointers? It's a common technique used in the implementation of other high-level languages. Perhaps the Python developers shouldn't have rejected it?
  • Call frames in Python seem a bit bloated and take a long time to set up; can we make function calls cheaper?
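The stack-machine overhead in the first question above is visible with the standard dis module: even a one-line addition becomes a sequence of pushes and pops, each paying its own dispatch cost (opcode names vary across CPython versions):

```python
import dis

def add(a, b):
    return a + b

# CPython pushes both operands onto the value stack, then pops them
# for the add; a register machine could encode this as one instruction.
dis.dis(add)
```

On older CPython versions this prints LOAD_FAST, LOAD_FAST, BINARY_ADD, RETURN_VALUE; newer versions fold the arithmetic into a generic BINARY_OP, but the push/pop shape is the same.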
To try out these ideas (and a few others), we implemented a new Python interpreter called Falcon. It's not meant as a complete replacement for Python, since Falcon falls back on the Python C API to implement constructs it doesn't support natively. However, through that fallback mechanism, Falcon should theoretically be able to run any Python code (albeit some of it without any performance gain at all).
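The bit-tagging idea from the list above can be sketched in pure Python (hypothetical helper names, illustrating the general technique rather than Falcon's actual code): steal the low bit of a word, set it to mark an immediate integer, and shift the payload up by one.

```python
# Hypothetical sketch of pointer tagging: the low bit distinguishes an
# immediate integer (bit set) from an aligned heap pointer (bit clear).
def tag_int(n):
    return (n << 1) | 1      # shift the payload up, set the tag bit

def is_tagged_int(word):
    return word & 1 == 1     # test the tag bit

def untag_int(word):
    return word >> 1         # arithmetic shift restores the value

assert untag_int(tag_int(42)) == 42
assert untag_int(tag_int(-7)) == -7
```

The price is one bit of integer range and an extra check before every arithmetic operation, which is part of why the idea has been contentious among the CPython developers.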

How much faster is Falcon? Unfortunately, not nearly as fast as we hoped. In the best cases, such as when lots of integers are being manipulated in a loop, Falcon might get up to 3X faster than the regular Python interpreter. More often, the gains hover around a meager 20%. Still, we found the project interesting enough to write a paper about it (which we submitted to the Dynamic Languages Symposium). 
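For a concrete sense of the "integers in a loop" case, here is the flavor of micro-benchmark (a sketch, not the paper's actual suite) where interpreter dispatch and boxing overhead dominate:

```python
import timeit

def count(n):
    # an integer-heavy loop: nearly every bytecode executed is a
    # compare, an add, or a jump, so interpreter overhead dominates
    total = 0
    i = 0
    while i < n:
        total += i
        i += 1
    return total

assert count(10) == 45
print(timeit.timeit(lambda: count(100000), number=10))
```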

If you are interested in trying Falcon, you can get it on github. Let me know how it works for you!

13 comments:

  1. This comment has been removed by a blog administrator.

  2. funny how history repeats. back in the day, the interpreted language(s) were Lisp and Smalltalk. one of the high points of Smalltalk-related optimization was the Self project. is it too much to suggest that learning from past wins is not pythonic enough?

  3. Hey, the title is incredibly misleading. "interpreted python" != "CPython". If you want to make CPython faster (e.g. extension modules working), you need to live within the world of refcounting, PyObject* etc.

    Making any interpreted Python faster is a simpler, but still hard, exercise. It's also IMO pretty pointless, since JITs are a good idea.

    Replies
    1. Hey Maciej (or do you go by Fijal?),

      I think you're technically correct but I don't think it's a distinction that many people care about. Like you said, an interpreter that's incompatible with C extensions seems totally pointless when you can just use PyPy instead.

      I assumed that it was understood that the only sort of "interpreted Python" that anyone cares about is the kind that preserves the current programming model & ecosystem.

  4. This is great work. I especially like how small your changes are (3000 lines), and how it keeps compatibility. That's a good result.

    Luajit2 and js implementations have shown how important baseline interpreter speed is to overall performance. Making the fastest python interpreter (CPython) faster is also useful for the majority of python users still using CPython.

  5. Neat project - I'd be interested to see how Falcon fares relative to base CPython when running the macro benchmarks at http://hg.python.org/benchmarks.

    We'd actually like to at least make the optimisation process in the CPython bytecode compiler smarter (see http://bugs.python.org/issue11549), it's just not at the top of any of the core developers' current todo lists (we mostly leave the pursuit of blazing speed to the PyPy, Numba and Cython folks - the last serious attempt at a CPython JIT was Unladen Swallow, which resulted in a nice macro benchmark suite for Python interpreter performance measurements and many LLVM improvements, but no CPython JIT). If someone was interested in refreshing Eugene's patches on that issue, and addressing some of the remaining review comments, that'd be pretty cool ;)

    The dispatch loop is already optimised pretty heavily (there's some crazy macro magic that means a lot of the time it jumps straight to the following opcode and avoids the dispatch overhead entirely), limiting the gains that can be earned there, and it's *really* hard to optimize attribute lookups in general without breaking the language semantics (although Python 3.3's shared key dictionaries finally brought that improved object implementation technique to CPython, substantially reducing the typical size of instance dictionaries).

    The tagged integer experiment may be worth retrying now that int/long unification is complete and CPython only has one integer type. Guido's opposition to the idea is still a major hurdle, but if the change isn't too invasive and shows a substantial improvement on the macro benchmark suite at http://hg.python.org/benchmarks, he may be willing to listen.

    As far as call frames go... can we just put those into a rocket and fire them into the sun? Or something? :)

    Replies
    1. Hey Nick,

      1) The optimizations in Falcon are pretty straightforward, or at least seem simple enough on a register bytecode. I'm sure a lot of what we're doing also applies to a stack machine. The patch you linked to, though, wants to do constant folding on the AST directly, which for some non-specific aesthetic reason strikes me as the wrong way to go.

      2) What are the odds that Guido would acquiesce to changing the bytecode from stack-based to register-based?

      3) I don't know *anything* about shared key dictionaries, or to be honest Python 3+ in general. I have some reading to do!

      4) I'll try bringing over more of the CPython benchmarks so we can draw broader/fairer comparisons against the standard interpreter.

      Thanks for your feedback!

  6. Hi, I started two projects to try to speed up Python. https://bitbucket.org/haypo/astoptimizer implements different optimizations on the AST tree. http://hg.python.org/sandbox/registervm/ is a fork of Python 3.4 replacing stack-based bytecode with register-based bytecode. registervm also implements various optimizations on the bytecode, like constant folding. registervm still has bugs; the project is not done.

    Replies
    1. Hey Haypo,

      Great minds think alike? How much of a performance gain do you get with astoptimizer without using any assumptions that violate the usual semantics of Python?

  7. There already is a Falcon programming language.
    http://www.falconpl.org/

  8. You should check Mark Shannon's HotPy 2 (http://www.hotpy.org/). He did some awesome things, like lower level (= more optimizable) bytecode.
