φ-node: June 2013

Why is Python slow? A lot of blame lies with the interpreter's ponderous data representation. Whenever you create anything in Python, be it a lowly number, a dictionary, or some user-defined object (i.e. a CatSeekingMissileFactoryFactory), under the hood Python represents these things as a big fat structure.

Why can't an integer just be represented as, well...an integer? Being a dynamic language, Python lacks any external information about the types of the objects it encounters: how is it to know whether one blob of 64 bits represents an integer, a float, and or a heap allocated object? At the very minimum, a type tag must be tacked onto each value. Furthermore, since the Python interpreter uses reference counting for garbage collection, each object is tasked with remembering how many other objects refer to it. Throw in a few other pressing exigencies and you end with objects in Python that are significantly larger than an equivalent piece of data from a compiled language. To actually get useful work done the Python interpreter has to perform a delicate dance of wasteful indirection: checking tags here, calling some unknown function pointer there and (finally!) pulling data from the heap into registers.

The problem is not that Guido van Rossum doesn't care about performance or that Python is badly written! The internals of the interpreter are thoughtfully designed and optimized to eke out speed gains wherever the opportunity presents itself. However, Python's unholy marriage to its object representation seems to make the significant and enduring performance gap between Python and many other languages inescapable.

Didn't PyPy already solve this problem?

So, why not cast off the yoke of PyObjects? PyPy has already shown that, if you give a JIT compiler the freedom to lay things out however it wants, Python can be made a lot faster. However, all that speed comes at a terrible cost: the loss of compatibility with extension modules that rely on the old Python C API. PyPy is allowed to think of an int as just some bits in a register but your library still expects a big struct, with a type tag and refcount and all. Despite many efforts by the PyPy team to help the rest of us slowpokes transition to their otherwise very impressive system, all the libraries which expect PyObjects are too important to be abandoned (and often too large to be rewritten).

Can we do anything to make Python faster without having to give up libraries like SciPy?

How about a faster interpreter?

Earlier this year, my officemate Russell and I got it into our heads that CPython hadn't reached quite far enough into the bag of interpreter implementation tricks.

Why does Python use a stack-based virtual machine when a register-based machine might have lower dispatch overhead?
Why does CPython only perform peephole optimizations? Why not use simple dataflow analyses?
Attribute lookups are very repetitive, would some sort of runtime hints/feedback be useful for cutting down on hash function evaluations?
Why not use bit-tagging to store integers directly inside in PyObject pointers? It's a common technique used in the implementation of other high level languages. Perhaps the Python developers shouldn't have rejected it?
Call frames in Python seem a bit bloated and take a long time to set up, can we make function calls cheaper?

To try out these ideas (and a few others), we implemented a new Python interpreter called Falcon. It's not meant as a complete replacement for Python since Falcon falls back on the Python C API to implement constructs it doesn't support natively. However, though that fallback mechanism, Falcon should theoretically be able to run any Python code (albeit, some without any performance gains at all).

How much faster is Falcon? Unfortunately, not nearly as fast as we hoped. In the best cases, such as when lots of integers are being manipulated in a loop, Falcon might get up to 3X faster than the regular Python interpreter. More often, the gains hover around a meager 20%. Still, we found the project interesting enough to write a paper about it (which we submitted to the Dynamic Languages Symposium).

If you are interested in trying Falcon, you can get it on github. Let me know how it works for you!

In the early days of computer vision — before Big Data, small data, or even very much data at all — it was popular to manually construct vision algorithms out of neighborhood operations. A neighborhood operation is an image transformation which replaces each pixel with some simple function of the surrounding pixels' values. For example, if you replace each pixel with the median of its neighborhood, you get back an image which looks similar but is somewhat less detailed and (usefully) less noisy. Alternatively, instead of the median value, you can replace each pixel with the minimum or maximum of its neighbors. These two operations turned out to be useful enough to warrant getting their own names. Windowed maximum, which smears bright parts of the image and makes them grow outward, is called dilation. Windowed minimum, which makes images look pitted and creepy, is called erosion.


Dog with low self-regard	Dilation	Erosion

Using only successive applications of dilation and erosion it is possible to express a wide array of interesting image transformations. The composition of these two operators was considered important enough to warrant becoming a field unto itself called mathematical morphology. Morphological image transformations are still widely used for preprocessing, feature extraction, and if you squint you'll even see their likeness in the max-pooling operation used for subsampling by convolutional neural networks.

Python Implementation

Now that I've hopefully convinced you of their usefulness, let's see how to implement morphological image filters in Python. As a first cut, let's just loop over all the positions (i,j) in an image and at each pixel we'll loop over the surrounding k×k square of pixels to pick out the maximum. Technically, we could use a more complicated neighborhood (i.e. a surrounding circle of pixels or really any shape at all), but the humble square will do for this post.

Simple enough, but how quickly does it run? Anyone who has tried writing numerically intensive code in pure Python knows the answer will probably fall somewhere between "not well" and "would have been faster on pen and paper". Running this algorithm on a 1024×768 image with a 7×7 neighborhood on my MacBook Air takes almost half a minute. Pathetic!

Python programmers have become trained to reach for a compiled library whenever they encounter a significant performance bottleneck. In this case, SciPy comes with a full suite of morphological operators that have been implemented efficiently in C. Performing the same dilation using SciPy takes only 34 milliseconds. That's an approximately 750X performance gap between Python and native code.

Runtime Compilation à la Parakeet

So, Python's bytecode interpreter is hopelessly slow, but is there any alternative to relying on lower-level languages for performance? In my ideal world, all of the code I write would sit comfortably inside a .py file. Toward this end, I've been working on Parakeet, a runtime compiler which uses Python functions as a template for dynamically generated native code. By specializing functions for distinct argument types and then performing high-level optimizations on the typed code, Parakeet is able to run code orders of magnitude faster than Python. It's important to note that Parakeet is not capable of speeding up or even executing arbitrary programs — it's really a runtime compiler for an array oriented subset of Python. Luckily, all of the operations used to implement dilation (array indexing, looping, etc..) fit within this subset.

To get Parakeet to accelerate a Python function you have to:

Wrap that function in Parakeet's @jit decorator.
Once wrapped, any calls to the function get routed through Parakeet's type specializer, creating different versions of the function for distinct input types.
Each typed version of a function is then extensively optimized and turned into native code using LLVM.

If I stick an @jit decorator above dilate_naive, I find that it now runs in only 51 milliseconds. That is, almost 500X faster than Python, though still a lot slower than the SciPy implementation. Not bad for code that at least looks like it's written in Python.

Can We Get Faster Than SciPy?

If you poke around the source of SciPy morphology operators, you'll discover that SciPy doesn't actually use the naive algorithm above. SciPy's sagacious authors took note of the fact that windowed minima/maxima are separable filters, meaning they can be computed more efficiently by first performing 1D transformations on all the rows of the image, followed by 1D transformations on all the columns. Below is a Python implementation of this two-pass dilation which only inspects 2k neighboring pixels per output pixel, unlike the less efficient code above which has to look at k² neighbors.

When I wrap this version with Parakeet's @jit decorator, it only takes 17 milliseconds. That's 2X faster than the precompiled SciPy version!

Comparison with Numba and PyPy

Out of curiosity, I wanted to see how Parakeet's performance compares with the better known projects that are attempting to accelerate Python: Numba and PyPy.

Numba is a runtime compiler which also aims to speed up numerical code by type specializing functions for distinct argument types. Like Parakeet, Numba uses LLVM for code generation on the back end and aims to speed up algorithms which consume and manipulate NumPy arrays. Numba is more ambitious in that it seeks to support all of Python, whereas Parakeet carves a subset of the language which is amenable to efficient compilation.

Switching from Parakeet to Numba at first seemed simple, merely a matter of replacing @parakeet.jit with @numba.autojit. However, though the dilation benchmark managed to execute successfully, it was slower than CPython! To add insult to injury, Numba's compilation times were an order of magnitude slower than Parakeet. Since I know that Numba's authors are no chumps, I emailed Mark Florisson to figure out how I was misusing their compiler.

It turns out that when Numba is confronted with any construct it doesn't support directly, it switches to a (very) slow code path. From my perspective, it seemed like it had taken a half-day off and gone for a long lunch. Parakeet is very different in this regard: if Parakeet can't execute something efficiently it complains loudly during compilation and gives up.

In this case, it was the min and max builtins which were giving Numba trouble. Once I changed the filter implementation to use an inline if-expression instead — xrange(stop_i if stop_i >= m else m) instead of xrange(min(stop_i, m)) — then Numba ran with compile and execution times uncomfortably close to Parakeet's. I'm feeling the heat.

PyPy differs dramatically from both Parakeet and Numba in that rather than generating native code from within the Python interpreter it is a completely distinct implementation of the Python language. PyPy uses trace compilation to generate native code. Getting numerical code to run in PyPy can be tricky since it currently only supports a rudimentary subset of NumPy via a reimplementation called NumPyPy. The project seems to be making steady progress, but due to the daunting scope of NumPy, there's still a lot of basic functionality missing.

The source for the timings below is in the Parakeet repository. Parakeet and Numba both perform type specialization and compilation upon the first function invocation, so that time is show below separate from the execution time of a second function call. CPython's bytecode compilation is so fast as to be negligible, so I didn't even attempt to time it. Finding out how long PyPy takes to generate code from a hot path would be interesting but I have no idea how to access that kind of information, so I also left PyPy's compile times as "n/a".

	Algorithm	Compile Time	Execution Time
SciPy	Separable O(k)	n/a	34ms
CPython	Naive O(k²)	n/a	25,480ms
PyPy	Naive O(k²)	n/a	657ms
Numba	Naive O(k²)	286ms	61ms
Parakeet	Naive O(k²)	180ms	51ms
CPython	Separable O(k)	n/a	7,724ms
PyPy	Separable O(k)	n/a	429ms
Numba	Separable O(k)	407ms	19ms
Parakeet	Separable O(k)	238ms	17ms

Parakeet wins on performance over Numba by the thinnest margin. When Mark (the Numba developer) ran the same benchmark on a different computer, he actually saw Numba coming in slightly ahead. Maybe the difference doesn't even rise above the level of statistical noise? Either way, both Numba and Parakeet seem like reasonable choices for generating type-inferred native code from Python functions.

A crucial distinction between the two projects is that Parakeet has been designed as a domain specific language. Parakeet is embedded array-oriented language within Python but is not and never will be itself Python. If you wanted to, for example, create user-defined objects inside of compiled code, with Parakeet you're out of luck. The advantage of this approach is that Parakeet can guarantee that whatever it compiles will have reasonably good performance. Numba, on the other hand, can technically execute arbitrary Python but still has some implicit language boundaries demarcating what will or won't run efficiently.

Wither Parallelism?

A nice property of image filtering algorithms that I have completely ignored in this post is that they are usually perfectly parallelizable. Parakeet started out as a parallelizing compiler and only recently got stripped down to generating single core code. Parallelism is coming back in a cleaner form, so the next post will hopefully be about doing image filtering in Parakeet an order of magnitude (or two) faster.

For now, you can try out Parakeet and it might speed up your code. Or it might fail mysteriously — it's still a work in progress and you've been warned!

φ-node

Tuesday, June 25, 2013

How fast can we make interpreted Python?