<h1>How fast can we make interpreted Python?</h1>
<p><i>Alex R, June 25, 2013</i></p>
<p>
Why is Python slow? A lot of blame lies with the interpreter's ponderous <a href="http://docs.python.org/2/c-api/structures.html">data representation</a>. Whenever you create anything in Python, be it a lowly number, a dictionary, or some user-defined object (e.g. a <span style="font-family: Arial, Helvetica, sans-serif;">CatSeekingMissileFactoryFactory</span>), under the hood Python represents these things as a big fat structure.
<p>
Why can't an integer just be represented as, well...an integer? Being a dynamic language, Python lacks any external information about the types of the objects it encounters: how is it to know whether one blob of 64 bits represents an integer, a float, or a heap-allocated object? At the very minimum, a type tag must be tacked onto each value. Furthermore, since the Python interpreter uses <a href="http://blogs.msdn.com/b/abhinaba/archive/2009/01/27/back-to-basics-reference-counting-garbage-collection.aspx">reference counting</a> for garbage collection, each object is tasked with remembering how many other objects refer to it. Throw in a few other pressing exigencies and you end up with objects in Python that are significantly larger than an equivalent piece of data from a compiled language. To actually get useful work done, the Python interpreter has to perform a delicate dance of wasteful indirection: checking tags here, calling some unknown function pointer there and (finally!) pulling data from the heap into registers.<br />
</p>
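The cost of all this boxing is easy to observe from within Python itself. Here's a small sketch using <code>sys.getsizeof</code> (exact sizes vary across versions and platforms, but the overhead is always several times the payload):

```python
import sys

# A raw 64-bit integer occupies 8 bytes. A boxed Python int also carries
# a reference count, a type pointer, and size bookkeeping.
boxed_size = sys.getsizeof(2 ** 40)
print("boxed int:", boxed_size, "bytes; raw int64: 8 bytes")

# Containers pay per element too: each list slot is a pointer to a
# separately heap-allocated object, not an inline machine word.
nums = list(range(1000))
slots = sys.getsizeof(nums)
print("pointer slots alone:", slots / len(nums), "bytes per element")
```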
<p>
The problem is <i>not </i>that Guido van Rossum doesn't care about performance or that Python is badly written! The internals of the interpreter are thoughtfully designed and optimized to eke out speed gains wherever the opportunity presents itself. However, Python's unholy marriage to its object representation seems to make the significant and enduring <a href="http://benchmarksgame.alioth.debian.org/u64/benchmark.php?test=all&lang=python3&lang2=lua&data=u64">performance gap</a> between Python and many other languages inescapable.
</p>
<h2>Didn't PyPy already solve this problem?</h2>
<p>
So, why not cast off the yoke of PyObjects? <a href="http://pypy.org/performance.html">PyPy</a> has already shown that, if you give a JIT compiler the freedom to lay things out however it wants, Python can be made <i>a lot</i> faster. However, all that speed comes at a terrible cost: the loss of compatibility with extension modules that rely on the old Python C API. PyPy is allowed to think of an int as just some bits in a register <i>but </i>your library still expects a big struct, with a type tag and refcount and all. Despite many efforts by the PyPy team to help the rest of us slowpokes transition to their otherwise very impressive system, all the libraries which expect PyObjects are too important to be abandoned (and often too large to be rewritten).
</p>
<p>
Can we do anything to make Python faster without having to give up libraries like SciPy?<br />
</p>
<h2>How about a faster interpreter?</h2>
<p>Earlier this year, my officemate <a href="http://rjpower.org/wordpress/">Russell</a> and I got it into our heads that CPython hadn't reached quite far enough into the bag of interpreter implementation tricks.</p>
<ul>
<li>Why does Python use a stack-based virtual machine when a register-based machine might have <a href="http://dl.acm.org/citation.cfm?id=1328197">lower dispatch overhead</a>?</li>
<li>Why does CPython only perform <a href="https://github.com/akheron/cpython/blob/master/Python/peephole.c">peephole optimizations</a>? Why not use simple dataflow analyses?</li>
<li>Attribute lookups are very repetitive; would some sort of runtime hints/feedback be useful for cutting down on hash function evaluations?</li>
<li>Why not use bit-tagging to store integers directly inside PyObject pointers? It's a <a href="http://stackoverflow.com/questions/3773985/why-is-an-int-in-ocaml-only-31-bits">common technique</a> used in the implementation of other high level languages. Perhaps the Python developers shouldn't have <a href="http://mail.python.org/pipermail/python-dev/2004-July/046139.html">rejected it</a>?</li>
<li>Call frames in Python seem a bit bloated and take a long time to set up, can we make function calls cheaper?</li>
</ul>
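The bit-tagging idea in the list above can be sketched with a toy model. Real implementations tag raw machine words inside the runtime; here plain Python ints stand in for words, with the low bit marking "this word holds an integer, not a pointer":

```python
TAG_BITS = 1
INT_TAG = 1  # low bit set => the word is an integer, not a heap pointer

def box_int(v):
    # Shift the value left one bit and set the tag bit.
    return (v << TAG_BITS) | INT_TAG

def is_int(word):
    return word & INT_TAG == INT_TAG

def unbox_int(word):
    # Arithmetic right shift recovers the value (sign is preserved).
    return word >> TAG_BITS

def tagged_add(a, b):
    # Addition on two tagged ints never touches the heap at all.
    assert is_int(a) and is_int(b)
    return box_int(unbox_int(a) + unbox_int(b))

print(unbox_int(tagged_add(box_int(20), box_int(22))))  # 42
```

The price is one bit of integer range; the payoff is that small-integer arithmetic needs no allocation and no refcounting.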
<div>
To try out these ideas (and a few others), we implemented a new Python interpreter called <a href="https://github.com/rjpower/falcon">Falcon</a>. It's not meant as a complete replacement for Python since Falcon falls back on the Python C API to implement constructs it doesn't support natively. However, through that fallback mechanism, Falcon should theoretically be able to run any Python code (albeit some of it without any performance gain). </div>
<div>
<br /></div>
<div>
How much faster is Falcon? Unfortunately, not nearly as fast as we hoped. In the best cases, such as when lots of integers are being manipulated in a loop, Falcon might get up to 3X faster than the regular Python interpreter. More often, the gains hover around a meager 20%. Still, we found the project interesting enough to write a <a href="http://arxiv.org/abs/1306.6047">paper</a> about it (which we submitted to the <a href="http://www.dynamic-languages-symposium.org/dls-13/">Dynamic Languages Symposium</a>). </div>
<div>
<br /></div>
<p>
If you are interested in trying Falcon, you can get it on <a href="https://github.com/rjpower/falcon">github</a>. Let me know how it works for you!
</p>
<h1>Faster image filters in Python with Parakeet</h1>
<p><i>Alex R, June 4, 2013</i></p>
<div dir="ltr" style="text-align: left;" trbidi="on">
<p>
In the <a href="http://www.icg.tugraz.at/Members/hgrabner/historyOfCV/kropatsch.pdf">early days of computer vision</a> — before Big Data, small data, or even very much data at all — it was popular to manually construct vision algorithms out of <a href="https://en.wikipedia.org/wiki/Neighborhood_operation"><i>neighborhood operations</i></a>.
A neighborhood operation is an image transformation which replaces each pixel with some simple function of the surrounding pixels' values. For example, if you replace each pixel with the <a href="http://en.wikipedia.org/wiki/Median_filter">median of its neighborhood</a>, you get back an image which looks similar but is somewhat less detailed and (usefully) less noisy. Alternatively, instead of the median value, you can replace each pixel with the minimum or maximum of its neighbors.
These two operations turned out to be useful enough to warrant getting their own names.
Windowed maximum, which smears bright parts of the image and makes them grow outward, is called <i><a href="http://scikit-image.org/docs/dev/auto_examples/applications/plot_morphology.html?highlight=erosion#dilation">dilation</a></i>.
Windowed minimum, which makes images look pitted and creepy, is called <i><a href="http://scikit-image.org/docs/dev/auto_examples/applications/plot_morphology.html?highlight=erosion#erosion">erosion</a></i>.
</p>
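In code, a neighborhood operation is just a pair of nested loops plus a reduction over each window. A minimal pure-Python sketch (clamping at the image borders, with the image as a list of lists):

```python
def neighborhood_op(image, k, combine):
    """Replace each pixel with combine() over its (2k+1) x (2k+1) window."""
    h, w = len(image), len(image[0])
    result = [[0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            # Clamp the window to the image bounds instead of padding.
            window = [image[ii][jj]
                      for ii in range(max(i - k, 0), min(i + k + 1, h))
                      for jj in range(max(j - k, 0), min(j + k + 1, w))]
            result[i][j] = combine(window)
    return result

img = [[0, 0, 0],
       [0, 9, 0],
       [0, 0, 0]]
dilated = neighborhood_op(img, 1, max)  # windowed maximum: the bright pixel grows
eroded = neighborhood_op(img, 1, min)   # windowed minimum: the bright pixel vanishes
```

Passing <code>max</code>, <code>min</code>, or a median function as <code>combine</code> yields dilation, erosion, or the median filter respectively.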
<p>
<center>
<table style="text-align:center;" border = "0" cellpadding="2px;" cellspacing="0px">
<tr>
<td>
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhV776xhWqyMOhV8aAsZKYfa8TnMedEPhC4lzCKweH7dNAG1EgypEQwWnsy0BX-7x3636bAenRhcOZmCXN70N1VqeMheKKU8cH7fhROQNzla8akKhQ1slq24fhKTP129pj0Oh1o1SzysIcF/s1600/disgraced_dog.png" imageanchor="1"
style="padding-left: 0.5em; padding-right: 0.5em;">
<img border="0" height="200" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhV776xhWqyMOhV8aAsZKYfa8TnMedEPhC4lzCKweH7dNAG1EgypEQwWnsy0BX-7x3636bAenRhcOZmCXN70N1VqeMheKKU8cH7fhROQNzla8akKhQ1slq24fhKTP129pj0Oh1o1SzysIcF/s200/disgraced_dog.png" width="180" /></a>
</td>
<td>
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhJ9a0vU3Lwpz8Bj7KH7nqO6uHyNp0OVROAwu7pER9xXCunuDO4LAl4V-adjVUo9UbQSE2yDm7KAXqsq-2PyAna0VN8K0Q5gZBlgwZabYg7Mu4-55TeAWLtc-neTktPHfGSN3-z358j6Owu/s1600/disgraced_dog_dilate.png" imageanchor="1"
style="padding-left: 0.5em; padding-right: 0.5em;">
<img border="0" height="200" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhJ9a0vU3Lwpz8Bj7KH7nqO6uHyNp0OVROAwu7pER9xXCunuDO4LAl4V-adjVUo9UbQSE2yDm7KAXqsq-2PyAna0VN8K0Q5gZBlgwZabYg7Mu4-55TeAWLtc-neTktPHfGSN3-z358j6Owu/s200/disgraced_dog_dilate.png" width="180" /></a>
</td>
<td>
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjkt3-ZeRZClwwyh-hgBwSoggCrYZul8BqrDuEX5tTzrNIoFNvXzXYvGmczudXxajeE33Z-396K9GLeXe86q871_yUfEzwnW6i06T12YPwteO4BskyG6QnfxX-LHeSIm8yGunrJPAFa1EDV/s1600/disgraced_dog_erode.png" imageanchor="1"
style="padding-left: 0.5em; padding-right: 0.5em;">
<img border="0" height="200" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjkt3-ZeRZClwwyh-hgBwSoggCrYZul8BqrDuEX5tTzrNIoFNvXzXYvGmczudXxajeE33Z-396K9GLeXe86q871_yUfEzwnW6i06T12YPwteO4BskyG6QnfxX-LHeSIm8yGunrJPAFa1EDV/s200/disgraced_dog_erode.png" width="180" />
</a>
</td>
</tr>
<tr>
<td><i>Dog with low self-regard</i></td>
<td><i>Dilation</i></td>
<td><i>Erosion</i></td>
</tr>
</table>
</center>
</p>
<p>
Using only successive applications of dilation and erosion it is possible to express a wide array of interesting <a href="http://docs.opencv.org/doc/tutorials/imgproc/opening_closing_hats/opening_closing_hats.html">image transformations</a>. The composition of these two operators was considered important enough to warrant becoming a field unto itself called <a href="http://en.wikipedia.org/wiki/Mathematical_morphology">mathematical morphology</a>. Morphological image transformations are still widely used for preprocessing, feature extraction, and if you squint you'll even see their likeness in the <a href="http://deeplearning.net/tutorial/lenet.html#maxpooling">max-pooling</a> operation used for subsampling by <a href="http://videolectures.net/machine_krizhevsky_imagenet_classification/">convolutional neural networks</a>.
</p>
<h2>Python Implementation</h2>
<p>
Now that I've hopefully convinced you of their usefulness, let's see how to implement morphological image filters in Python. As a first cut, let's just loop over all the positions <i>(i,j)</i> in an image and at each pixel we'll loop over the surrounding <i>k</i>×<i>k</i> square of pixels to pick out the maximum. Technically, we could use a more complicated neighborhood (i.e. a surrounding circle of pixels or really any shape at all), but the humble square will do for this post.
</p>
<script src="https://gist.github.com/iskandr/5711019.js"></script>
<p>
Simple enough, but how quickly does it run? Anyone who has tried writing numerically intensive code in pure Python knows the answer will probably fall somewhere between "not well" and "would have been faster on pen and paper". Running this algorithm on a 1024×768 image with a 7×7 neighborhood on my MacBook Air takes almost half a minute. Pathetic!</p>
<p>Python programmers have become trained to reach for a compiled library whenever they encounter a significant performance bottleneck. In this case, SciPy comes with a full suite of <a href="http://docs.scipy.org/doc/scipy/reference/ndimage.html#module-scipy.ndimage.morphology">morphological operators</a> that have been implemented efficiently in C. Performing the same dilation using SciPy takes only <strong>34 milliseconds</strong>. That's an approximately 750X performance gap between Python and native code.</p>
<h2>Runtime Compilation à la Parakeet</h2>
<p>
So, Python's <a href="http://en.wikipedia.org/wiki/CPython">bytecode interpreter</a> is hopelessly slow, but is there any alternative to relying on lower-level languages for performance? In my ideal world, all of the code I write would sit comfortably inside a <code>.py</code> file. Toward this end, I've been working on <a href="https://github.com/iskandr/parakeet">Parakeet</a>, a runtime compiler which uses Python functions as a template for dynamically generated native code. By specializing functions for distinct argument types and then performing high-level optimizations on the typed code, Parakeet is able to run code orders of magnitude faster than Python. It's important to note that Parakeet is <i>not</i> capable of speeding up or even executing arbitrary programs — it's really a runtime compiler for an array-oriented subset of Python. Luckily, all of the operations used to implement dilation (array indexing, looping, etc.) fit within this subset.
</p>
<p>
To get Parakeet to accelerate a Python function, you wrap that function in Parakeet's <code>@jit</code> decorator. Once wrapped:
<ul>
<li>Any calls to the function get routed through Parakeet's type specializer, which creates a different version of the function for each distinct combination of input types.</li>
<li>Each typed version of the function is then extensively optimized and turned into native code using <a href="http://stackoverflow.com/a/890430">LLVM</a>.</li>
</ul>
</p>
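The specialization step can be illustrated with a toy decorator. This is <i>not</i> Parakeet's actual machinery (the name <code>specializing_jit</code> and its internals are invented for illustration); it only sketches the idea of caching one compiled version per tuple of argument types:

```python
def specializing_jit(fn):
    """Toy stand-in for @jit: one 'specialized' version per type signature."""
    cache = {}

    def compile_specialized(signature):
        # A real JIT would infer types here and emit native code via LLVM;
        # this sketch just records which signature it specialized for.
        def specialized(*args):
            return fn(*args)
        specialized.signature = signature
        return specialized

    def wrapper(*args):
        signature = tuple(type(a) for a in args)
        if signature not in cache:          # first call with these types
            cache[signature] = compile_specialized(signature)
        return cache[signature](*args)

    wrapper.cache = cache
    return wrapper

@specializing_jit
def add(x, y):
    return x + y

add(1, 2)       # first (int, int) call triggers "compilation"
add(3, 4)       # reuses the cached (int, int) version
add(1.0, 2.0)   # new signature, new specialization
print(len(add.cache))  # two specializations now live in the cache
```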
<p>
If I stick an <code>@jit</code> decorator above <code>dilate_naive</code>, I find that it now runs in only <strong>51 milliseconds</strong>. That is, almost 500X faster than Python, though still a lot slower than the SciPy implementation. Not bad for code that at least looks like it's written in Python.
</p>
<h2>Can We Get Faster Than SciPy?</h2>
<p>
If you poke around the <a href="https://github.com/scipy/scipy/blob/v0.12.0/scipy/ndimage/morphology.py#L1277">source of SciPy morphology operators</a>, you'll discover that SciPy doesn't actually use the naive algorithm above. SciPy's sagacious authors took note of the fact that windowed minima/maxima are <a href="http://www.cb.uu.se/~cris/blog/index.php/archives/288">separable filters</a>, meaning they can be computed more efficiently by first performing 1D transformations on all the rows of the image, followed by 1D transformations on all the columns. Below is a Python implementation of this two-pass dilation which only inspects <code>2k</code> neighboring pixels per output pixel, unlike the less efficient code above which has to look at <code>k<sup>2</sup></code> neighbors.
</p>
<script src="https://gist.github.com/iskandr/5711014.js"></script>
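The two-pass idea is easy to sketch in pure Python (clamped borders, list-of-lists image as before; a toy illustration of the separable algorithm rather than the benchmarked code):

```python
def dilate_rows(image, k):
    # Pass 1: 1D windowed maximum along each row.
    h, w = len(image), len(image[0])
    out = [[0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            lo, hi = max(j - k, 0), min(j + k + 1, w)
            out[i][j] = max(image[i][lo:hi])
    return out

def dilate_cols(image, k):
    # Pass 2: 1D windowed maximum along each column.
    h, w = len(image), len(image[0])
    out = [[0] * w for _ in range(h)]
    for j in range(w):
        for i in range(h):
            lo, hi = max(i - k, 0), min(i + k + 1, h)
            out[i][j] = max(image[ii][j] for ii in range(lo, hi))
    return out

def dilate_two_pass(image, k):
    # max over a square window == max over rows, then max over columns
    return dilate_cols(dilate_rows(image, k), k)
```

Each output pixel now inspects roughly 2(2k+1) neighbors across the two passes, rather than (2k+1)<sup>2</sup> in the naive version.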
<p>When I wrap this version with Parakeet's <code>@jit</code> decorator, it only takes <strong>17 milliseconds</strong>. That's 2X faster than the precompiled SciPy version!
</p>
<h2>Comparison with Numba and PyPy</h2>
<p>
Out of curiosity, I wanted to see how Parakeet's performance compares with the better known projects that are attempting to accelerate Python: <a href="https://github.com/numba/numba">Numba</a> and <a href="http://www.ibm.com/developerworks/library/os-pypy-intro/">PyPy</a>.
</p>
<p>Numba is a runtime compiler which also aims to speed up numerical code by type-specializing functions for distinct argument types. Like Parakeet, Numba uses LLVM for code generation on the back end and aims to speed up algorithms which consume and manipulate NumPy arrays. Numba is more ambitious in that it seeks to support all of Python, whereas Parakeet carves out a subset of the language which is amenable to efficient compilation.</p>
<p>Switching from Parakeet to Numba at first seemed simple, merely a matter of replacing <code>@parakeet.jit</code> with <code>@numba.autojit</code>. However, though the dilation benchmark managed to execute successfully, it was slower than CPython! To add insult to injury, Numba's compilation times were an order of magnitude slower than Parakeet. Since I know that <a href="http://www.continuum.io/">Numba's authors</a> are no chumps, I emailed <a href="https://github.com/markflorisson88">Mark Florisson</a> to figure out how I was misusing their compiler. </p>
<p>It turns out that when Numba is confronted with any construct it doesn't support directly, it switches to a (very) slow code path. From my perspective, it seemed like it had taken a half-day off and gone for a long lunch. Parakeet is very different in this regard: if Parakeet can't execute something efficiently it complains loudly during compilation and gives up.
</p>
<p>In this case, it was the <code>min</code> and <code>max</code> builtins which were giving Numba trouble. Once I changed the filter implementation to use an inline if-expression instead — <code>xrange(stop_i if stop_i >= m else m)</code> instead of <code>xrange(min(stop_i, m))</code> — then Numba ran with compile and execution times uncomfortably close to Parakeet's. I'm feeling the heat.</p>
<p>PyPy differs dramatically from both Parakeet and Numba in that, rather than generating native code from within the Python interpreter, it is a completely distinct implementation of the Python language. PyPy uses <a href="http://dl.acm.org/citation.cfm?id=1565827">trace compilation</a> to generate native code. Getting numerical code to run in PyPy can be tricky since it currently only supports a rudimentary subset of NumPy via a reimplementation called <a href="http://morepypy.blogspot.com/2012/01/numpypy-status-update.html">NumPyPy</a>. The project seems to be making steady progress, but due to the daunting scope of NumPy, there's still a lot of basic functionality missing.
</p>
<p>The source for the timings below is in the <a href="https://github.com/iskandr/parakeet/blob/master/examples/morphology.py">Parakeet repository</a>. Parakeet and Numba both perform type specialization and compilation upon the first function invocation, so that time is shown below separately from the execution time of a second function call. CPython's bytecode compilation is so fast as to be negligible, so I didn't even attempt to time it. Finding out how long PyPy takes to generate code from a hot path would be interesting but I have no idea how to access that kind of information, so I also left PyPy's compile times as "<i>n/a</i>". </p>
<center>
<table border="0px" cellpadding="10" cellspacing="2">
<tbody>
<tr>
<td></td>
<td style="background-color: pink;">Algorithm</td>
<td style="background-color: pink;">Compile Time</td>
<td style="background-color: pink;">Execution Time</td>
</tr>
<tr>
<td><b>SciPy</b></td>
<td>Separable O(k)</td>
<td><i>n/a</i></td>
<td>34ms</td>
</tr>
<tr>
<td><b>CPython</b></td>
<td>Naive O(k<sup>2</sup>)</td>
<td><i>n/a</i></td>
<td>25,480ms</td>
</tr>
<tr>
<td><b>PyPy</b></td>
<td>Naive O(k<sup>2</sup>)</td>
<td><i>n/a</i></td>
<td>657ms</td>
</tr>
<tr>
<td><b>Numba</b></td>
<td>Naive O(k<sup>2</sup>)</td>
<td>286ms</td>
<td>61ms</td>
</tr>
<tr>
<td><b>Parakeet</b></td>
<td>Naive O(k<sup>2</sup>)</td>
<td>180ms</td>
<td>51ms</td>
</tr>
<tr>
<td><b>CPython</b></td>
<td>Separable O(k)</td>
<td><i>n/a</i></td>
<td>7,724ms</td>
</tr>
<tr>
<td><b>PyPy</b></td>
<td>Separable O(k)</td>
<td><i>n/a</i></td>
<td>429ms</td>
</tr>
<tr>
<td><b>Numba</b></td>
<td>Separable O(k)</td>
<td>407ms</td>
<td>19ms</td>
</tr>
<tr>
<td><b>Parakeet</b></td>
<td>Separable O(k)</td>
<td>238ms</td>
<td>17ms</td>
</tr>
</tbody></table>
</center>
<p>
Parakeet wins on performance over Numba by the thinnest margin. When Mark (the Numba developer) ran the same benchmark on a different computer, he actually saw Numba coming in slightly ahead. Maybe the difference doesn't even rise above the level of statistical noise? Either way, both Numba and Parakeet seem like reasonable choices for generating type-inferred native code from Python functions.
</p>
<p>A crucial distinction between the two projects is that Parakeet has been designed as a <a href="http://en.wikipedia.org/wiki/Domain-specific_language">domain specific language</a>. Parakeet is an array-oriented language embedded within Python, but it is not and never will be Python itself. If you wanted to, for example, create user-defined objects inside of compiled code, with Parakeet you're out of luck. The advantage of this approach is that Parakeet can guarantee that whatever it compiles will have reasonably good performance. Numba, on the other hand, can technically execute arbitrary Python but still has some implicit language boundaries demarcating what will or won't run efficiently.
</p>
<h2>Whither Parallelism?</h2>
<p>
A nice property of image filtering algorithms that I have completely ignored in this post is that they are usually perfectly parallelizable. Parakeet started out as a <i>parallelizing</i> compiler and only recently got stripped down to generating single core code. Parallelism is coming back in a cleaner form, so the next post will hopefully be about doing image filtering in Parakeet an order of magnitude (or two) faster.
</p>
<p>
For now, you can <a href="https://github.com/iskandr/parakeet">try out</a> Parakeet and it might speed up your code. Or it might fail mysteriously — it's still a work in progress and you've been warned!
</p>
</div>
<h1>A transformative journey from Python to native code</h1>
<p><i>Alex R, January 27, 2013</i></p>
<div dir="ltr" style="text-align: left;" trbidi="on">
<p>
When
<a href="http://stackoverflow.com/questions/2933434/strengths-and-weaknesses-of-jit-compilers-for-python">somebody</a>
offers to compile your Python code, exactly what kind of mischief are you getting yourself into?
What
<a href="http://playingwithpointers.com/archives/1010">diabolical</a>
schemes does a
<a href="http://stackoverflow.com/questions/95635/what-does-a-just-in-time-jit-compiler-do">just-in-time compiler</a>
enact to transmute sluggish Python code into something speedier?
</p>
<p>
Toward the end of my
<a href="http://www.phi-node.com/2013/01/a-journey-through-parakeet.html">last post</a>,
I mentioned that I'm working on a library called
<a href="https://github.com/iskandr/parakeet">Parakeet</a>
which accelerates numerical Python. In this post, I'm going to illuminate the mysterious inner workings of a just-in-time compiler by following a function through its various stages of existence within Parakeet. </p>
<p>
<i>Caveat: Parakeet isn't finished, and it's awkward to write so extensively about something I don't yet want anyone using. Nonetheless, Parakeet is the compiler I know best and its relatively simple design will hopefully allow me to convey a general sketch of how JITs work.
</i>
</p>
<p>
The function we're going to radically rearrange today is <code>count_thresh</code>, which counts the number of elements in an array that fall below a given threshold.
</p>
<script src="https://gist.github.com/4669559.js"></script>
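A minimal version of <code>count_thresh</code>, matching that description (the gist may differ in minor details):

```python
def count_thresh(values, thresh):
    # Count the elements strictly below the threshold.
    n = 0
    for x in values:
        if x < thresh:
            n = n + 1
    return n

count_thresh([0.1, 0.5, 0.9, 0.3], 0.4)  # only 0.1 and 0.3 fall below 0.4
```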
<p>This little ditty of looping and summation is simple enough that I hope its relationship to the code we later generate will stay evident. It's not, however, an entirely contrived computation. If you were to throw in an array of class labels and a dash more complexity, you would soon have the core of a <a href="http://scikit-learn.org/dev/modules/tree.html#mathematical-formulation">decision tree learning algorithm</a>.
</p>
<!--
<p>Now, before break <code>count_thresh</code> down into a lean, mean, faster-than-Python number crunching machine, let's first look at how the original gets executed within Python.</p>
<p> <i>Note: there are actually a <a href="http://wiki.python.org/moin/PythonImplementations#Working_Implementations">multitude</a> of competing Python implementations but when I write "Python" I mean the standard <a href="http://www.youtube.com/watch?v=XGF3Qu4dUqk">CPython</a> interpreter.</i>
</p>
<p>A reminder from <i>Compilers 101</i>: The first thing that happens to a function on its way to being executed is <a href="http://en.wikipedia.org/wiki/Parsing#Programming_languages">parsing</a>. The source of <code>count_thresh</code> gets read as a string of characters and then, after the tedious business of <a href="http://docs.python.org/2/library/tokenize.html">tokenization</a>, turned into a structured syntax tree:
</p>
<script src="https://gist.github.com/4669223.js"></script>
<p>
A naive interpreter would then <a href="http://norvig.com/lispy.html">execute the syntax tree directly</a>. Python achieves a minor performance boost by instead compiling to a more compact <a href="http://nedbatchelder.com/blog/200804/the_structure_of_pyc_files.html">bytecode</a>:
</p>
<script src="https://gist.github.com/4669581.js"></script>
<p>
The relative
<a href="http://lambda-the-ultimate.org/node/2884#comment-42662">inefficiency </a>
of tree-walking interpreters (those which evaluate syntax trees) is one of the reasons that Ruby has generally been
<a href="http://benchmarksgame.alioth.debian.org/u32/benchmark.php?test=all&lang=yarv&lang2=python3">slower</a> than Python. Though an improvement over a naive interpreter, trying to execute this bytecode directly is still a performance nightmare. If you inspect the <a href="http://svn.python.org/view/python/trunk/Python/ceval.c?view=markup">behavior</a> of the above <a href="http://docs.python.org/2/library/dis.html#bytecodes">instructions</a>, you'll discover that they are a rat's nest of repetitive <a href="http://stackoverflow.com/a/25324">un-boxing and re-boxing of numeric values</a> in and out of their PyObject wrappers, wasteful stack manipulation, and all sorts of other despicably laggardly behavior. If we're going to significantly speed up the numerical performance of Python code, it's going to have run somewhere other than the Python bytecode interpreter.
</p>
<br/>
<h3>Compile a whole function, or just a trace?</h3>
<p>
There are many different approaches to run-time compilation, but they generally fall into two categories: <i>function specialization</i> and <i>tracing</i>. A function-specializing JIT enters a function with some knowledge or assumption about its inputs and propagates that information to elide unused dynamicism and perform local optimizations: <i>"your inputs are both integers— what else then has to be an integer?"</i>. A specialized function is more amenable for translation into efficient lower-level code. The major challenges with function-specializing compilers are (1) <a href="http://en.wikipedia.org/wiki/Inline_caching">managing different specialized versions of functions</a> and (2) propagating information from the inputs to local variables.
</p>
<p>Isn't it arbitrary, though, to treat function boundaries as the basic unit of compilation? And why are we trying to compile the <i>entire</i> function when most of our execution time will probably be spent inside some handful of tight loops? This line of thinking leads to <a href="http://en.wikipedia.org/wiki/Tracing_just-in-time_compilation">tracing</a> JITs which, as their name suggests, trace the exact instructions your function executes and then selectively compiles traces deemed likely to run often in the future. An advantage of tracing is that there is no need for any sort of inference/propagation to uncover optimization opportunities: every instruction in the trace comes with perfect knowledge of which values/types it encountered. The disadvantages of trace-based compilation are (1) <a href="http://lambda-the-ultimate.org/node/3851#comment-57679">trace management</a>, (2) <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=580468">heuristics</a> for when to compile a trace, and (3) the need to implement a <a href="http://playingwithpointers.com/archives/1010">fast interpreter</a> so that code which doesn't get compiled still has acceptable baseline performance.
</p>
<br/>
<h3>So, what does Parakeet do?</h3>
<p>
-->
<p>
Compared with the <a href="http://blog.cdleary.com/2011/06/mapping-the-monkeysphere/">wild menagerie</a> of run-time compilation techniques that have developed over the past decade, Parakeet is a relatively modest function-specializing compiler. If you want Parakeet to compile a particular function, then wrap that function with the <code>@jit</code>
<a href="http://www.siafoo.net/article/68">decorator</a>.
Like this:
</p>
<!-- Python code w/ @jit -->
<script src="https://gist.github.com/4674729.js"></script>
<p>
The job of <code>@jit</code> is to intercept calls into the wrapped function and then to initiate the following chain of events:
<ul>
<li>Translate the function into an untyped representation, from which we'll later derive multiple type specializations.</li>
<li>Specialize the untyped function for any argument types which get passed in.</li>
<li>Optimize the merciless heck out of the typed code, and translate abstractions such as tuples and n-dimensional arrays into simple heap-allocated structures with low-level access code.</li>
<li>Translate the optimized and lowered code into <a href="http://llvm.org/">LLVM</a>, and let <a href="http://nondot.org/sabre/">someone else</a> worry about <a href="http://llvm.org/docs/Passes.html#transform-passes">lower-level optimizations</a> and how to generate architecture-specific native code.</li>
</ul>
An eager reader may be thinking:<br />
<blockquote>
I can just stick that decorator atop any Python function and it will magically run faster? Great! I'll paste @jit all over my code and my Python performance problems will be solved!
</blockquote>
<p>
Easy with those decorators! Parakeet is <i>not</i> a general-purpose compiler for all of Python.
Parakeet only supports a handful of Python's data types: <a href="http://docs.python.org/2/library/stdtypes.html#typesnumeric">numbers</a>, <a href="http://infohost.nmt.edu/tcc/help/pubs/python/web/tuple-type.html">tuples</a>, <a href="http://docs.python.org/2.7/glossary.html#term-slice">slices</a>, and NumPy <a href="http://www.scipy.org/Cookbook/BuildingArrays">arrays</a>.
</p>
<p>
To manipulate these values, Parakeet lets you use any of the usual math and logic operators, along with some, but not all, of the <a href="http://docs.python.org/2.7/library/functions.html">built-in functions</a>. Functions such as <code><a href="http://docs.python.org/release/1.5.1p1/tut/range.html">range</a></code> are compiled to deviate from their usual behavior — in Python their result would be a list but in Parakeet such functions create NumPy arrays.
</p>
<!--
To make amends for taking away so many of your favorite language's features, Parakeet also gives you a handful of <i><a href="http://www.drdobbs.com/structured-patterns-an-overview/223101515">data parallel operators</a></i> which we use implement NumPy array expressions, enable high-level optimizations and (as the name suggests) parallel execution.
-->
<p>
If your performance <a href="http://www.huyng.com/posts/python-performance-analysis/">bottleneck</a> doesn't fit neatly into Parakeet's restrictive universe then you might benefit from a faster <a href="http://speed.pypy.org/">Python implementation</a>, or alternatively you could outsource some of your functionality to native code via <a href="http://cython.org/">Cython</a>.
</p>
<p>
Let's continue, following <code>count_thresh</code> on its inexorable march toward efficiency.
</p>
<br/>
<h3>From Python into Parakeet</h3>
<p>When trying to extract an executable representation of a Python function, we face a <a href="http://librelist.com/browser//numba/2012/7/2/trying-to-understand-the-bytecode-vs-ast-business/">choice</a> between using its <a href="http://docs.python.org/2.7/library/ast.html#abstract-grammar">syntax tree</a>: </p>
<!-- AST -->
<script src="https://gist.github.com/4669223.js"></script>
<p>...or the lower-level <a href="http://nedbatchelder.com/blog/200804/the_structure_of_pyc_files.html">bytecode</a> which the Python interpreter actually executes:</p>
<script src="https://gist.github.com/4669581.js"></script>
<!-- Bytecode -->
<p>
Neither is ideal for program analysis and transformation, but since the bytecode is littered with distracting stack manipulation and discards some of the higher-level language constructs, let's start with a syntax tree and quickly slip into something a little more <a href="https://github.com/iskandr/parakeet/blob/master/parakeet/syntax.py">domain specific</a>.
</p>
<br/>
<h3>Untyped Representation</h3>
<p>
<p>When a function is handed over from Python into Parakeet, it is translated into a form that is mostly similar to an ordinary Python syntax tree. There are, however, a few key differences:
<ul>
<li>For-loops must traverse numeric ranges, so <code>for x in xs</code> gets translated into <code>for i in range(len(xs)): x = xs[i]</code></li>
<li>There is a suspicious looking <code>phi</code> expression at the top of the loop. What is that thing? Does it have anything to do with the name of this site?</li>
</ul>
</p>
<!-- Parakeet untyped -->
<script src="https://gist.github.com/4669594.js"></script>
<p>
Looking even closer at the code above, you'll notice the variable <code>n</code> has been split apart into three distinct names: <code>n</code>, <code>n2</code> and <code>n_loop</code>. What can account for such triplicative witchery?
</p>
<p>
Calm your agitation, dear reader. You're looking at a variant of <a href="http://en.wikipedia.org/wiki/Static_single_assignment_form">Static Single Assignment form</a>. I'll write about SSA in more detail later, but for now the most important things to know about it are:
<ul>
<li>Every distinct assignment to a variable in the original program becomes the creation of a distinct variable. Reminiscent of <a href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.34.3282">functional programming</a>, no?</li>
<li>At a point in the program where control flow could have come from multiple places (such as the top of a loop), we explicitly denote the possible sources of a variable's value using a φ-node.</li>
<li>In exchange for all these variable name gymnastics we get a tremendous simplification in the onerous task of writing program analyses. It may not be immediately obvious why, but this post is already too long, so trust me for now.</li>
</ul>
</p>
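<p>To make the φ-node concrete, here is a pseudocode sketch of the counter variable from <code>count_thresh</code> in SSA form. The naming mirrors the gist above, but this is an illustration rather than Parakeet's exact output:</p>

```
n = 0                        # single definition before the loop
loop:
  n_loop = phi(n, n2)        # n on first entry into the loop,
                             # n2 on every subsequent iteration
  n2     = n_loop + (elt < thresh)
```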
<p>
Another difference from Python is that Parakeet's representation treats many array operations as <a href="http://stackoverflow.com/questions/646794/what-is-a-first-class-programming-construct">first-class</a> constructs. For example, in ordinary Python <code>len</code> is a library function, whereas in Parakeet it's actually part of the language syntax and thus can be analyzed with higher-level knowledge of its behavior. This is particularly useful for inferring the shapes of intermediate array values.
</p>
<br/>
<h3>Type-specialized Representation</h3>
<p>
When you call an untyped Parakeet function, it gets cloned for each distinct set of input types. The types of the other (non-input) variables are then inferred and the body of the function is rewritten to insert casts wherever necessary.
</p>
<!-- Parakeet typed -->
<script src="https://gist.github.com/4669615.js"></script>
<p>
In the case of <code>count_thresh</code>, observe that the function has been specialized for two inputs of type <code>array1(float64)</code> and <code>float64</code> and that its return type is known to be <code>int64</code>. Furthermore, the boolean intermediate value produced by checking if an element is less than the threshold is cast to <code>int64</code> before getting added to <code>n2</code>.
</p>
<p>If you use a variable in a way that defeats <a href="http://en.wikipedia.org/wiki/Type_inference">type inference</a> (for example, by treating it sometimes as an array and other times as a scalar), then Parakeet gives up on your code and raises an error.
</p>
<br/>
<h3>Optimize mercilessly!</h3>
<p>
Type specialization already gives us a big performance boost by enabling the use of an <a href="http://www.haskell.org/ghc/docs/6.12.3/html/users_guide/primitives.html#glasgow-unboxed">unboxed</a> representation for numbers. Adding two floats stored in registers is orders of magnitude faster than calling <a href="http://docs.python.org/2/reference/datamodel.html#object.__add__">__add__</a> on <a href="http://docs.python.org/2/c-api/float.html?highlight=float#PyFloatObject">PyFloatObjects</a>.
</p>
<p>
However, if all Parakeet did was specialize your code it would still be significantly slower than programming in a lower-level language. The compiler needs to exert more effort to contort and transform array-oriented Python code into the lean mean loops you would expect to get from a C compiler. Parakeet attacks sluggish code with the usual battery of standard optimizations, such as <a href="http://en.wikipedia.org/wiki/Constant_folding">constant propagation</a>,
<a href="http://en.wikipedia.org/wiki/Common_subexpression_elimination">common sub-expression elimination</a>, and <a href="http://en.wikipedia.org/wiki/Loop-invariant_code_motion">loop invariant code motion</a>. Furthermore, to mitigate the <a href="http://metamatix.org/~ocaml/price-of-abstraction.html">abstraction cost</a> of array expressions such as <code>0.5*vec1 + 0.5*vec2</code> Parakeet <a href="http://en.wikipedia.org/wiki/Loop_fusion">fuses</a> array operators, which then exposes further opportunities for optimization.
</p>
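<p>Fusion is easiest to see on the <code>0.5*vec1 + 0.5*vec2</code> example from above. In plain Python terms (an illustration of the transformation, not Parakeet code):</p>

```python
vec1 = [1.0, 2.0, 3.0]
vec2 = [4.0, 5.0, 6.0]

# Unfused: two temporary arrays and three passes over the data.
tmp1 = [0.5 * a for a in vec1]
tmp2 = [0.5 * b for b in vec2]
unfused = [a + b for a, b in zip(tmp1, tmp2)]

# Fused: a single pass and no temporaries.
fused = [0.5 * a + 0.5 * b for a, b in zip(vec1, vec2)]
```

The fused version touches each element only once, and the combined scalar expression is then exposed to further optimization.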
<p>
In this case, however, the computation is simple enough that only a few optimizations can meaningfully change it. I turned off <a href="http://en.wikipedia.org/wiki/Loop_unwinding">loop unrolling</a> for this post since it significantly expands the size of the produced code.
</p>
<!-- Parakeet lowered -->
<script src="https://gist.github.com/4669628.js"></script>
<p>
In addition to rewriting code for performance gain, Parakeet also "lowers" higher-level constructs such as tuples and arrays into more primitive concepts. Notice that the above code does not directly index into n-dimensional arrays, but rather explicitly computes offsets and indexes directly into an array's <a href="http://stackoverflow.com/questions/4355524/getting-data-from-ctypes-array-into-numpy">data pointer</a>.
Lowering complex language constructs simplifies the next stage of program transformation: the escape from Parakeet into LLVM.
</p>
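<p>Concretely, an n-dimensional array access lowers into arithmetic on a flat data pointer. A Python sketch of a 2-D load (strides here are measured in elements for simplicity; NumPy's actual strides are in bytes):</p>

```python
def load_2d(data, strides, i, j):
    # a[i, j] becomes a load at a computed flat offset.
    return data[i * strides[0] + j * strides[1]]

# A 3x4 row-major array flattened into a list:
flat = list(range(12))
strides = (4, 1)  # moving down one row skips 4 elements
```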
<br/>
<h3>LLVM</h3>
<p>
<a href="http://llvm.org/">LLVM</a> is a delightfully well-engineered compiler toolkit that comes with a powerful arsenal of optimizations and generates native code for a variety of platforms. To get LLVM to finish the job of compiling <code>count_thresh</code>, we need to translate Parakeet's representation into LLVM's <a href="http://ujihisa.blogspot.com/2009/12/llvm-for-starters.html">assembly language</a>. Once the Parakeet representation has been typed, optimized, and stripped clean of abstractions, the <a href="https://github.com/iskandr/parakeet/blob/master/parakeet/llvm_backend.py">translation to LLVM</a> turns out to be surprisingly easy. Sure, there's some plumbing work to map between Parakeet's types and <a href="http://llvm.org/docs/LangRef.html#typesystem">LLVM's type system</a>, but that's probably the most straightforward part of this whole pipeline.
</p>
<!-- LLVM -->
<script src="https://gist.github.com/4669634.js"></script>
<br/>
<h3>Generated Assembly</h3>
<p>
Once we pass the torch to LLVM, Parakeet's job is mostly done. LLVM chisels away at the code we've given it with its bevy of optimization passes. Once every last inefficiency has been ferreted out and exterminated, LLVM uses a platform-specific back-end to translate from its assembly language into native instructions. And thus, at last, we arrive at native code:
</p>
<!-- x86 -->
<script src="https://gist.github.com/4669647.js"></script>
<p>
Reading <a href="http://en.wikibooks.org/wiki/X86_Assembly/Basic_FAQ">x86-64 assembly</a> is tedious, so I won't expect you to make sense of this code dump. But do notice that we end up with the same number of machine instructions as we originally had Python bytecodes. It's safe to suspect that the performance might have somewhat improved.
</p>
<h3>How much faster is it?</h3>
<p>
In addition to benchmarking against the Python interpreter (an unfair comparison with a predictable outcome), let's also see how Parakeet stacks up against an equivalent function implemented using NumPy primitives:
</p>
<script src="https://gist.github.com/4674879.js"></script>
<p>
I gave the NumPy, Python, and Parakeet versions of <code>count_thresh</code> 1 million randomly generated inputs and averaged the time they took to complete over 5 runs.
</p>
<center>
<table border="1" padding="2" cellpadding="5">
<tr>
<th style="font-style:italic; background-color:#eee;">Python</th>
<th style="font-style:italic; background-color:#eee;">NumPy</th>
<th style="font-style:italic; background-color:#eee;">Parakeet</th>
</tr>
<tr>
<td style="text-align:center;">3.7205</td>
<td style="text-align:center;">0.0036</td>
<td style="text-align:center;">0.0025</td>
</tr>
<tr>
<td colspan="3">
<small>
<i>
Execution time in seconds
</i>
</small>
</td>
</tr>
</table>
</center>
<p>
Not bad — Parakeet is about 1500 times faster than <a href="http://en.wikipedia.org/wiki/CPython">vanilla Python</a> and even manages to edge out NumPy by a safe margin. Still, that NumPy code is tantalizingly more compact than the explicit loop we've been working with throughout this post.</p>
<p> Can Parakeet compile something that looks more like the NumPy version of <code>count_thresh</code>? In fact yes, you can (and are encouraged to) write code in a <a href="http://www.vetta.org/2008/05/scipy-the-embarrassing-way-to-code/">high-level array-oriented</a> style with Parakeet. However, an explanation of how such code gets compiled (and parallelized) will have to wait until I discuss Parakeet's <i>data parallel operators</i> in the next post.
</p>
</div>Alex Rhttp://www.blogger.com/profile/04049149415019007603noreply@blogger.com0tag:blogger.com,1999:blog-4295204633719044289.post-58505129018947949012013-01-24T17:29:00.000-05:002013-10-01T15:13:37.546-04:00Just-in-time compilers for number crunching in Python<div dir="ltr" style="text-align: left;" trbidi="on">
Python is an extremely popular language for <a href="http://andy.terrel.us/blog/2012/09/27/starting-with-python/">number crunching</a> and data <a href="http://blog.wesmckinney.com/">analysis</a>. For someone who isn't familiar with the scientific Python ecosystem this might be surprising, since Python is actually <a href="http://andy.terrel.us/blog/2012/09/27/starting-with-python/">orders of magnitude slower</a> at simple numerical operations than most lower level languages. If you need to do some repetitive arithmetic on a large collection of numbers, then ideally those numbers would be stored <a href="http://stackoverflow.com/questions/12065774/why-does-cache-locality-matter-for-array-performance">contiguously</a> in memory, loaded into registers in small groups, and acted upon by a small set of machine instructions. The Python "interpreter" (<a href="http://tech.blog.aknin.name/category/my-projects/pythons-innards/">actually</a> a <a href="http://shape-of-code.coding-guidelines.com/2009/09/17/register-vs-stack-based-vms/">stack-based</a> virtual machine), however, uses a very bulky <a href="http://eli.thegreenplace.net/2012/04/16/python-object-creation-sequence/">object representation</a>. Furthermore, Python's dynamism introduces a lot of indirection around simple operations like getting the value of a field or multiplying two numbers. Every time you run some innocuous looking code such as
<code style="background-color: #eeeef3;">x[i] = math.sqrt(y[i] * z.imag)</code>
a shocking host of dictionary look-ups, allocations, and all-around wasteful computations kicks into gear.<br />
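You can watch some of this machinery spin up with the standard <code>dis</code> module:

```python
import dis
import math

def f(x, y, z, i):
    x[i] = math.sqrt(y[i] * z.imag)

# One innocuous line expands into a pile of generic stack-machine
# instructions: attribute lookups, subscript loads, a dynamic call...
dis.dis(f)
```

Each of those opcodes dispatches through the interpreter loop, boxing and unboxing numbers along the way.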
<br />
The trick, then, to getting good numerical performance from Python is to avoid <i>really</i> doing your work in Python. Instead, you should use Python's remarkable capacity as a <a href="http://docs.scipy.org/doc/numpy-dev/user/c-info.python-as-glue.html">glue language</a> to coordinate calls between highly optimized lower-level numerical libraries. This is why a certain <a href="http://astro.berkeley.edu/people/faculty/bloom.html">handsome astrophysicist</a> called Python "<a href="http://www.youtube.com/watch?v=mLuIB8aW2KA&feature=youtu.be">the engine of modern science</a>". <a href="http://www.numpy.org/">NumPy</a> plays an extremely important role in enabling this sticky style of programming by providing a high-level Pythonic interface to an <a href="http://www.haskell.org/haskellwiki/Arrays#Unboxed_arrays">unboxed array</a> that can be passed easily into precompiled C and Fortran libraries.<br />
<br />
In order to benefit from NumPy and its vast ecosystem, your algorithm must spend most of its time performing some common operation for which someone has already written an efficient library. Want to multiply some matrices? Great news, calling <a href="http://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms">BLAS</a> from Python isn't really much slower than doing it from C. Need to perform a large convolution? No problem, just hop on over to the <a href="http://en.wikipedia.org/wiki/Convolution_theorem">frequency domain</a> with a call to the always zippy <a href="http://www.fftw.org/">FFTW</a>.<br />
<br />
But disaster and tribulation: What if no one has yet written a library that does the heavy lifting that I need? The standard solutions all boil down to "implement the bottleneck in C" (or if you're feeling enlightened, <a href="http://cython.org/">Cython</a>).<br />
<br />
Is a different way possible? Must we sacrifice all our abstractions to get performance? Even if we give up all the niceties of Python, we'll still probably churn out some fairly naive native code that woefully underutilizes our computers' capabilities. Think of all those pitifully empty vector registers, despondently idle extra cores, and a swarm of GPU shader units which haven't seen a general purpose computation in weeks. Harnessing all that parallelism from a low-level language requires, for most tasks, a heroic effort.<br />
<br />
A worthy challenge is then issued in two parts:<br />
<br />
<ul style="text-align: left;">
<li>Find a way to accelerate a meaningfully expressive subset of Python, such that it's possible to still use convenient abstractions without a large runtime cost. This generally implies a <a href="http://stackoverflow.com/questions/95635/what-does-a-just-in-time-jit-compiler-do">just-in-time</a> compiler of some sort (though a few <a href="http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.95.3786">notable</a> <a href="http://nuitka.net/">exceptions</a> do compile Python statically). </li>
<li>As long as we're dynamically translating high level abstractions into low-level executables, is there any chance that the "high level"-ness could be useful for parallelization? It sure would be nice to use those other cores...</li>
</ul>
<br />
To be clear, I am not talking about speeding up <i>all </i>of Python, though some very <a href="http://speed.pypy.org/">smart and praiseworthy folks</a> have been working on that for a while. Rather, the great thing about many numerically intensive algorithms is that they are remarkably simple. You might get away with using some subset of Python for implementing the core of your computation, and still feel like you are coding at a high level of abstraction (so long as the boundary between the numerical language subset and the rest of Python is mostly seamless).<br />
<br />
A surprisingly large number of projects have already risen to meet this challenge. They (roughly) fall onto a spectrum which trades off between the freedom of the compiler to dramatically rewrite/optimize your code and the expressiveness of the sub-language that is exposed to the programmer.<br />
<br />
<ul style="text-align: left;">
<li><a href="http://morepypy.blogspot.com/2012/11/numpy-status-update-5.html"><b>NumPyPy</b></a> - an attempt to reimplement all of NumPy in Python and then let PyPy do its <a href="http://libra.msra.cn/Publication/5910443/tracing-the-meta-level-pypy-s-tracing-jit-compiler">meta-tracing</a> magic. It seems to me that the all-or-nothing nature of PyPy's uncooperativeness with existing NumPy libraries makes this a Utopian misadventure in code duplication, requiring the reimplementation of a huge scientific computing code base with faint hope that a largely opaque general-purpose JIT can play the role of an optimizing compiler for scientific programs. Hopefully <a href="https://github.com/fijal">fijal</a> will prove us detractors wrong.</li>
<li><a href="http://numba.pydata.org/"><b>Numba</b></a> - one of the several cool projects <a href="http://technicaldiscovery.blogspot.com/">Travis Oliphant</a> has been cooking up since he started <a href="http://continuum.io/">Continuum Analytics</a>. For the most part, Numba's main purpose is to unbox numeric values and make looping fast in Python. It's still a work in progress and seems to be going in multiple directions at once. They're adding support for general-purpose Python constructs, but relying on the traditional Python runtime to implement anything non-numeric, which sequentializes their runtime due to the <a href="http://stackoverflow.com/questions/265687/why-the-global-interpreter-lock">Global Interpreter Lock</a>. To enable parallelism you can disavow using any constructs that rely on things Numba doesn't compile directly...but that requires that you know what those constructs are. Like I said, it's still evolving. The <a href="https://store.continuum.io/cshop/numbapro">commercial version</a> of Numba even touts some capacity for targeting GPUs, but I haven't used it and don't know what can actually get parallelized. </li>
<li><a href="http://blaze.pydata.org/"><b>Blaze</b></a> - another Travis Oliphant creation, though this one is even more ambitious than Numba. Whereas NumPy is a good abstraction for dense in-memory arrays with varying layouts, Blaze is intended to work with more complex data types and <i>"is designed to handle out-of-core computations on large datasets that exceed the system memory capacity, as well as on distributed and streaming data"</i>. Travis is billing Blaze as the successor to NumPy. The underlying abstractions are to a large degree inspired by the Haskell library <a href="http://www.cse.unsw.edu.au/~chak/papers/LCKP12.html">Repa 3</a>, which is very cool and worth reading about. One key difference between Blaze and NumPy (aside from the much richer array type) is that Blaze delays array computations and then compiles them on-demand. I get the sense that Blaze is pretty far off from being ready for the masses, but I'm sure it will be Awesome Upon Arrival. </li>
<li><a href="http://code.google.com/p/copperhead/"><b>Copperhead</b></a> - Copperhead takes the direct route to parallelism by forcing you to write your code using <a href="http://www.cs.cmu.edu/~scandal/alg/scan.html">data parallel operators</a> which have clear compilation schemes onto multicore and GPU targets. To further simplify the compiler's job, Copperhead forces your code to be purely functional, which goes far against the grain of idiomatic Python. In exchange for these semantic handcuffs, you get some pretty speedy parallel programs. Unfortunately, the author <a href="http://parlab.eecs.berkeley.edu/people/bryan-catanzaro">Bryan Catanzaro</a> has disappeared from github, so I'm not sure if Copperhead is still being developed. </li>
<li><a href="http://deeplearning.net/software/theano/"><b>Theano</b></a> - Theano is both more cumbersome and more honest than projects like Numba or Copperhead, which take code that <i>looks</i> like Python but then execute it under different assumptions/semantics. With Theano, on the other hand, you have to explicitly piece together symbolic expressions representing your computation. You're always aware that you're constructing Theano syntax explicitly. In exchange for your effort though, Theano can work small feats of magic. For example, Theano can group and reorganize matrix multiplications, reorder floating point operations for stability, and compute gradients using <a href="http://www.haskell.org/haskellwiki/Automatic_Differentiation">automatic differentiation</a>. Their backend has some preliminary support for CUDA and should eventually add in multi-core and SIMD code generation. </li>
</ul>
<div class="separator" style="clear: both; text-align: center;">
<img border="0" height="298" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi0tVEWFI1pxXknQCzhn4Bsl2L_nSrtH7SquSoXKUHKw-nvgxazJIsLIlh8h1xvhST1PBsCh_f6a67ynKZDzFcEug2H09Py3BSGsOQfCagxHHvB9mDJ8Ia8kNJk6F7BupLqkif_HQWjYCcr/s640/Python+compiler+gradient.png" width="640" /></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<i>Foot-in-mouth edit: I put PyPy all the way on the hopelessly sequential left of the diagram, just as they announced a new <a href="http://morepypy.blogspot.com/2013/01/numpypy-2013-developer-position.html">position</a> to parallelize and vectorize their JIT. Also, fijal justifiably <a href="http://www.reddit.com/r/Python/comments/17a4m6/the_landscape_of_justintime_compilers_for/c83wlkh">took offense</a> at my description of NumPyPy. I was wrong in saying that NumPyPy is a whole-ecosystem rewrite; they're only going to rewrite the core and are still figuring out the right way to interact with native libraries.</i><br />
<br>
<hr>
<p>
To add another compiler-critter into the fray, I've written <a href="http://www.parakeetpython.com">Parakeet</a>, a just-in-time compiler for numerical Python which <a href="http://research.microsoft.com/apps/pubs/default.aspx?id=115734">specializes</a> functions for given input types. Parakeet makes extensive use of data parallel operators such as <a href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.54.152">map, reduce,</a> and <a href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.54.152">(prefix) scan</a>. It's not essential to use these operators when programming with Parakeet, but they do enable parallelism and more aggressive optimizations. Luckily, it's quite easy to end up using these operators by accident, since our library functions are implemented on top of them.
</p>
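<p>For readers unfamiliar with these operators, the sequential Python analogues look like this (Parakeet's versions operate on arrays and can run in parallel):</p>

```python
from functools import reduce
from itertools import accumulate

xs = [1, 2, 3, 4]

squares = list(map(lambda x: x * x, xs))      # map: elementwise transform
total   = reduce(lambda a, b: a + b, xs, 0)   # reduce: combine to one value
prefix  = list(accumulate(xs))                # (inclusive) prefix scan
```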
<p>
<i>(edited to present Parakeet less sheepishly)</i>
<br>
On the spectrum described above, Parakeet sits somewhere between Numba and Copperhead. Like Copperhead, Parakeet's subset of Python is limited to using a small set of data types, library functions and data parallel operators. On the other hand, unlike Copperhead, you don't have to program in a purely functional style: if you write loop-heavy numerical code you'll miss out on parallelization but will still see good single-core performance. The main difference from Numba is the absence of any sort of "object layer" which uses the Python C API. Parakeet will (in the long run) support a smaller, more numerically-focused subset of Python for the purpose of giving the programmer a clear sense of what will run fast (and if a feature is slow, then Parakeet simply doesn't support it). Additionally, Parakeet's implementation of Python and NumPy library functions leans heavily on data parallel operators, which gives me hope for making pervasive use of GPUs and multi-core hardware.
</p>
<p>
If you want to learn more about Parakeet check out some of the following presentations:
<ul>
<li><i><a href="https://www.usenix.org/conference/hotpar12/parakeet-just-time-parallel-accelerator-python">HotPar 2012</a></i>: We submitted a paper describing an old version of our compiler (written in OCaml with a fragile GPU backend).
</li>
<li><i><a href="https://www.youtube.com/watch?v=ywHqIEv3xXg">SciPy 2013 Lightning Talk</a></i>: A 5-minute overview of the rewritten Parakeet with an LLVM backend.</li>
<li><i><a href="https://vimeo.com/73895275">PyData Boston 2013</a></i>: A longer presentation with more extensive comparison to Numba.</li>
</ul>
If you want to try using Parakeet, you can either install it via pip (<code>pip install parakeet</code>) or just clone the <a href="https://github.com/iskandr/parakeet">github repo</a>. Let me know how it goes!
</p>
</div>Alex Rhttp://www.blogger.com/profile/04049149415019007603noreply@blogger.com18