Concurrency in Python

[Image: snake quilt – "Snakes on a plain", by Linda Frost]

In an earlier post I described how the free lunch was over and how there was a renewed interest in concurrent and parallel programming.

So, what can we expect from Python? It’s a modern language, evolving rapidly under the watchful eye of Guido van Rossum, the Benevolent Dictator For Life, and it has a package for everything, including antigravity. The boast is that ‘batteries are included’.

So, needless to say, Python has a module for threading. Called threading. There’s a lower-level thread module with raw access to platform-independent threads, but it lacks the full set of objects that make working with threads palatable. That module has been demoted in Python 3 (renamed _thread) and you’re encouraged to use threading instead, since it has the higher-level constructs like Locks and Semaphores.

Something I rather like about Python is how the with statement does the cleaning up for you automatically, regardless of what exceptions interrupt the block. (Java added multi-catch last year in version 7, but that only handles multiple exception types in one clause; you still have to clean up yourself.) Locks can be used in the same way:

import threading
rlock = threading.RLock()

with rlock:
    print "code that can only be executed while we acquire rlock"
    #lock is released at end of code block, regardless of exceptions

The code above will execute the print statement (or whatever code is in its place) and release the lock regardless of any exceptions thrown. That’s helpful for avoiding deadlock and for recovering cleanly from problems.
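For comparison, the with block above is just shorthand for the acquire/release you would otherwise write by hand with try/finally (using the same rlock):

rlock.acquire()
try:
    print "code that can only be executed while we hold rlock"
finally:
    rlock.release()  # runs even if the block above raises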

One thing to remember with concurrency is that the old Python staple, print, starts to become less useful once threads are involved. You normally need to know which thread a statement came from, and if a program is complicated enough to need threads, it’s time for proper logging.
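A minimal sketch of that, using the standard logging module with the thread name in the format string (the names and format here are just illustrative):

import logging
import threading

# Include the thread name in every log line so output from
# different threads can be told apart.
logging.basicConfig(level=logging.DEBUG,
                    format="%(asctime)s [%(threadName)s] %(message)s")

def worker():
    logging.debug("starting work")

threads = [threading.Thread(target=worker, name="worker-%d" % i) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()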

A note on importing modules: remember that code can be executed when you import a module. Normally this is sensible and useful, but once threading is introduced it can cause some confusion. Spawning a thread during a module import is frowned upon, since you might cause a deadlock; let the importer pick the threading model. Another general rule for safety is that only non-daemon threads should do imports. Have another thread set everything up before spawning a daemon thread.

Overall, Python has the usual threading idioms that you find in most modern languages. So long as you’re careful and do your coding with a large cup of coffee and your fingers crossed, you’ll be fine. But that’s not the Pythonic Way – rather, there are higher level tools that you can use to get your underlying job done, which I’ll come to later.

First we have to go through a cycle of surprise, doubt, fear, anger and acceptance regarding the Global Interpreter Lock, or GIL for short.

Let’s take a typical computation that can be broken up into pieces and see how multiple cores can help. In this single-threaded example, SimpleMaclaurin.py, we take a simple Maclaurin series and sum it to a very high degree of precision. The 1/(1-x) expansion converges very rapidly, but it’s just an example to keep the CPU busy. Let’s use CPython 2.7.2 for this example.
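SimpleMaclaurin.py isn’t reproduced in this post, but a minimal single-threaded sketch of the idea, assuming Python 2 and illustrative values for x and the term count, would be something like:

import time

def maclaurin(x, terms):
    # Sum the Maclaurin series for 1/(1-x): 1 + x + x^2 + ...
    total, term = 0.0, 1.0
    for _ in xrange(terms):
        total += term
        term *= x
    return total

start = time.time()
result = maclaurin(0.5, 12800000)
print "%f in %.3f secs" % (result, time.time() - start)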

How would we speed this up on my 8-core Intel i7? Obviously we’d break the problem into parallel pieces and let different cores hammer away. There’s no shared memory to worry about, just a simple float passed back for each parallel task’s result. Threads should speed it up roughly linearly per CPU, right? A simple threading example is here, ThreadMaclaurin.py, and you can use any number of threads. (For production code you would control access to the results array with a lock, but I’m illustrating a different point here.) Using 1 thread roughly matches the simple single-threaded example (4.522 secs); the timings as you scale up the threads are shown after the sketch below.
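ThreadMaclaurin.py isn’t reproduced in this post either, but a rough sketch of that kind of split across threads might look like the following (the function names and values are my own illustration, not the original code):

import threading
import time

def partial_maclaurin(x, start_term, count):
    # Sum count consecutive terms of 1/(1-x), starting from x**start_term
    total, term = 0.0, x ** start_term
    for _ in xrange(count):
        total += term
        term *= x
    return total

def worker(results, i, x, start_term, count):
    # each thread writes to its own slot in the results list
    results[i] = partial_maclaurin(x, start_term, count)

def run(num_threads, total_terms=12800000, x=0.5):
    per_thread = total_terms // num_threads
    results = [0.0] * num_threads
    threads = [threading.Thread(target=worker,
                                args=(results, i, x, i * per_thread, per_thread))
               for i in range(num_threads)]
    start = time.time()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print "%d threads  %.3f secs  sum=%f" % (num_threads, time.time() - start, sum(results))

And here is what scaling up the thread count did to the wall-clock time: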

1 thread     4.623 secs for 12800000 iterations
2 threads    6.195 secs for 12800000 iterations
4 threads    6.047 secs for 12800000 iterations
6 threads    6.357 secs for 12800000 iterations
8 threads    6.006 secs for 12800000 iterations
[Image: rusty padlock, by Ian Britton]

What? You increase the number of threads and the time goes up? (In fact, the less loaded my machine is, the more the time goes up with the thread count!) Looking at the task manager, multiple cores are in use, so what’s going on? The answer is that the GIL problem has struck.

Python is an interpreted, dynamic language, and CPython’s memory management (reference counting) is not thread-safe, so only one thread at a time is allowed to execute Python bytecode in the interpreter. Worse, there’s constant signalling and lock acquisition between the threads to decide who gets the interpreter lock next. This chatter between threads is not necessarily fair (to use the concurrency term), so every thread apart from the one holding the GIL gets starved; whether the locking is fair or not depends on the operating system. Effectively, using multiple cores has re-serialised your problem, with a lot of lock-acquisition overhead added for good measure.

So, what happens if I use Jython? Jython is an implementation of Python that runs on the Java Virtual Machine. Essentially it compiles down to Java bytecode – it’s not interpreted, so there’s no Global Interpreter Lock. Here are the results of running the same ThreadMaclaurin.py code on my machine with Jython 2.5.2:

1 thread     5.855 secs for 12800000 iterations
2 threads    2.836 secs for 12800000 iterations
4 threads    1.581 secs for 12800000 iterations
6 threads    1.323 secs for 12800000 iterations
8 threads    1.139 secs for 12800000 iterations

That’s more like the scaling that we expect. A little slower for one thread, perhaps due to a slower JVM startup time, but blazing after that.

You may ask: why not use Jython all the time, to avoid this problem? Well, the truth is that Jython’s support lags behind mainstream Python. Many modules aren’t available for it, or are added later, and it has tended to be a bit behind on newer language features. But it’s perfect for interacting with Java programs, even if it doesn’t have modules like multiprocessing.

Bear in mind that these tests were run through my PyCharm IDE, and you’ll no doubt get different timings on your machine, but you get the general idea. If you add threads to a computationally intensive task and the overall speed goes down, start to suspect an interaction with the GIL. It will not slow down a problem that is purely IO-bound, so many of the normal benefits of concurrency are not lost: subjectively smoother response for the user, better management of network or disk IO, and so on.

A great talk by David Beazley on the GIL is here, where he really manages to convey the issue nicely and put forward some solutions for the mainstream. He has slides here for that talk, but I’d recommend watching, as he’s a good speaker.

Before you start getting too disgusted with Python, let me assure you – this isn’t a big problem, just something to be aware of. If your program slows down when you add cores, you can simply jump to the multiprocessing module. Here we get separate processes instead of threads, and since every process gets its own GIL, no problem! Inter-process communication is dead easy – arguments and results are passed around with pickle, and you can use Queues to pass messages between processes very easily. Doug Hellmann has a useful tutorial on multiprocessing, and several that follow on from the basics. Obviously there is an overhead to multiple processes, but unless huge amounts of memory are involved, it’s barely noticeable for a computationally intensive task.
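A minimal sketch of that pattern (this is not the post’s MultiprocessMaclaurin.py – the worker and the values are illustrative):

import multiprocessing

def worker(x, terms, queue):
    # Compute a partial result and hand back a plain float;
    # multiprocessing pickles the arguments and the result for us.
    total, term = 0.0, 1.0
    for _ in xrange(terms):
        total += term
        term *= x
    queue.put(total)

if __name__ == '__main__':
    queue = multiprocessing.Queue()
    procs = [multiprocessing.Process(target=worker, args=(0.5, 1600000, queue))
             for _ in range(8)]
    for p in procs:
        p.start()
    results = [queue.get() for _ in procs]  # drain the queue before joining
    for p in procs:
        p.join()
    print "collected %d results" % len(results)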

Looking at our Maclaurin series problem, we get these numbers using the multiprocessing approach, MultiprocessMaclaurin.py:

1 thread     4.561 secs for 12800000 iterations
2 threads    2.339 secs for 12800000 iterations
4 threads    1.464 secs for 12800000 iterations
6 threads    1.201 secs for 12800000 iterations
8 threads    1.120 secs for 12800000 iterations

Now, that’s better – proper scaling with the number of cores, and we get to use all the modules of CPython. We can distribute these processes and easily pass messages between them with queues or named pipes. As we will see, there are even higher-level ways of dealing with our problem, building on this.

Do be careful when spawning processes from a script! The child process needs to be able to import the script or module containing the target function, and where the child re-imports the main module (notably on Windows), any process-spawning code left unprotected at module level will run again in each child, recursively spawning processes until your machine locks up. For this reason I recommend saving all your work and shutting down your other applications for your first session exploring the multiprocessing package in Python. The ways to avoid the recursive behaviour are:

1. Have the target method in another module/script

2. Protect the executed code with a test for __main__:

import multiprocessing

if __name__ == '__main__':
    # worker and i are defined elsewhere in the script
    p = multiprocessing.Process(target=worker, args=(i,))
    p.start()

3. Use a properly object-oriented structure in your code, which is what you would do for production code anyway.

Feel free to ignore this advice – after you’ve rebooted your machine, you’ll definitely remember the point!

So, multiprocessing gets past the GIL, which is great. But still, it’s not quite the Pythonic Way. Far too many fiddly details when we want to be operating at a higher level and only dropping down the layers when we need to. What we really need is a module that simply parallelises tasks for us.

There are several packages that address the problem outright, including Parallel Python (pp). So here is our Maclaurin expansion implemented using pp: ParallelMaclaurin.py. It uses the logging API, so we get control over its output, which is nice. Not only will this package run on and use all the CPUs of the local machine, you can also farm tasks out to other servers very easily. That changes how we think about parallelising the problem – notice how the other examples deal in numbers of threads or processes; in this implementation we break the problem up into tasks and let the module use all the CPUs by default. I chose 64 tasks, not for any particularly smart reason, but you would normally tune it to the available cluster. We also lose the complication of passing results back through a queue – each job simply returns its result. The problem starts opening up seamlessly into symmetric versus asymmetric parallel processing – we start looking at it at a higher level.
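ParallelMaclaurin.py isn’t reproduced here either; a hedged sketch of the same idea with pp, leaving out the logging setup and using illustrative task sizes:

import pp

def partial_maclaurin(x, start_term, count):
    # Sum count consecutive terms of 1/(1-x), starting from x**start_term
    total, term = 0.0, x ** start_term
    for _ in xrange(count):
        total += term
        term *= x
    return total

job_server = pp.Server()          # uses all local CPUs by default
tasks, per_task = 64, 200000
jobs = [job_server.submit(partial_maclaurin, (0.5, i * per_task, per_task))
        for i in range(tasks)]
result = sum(job() for job in jobs)   # calling a job waits for and returns its result
print "result %f" % result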

For completeness, here is the time taken using Parallel Python on just the local machine:

Time taken 1.050000 secs for 12800000 iterations
[Image: corn snakes, photo by Justin Guyer]

Now, Parallel Python can feel like a job-management system rather than a concurrency solution, but that’s OK. It gives you stats back for jobs, lets you set up templates for tasks, and gives you control over the number of CPUs involved. You can start thinking in terms of Big Data interactions and functional programming, map/reduce, or whatever the high-level problem space is. Given that the main motivation set out at the beginning of this rather long post was to get our free lunch back – to get processing power moving again with less risk and complexity – Parallel Python fits the bill nicely. And there’s no risk of lock-in: there are plenty of other high-level packages too, and a good list of them is here. I haven’t even got into Stackless Python, which aims to solve concurrency in yet another way. But that’s a topic for another post.

To sum up – Python offers something for concurrency and parallel processing at every layer of abstraction: thread, threading, multiprocessing with inter-process communication, clustering, or any number of higher-level approaches. Start at the top and stop when you’ve solved your problem!

Links:

http://pyinsci.blogspot.com/2007/09/parallel-processing-in-cpython.html – useful application of much of this post, with more numeric examples.

A talk on how threading is unleashing hell by David Beazley in Chicago – very amusing.

Bruce Eckel on pp and its creator, and on concurrency in Python using Twisted (an asynchronous networking framework)
