Parallel Haskell Digest 9

Eric Kow – Thursday, 19 April 2012

The Google Summer of Code is upon us and students have already submitted their proposals. There are a couple of potential projects on concurrent data structures, which we'll have a look at below.

We will also be continuing our tour of Haskell concurrency abstractions with our word of the month: transaction. This digest is brought to you by the Parallel GHC project, an MSR-sponsored effort to push parallel Haskell technologies out into the real world. Check our project news below to see how we're doing on that front.

Finally, you may have heard Functional Propaganda from a Simon or two. But how would the same message affect you if it came from a hardcore C++ hacker? If you haven't seen it making the rounds, have a quick look at Bartosz Milewski's The Downfall of Imperative Programming, and maybe tell your imperative friends? The FP monster is hatching from its academic egg; best be prepared!

News

Let's have a quick look at some of those GSoC proposals, particularly those with a parallel Haskell theme. It's all about performance this year. Two of the proposals involve using or improving parallelism in Haskell, and four are related to high-performance concurrency.

  • Parallel CSG engine

    Constructive Solid Geometry (CSG) is a common approach to defining complex bodies in engineering applications, ray-tracing engines, and physical simulators. Dmitry Dzhus' proposal is to deliver a fast parallel CSG engine using the Accelerate/Repa libraries for vectorised computations on the GPU.

  • NUMA supporting features for GHC

    Sajith Sasidharan wants to reach into the GHC RTS, with the aim of “extracting that last ounce of performance from NUMA systems, by firing all CPUs if/when necessary and by ensuring a suitably NUMA-aware memory allocation behavior.” Work in this area would benefit folks doing scientific computing, who may need great amounts of task and data parallelism.

  • Windows support for the new GHC I/O manager

    The new I/O manager (GHC 7) brings great scalability improvements to Haskell — 10k simultaneous connections to your web server? no sweat. Unfortunately for Haskellers on Windows, these improvements are currently only available on Unix. Mikhail Glushenkov (who worked on GSoC last year to make improvements to Cabal) proposes to remedy this, making the new manager available on Windows. The challenge is that Windows I/O completion ports have slightly different semantics than their epoll/kqueue analogues on Unix.

  • Lock-free hash table and priority queue

    More along the theme of high-performance concurrency, Florian Hartwig's project aims to use GHC's new-ish atomic compare-and-swap (CAS) primitive to implement a high-performance lock-free hash table and a lock-free priority queue in Haskell. The CAS instruction is a bit of hardware support for concurrency: it compares a value in memory to an expected value, and if (and only if) they match, replaces it with a new value (see the sketch just after this list).

  • Implement Concurrent Hash-table / Hashmap

    Loren Davis also proposes a thread-safe mutable hash table implementation in Haskell. Loren is still weighing some of the alternative approaches suggested in the Haskell community. He is currently leaning towards a lock-stripping approach as it would fulfill an immediate need in the community.

  • Concurrent Datastructures with good Performance

    Mathias Bartl aims to implement two concurrent data types in Haskell, along with the usual battery of automated unit tests and benchmarks. The two that Mathias has in mind are a lock-free concurrent bag, and a concurrent priority queue.
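
To make those CAS semantics concrete, here is a minimal model with hypothetical names; the real GHC primitive (casMutVar#) does the read-compare-write below as a single atomic hardware operation, whereas this sketch borrows its atomicity from an MVar:

 import Control.Concurrent.MVar

 -- Model of compare-and-swap: install the new value only if the
 -- current value is the one we expected, and report whether we did.
 casModel :: Eq a => MVar a -> a -> a -> IO Bool
 casModel var expected new =
     modifyMVar var $ \current ->
         if current == expected
             then return (new, True)       -- matched: swap in the new value
             else return (current, False)  -- mismatch: leave it unchanged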

Parallel GHC project update

We have been continuing our work to make ThreadScope more helpful and informative in tracking down your parallel and concurrent Haskell performance problems. We now have the ability to collect heap statistics from the GHC runtime system and present them in ThreadScope. These features will be available for users of a recent development GHC (7.5.x) or the eventual 7.6 release. In addition to heap statistics, we have been working on collecting information from hardware performance counters, more specifically adding support for Linux Perf Events. This could be useful for studying IO-heavy programs, the idea being to visualise system calls as being distinct from actual execution of Haskell code.

Speaking of performance, we are also continuing work on the new Cloud Haskell implementation (see Duncan Coutts' Fun in the Afternoon Talk), and have lately been focused on reducing message latency. This consists of work in three areas: improving binary serialisation, investigating the implications of using Chan and MVar to pass messages between threads, and perhaps improving the Haskell network library implementation to compete better with a direct C implementation.

Word of the month

Lately, we've been investigating the various ways Haskell helps us to get to grips with concurrency. We talked about how the MVar, the Haskell variant on locks, allows us to share mutable variables between threads, with some safeguards to help ensure consistency. MVars may provide a nice high-level packaging around locks, but as we mentioned in the last digest, they can still go horrifically wrong, just like locks and synchronized methods in other languages.

We could go through the usual litany of reasons why locks are bad news, but maybe a healthier approach would be for us to focus on the positive. What do we want as programmers? One possibility is what Simon PJ (Beautiful Concurrency) calls “modular programming”, the ability to “[build] large programs by gluing together smaller programs”. Locks fall short of helping us to meet this desire. First, because the mere act of combining two locky programs may be inherently incorrect; withdraw acct1 amt >> deposit acct2 amt is bad because of the gap between the two actions where the money is in neither account. Second, because they seal off programs that we may otherwise like to moosh together; if process p1 waits for input on one pipe and process p2 waits for input on another, how do we wait for either of p1 or p2? So how do we wrestle back this modularity from our locky masters? And how do we make programming fun again?

Our word of the month today is “transaction”. Software transactional memory (STM) takes the idea of a transaction (a sequence of operations that can be treated as a single atomic block) from database design. The Haskell implementation of STM was introduced in the 2005 paper Composable Memory Transactions by Harris et al. If programming fun is what you're after, this is a paper that comes with its own war-cry: “compositionality: a programmer can control atomicity and blocking behaviour in a modular way that respects abstraction barriers.”

Here are some quick highlights of the stm library. You may notice a couple of things: first, that this library introduces its own notion of variable, the TVar (MVar, IVar; what, no pirate joke?), and second, that STM involves a new monad of its own. Unlike the MVars that we saw in the last digest, TVars do not have any notion of being full or empty; they just hold values, plain and simple. As for the STM monad, we will see why it matters when we first try to do some IO.

 -- Control.Concurrent.STM
 data STM a
 instance Monad STM
 
 atomically :: STM a -> IO a
 
 data TVar a
 newTVar   :: a -> STM (TVar a)
 readTVar  :: TVar a -> STM a
 writeTVar :: TVar a -> a -> STM ()
 
 retry  :: STM a
 orElse :: STM a -> STM a -> STM a

To get a rough idea how some of this is used, let's look at the transactional hello world, safely wiring money from one bank account to another. For the purposes of our example, a bank account is just a balance. To get some money from an account, we read the balance, subtract the amount, and write the new balance. Making a deposit is just withdrawing negative-money.

 type Account = TVar Int
 
 withdraw :: Account -> Int -> STM ()        
 withdraw acc amount = do
     bal <- readTVar acc
     writeTVar acc (bal - amount)
 
 deposit :: Account -> Int -> STM ()
 deposit acc amount = withdraw acc (- amount)

These primitive operations (withdraw and deposit) bring us to the question of modularity. How do we know that it's safe to combine these mini-programs into a bigger one? In other words, if we write something like withdraw from 42 >> deposit to 42, how do we avoid the possibility of running into some twilight zone state where the money is neither here nor there? If people do strange things like simultaneously transferring money in the other direction, will our program still work?

The answer lies in the distinction between STM (transactions) and IO (actions). So long as we remain in STM, we are simply assembling transactions, piecing smaller ones (“withdraw from a”) into larger ones (“withdraw from a and deposit it to b”), but not actually performing them! Having composed our transactions, we can use the function atomically to turn them into IO actions.

 -- still just a transaction
 transfer :: Account -> Account -> Int -> STM ()
 transfer from to amount = do
     deposit to amount
     withdraw from amount
 
 -- now we have an action!
 doTransfer :: Account -> Account -> Int -> IO ()
 doTransfer from to amount =
     atomically $ transfer from to amount

And atomically does what it says on the tin: it runs the transaction in a way that renders it indivisible, with no twilight zones. Lest there be any confusion: even though the transaction is indivisible, we can still have concurrency during the course of the transaction, even simultaneous reads of the affected TVars if we want them. The indivisibility simply means that we never catch our transactions with their pants down. We neither read nonsensical mid-transaction values (a simultaneous read gets either the before value or the after value), nor inject values into a transaction mid-stream.

To get a feel for how these guarantees are possible, it could be useful to take a peek under the hood. For each transaction that is run, GHC maintains a thread-local log with an entry for each TVar accessed in that transaction. Each entry contains both the old value and the new value that would be committed if the transaction is successful. This may be easier to see with a silly example:

-- imports needed to run this example
import Control.Concurrent (forkIO, threadDelay)
import Control.Concurrent.STM
import Control.Monad (when)
import System.Random (randomRIO)

main = do
    v1 <- atomically $ newTVar "Joe"
    v2 <- atomically $ newTVar "Bob"
    done <- atomically $ newTVar 0
    -- thread A (you can just pretend forkDelayIO == forkIO)
    forkDelayIO . atomically $ do
                              -- transaction log if A runs first
        x <- readTVar v1      -- v1: Joe -> Joe
        y <- readTVar v2      -- v1: Joe -> Joe, v2: Bob -> Bob
        writeTVar v1 "Sue"    -- v1: Joe -> Sue
        writeTVar v2 x        -- v1: Joe -> Sue, v2: Bob -> Joe 
        writeTVar v1 y        -- v1: Joe -> Bob, v2: Bob -> Joe
        modifyTVar done (+1)  -- (stm 2.3 but easy to define)
    -- thread B 
    forkDelayIO . atomically $ do
                              -- (if A runs first)
        writeTVar v1 "Jean"   -- v1: Bob -> Jean
        writeTVar v2 "Paul"   -- v1: Bob -> Jean, v2: Joe -> Paul
        modifyTVar done (+1)
    waitThreads 2 done
    people <- atomically $ do -- (if A runs first)
        p1 <- readTVar v1     -- v1: Jean -> Jean
        p2 <- readTVar v2     -- v1: Jean -> Jean, v2: Paul -> Paul
        return (p1, p2)
    print people -- if A runs first, (Jean, Paul)
                 -- if B runs first, (Paul, Jean).

-- boring details just for this example
forkDelayIO job = forkIO $
    randomRIO (1, 1000000) >>= threadDelay >> job
waitThreads n v = atomically $
    do { d <- readTVar v;  when (d < n) retry }

In the above, we fork off two threads: A, which swaps a pair of names, and B, which overwrites them with other names. Atomicity here means that other threads never see any intermediate states, and that state changes from other threads don't affect the current transaction. For example, thread B should never see v1 set to "Sue". Likewise, thread A should still read "Joe" from v1 even if B simultaneously writes "Jean".

This is made possible by validation of the transaction logs. Validation normally occurs at the end of a transaction (we won't cover the two other cases here: exceptions, and thread wake-ups). It consists of checking that all the expected “before” values for TVars still match reality. If the logs are good, we commit the new values; if not, we simply discard them and try the transaction again, taking the new reality into account. This validate-and-commit model allows us to run transactions simultaneously, safely, but with the occasional rollback and retry to ensure atomicity.
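
As a toy illustration (this is just a model, not how the RTS actually represents things), validating a log amounts to checking each entry's recorded “before” value against what is actually in memory now:

 -- One log entry per TVar touched: the value we read when we first
 -- accessed it, and the value we would commit on success.
 data Entry a = Entry { before :: a, after :: a }

 -- The log is valid only if every TVar still holds the value this
 -- transaction originally read; otherwise, discard and re-run.
 validates :: Eq a => [(a, Entry a)] -> Bool  -- (value in memory now, entry)
 validates entries = all (\(now, e) -> now == before e) entries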

The notion of a transaction log brings us to the notion of cost. Good things don't always come cheap, and using a good thing like STM may require a little familiarity with the cost model behind it. Basically, it's important to keep in mind that the values we write to TVars may come from arbitrary expressions, and arbitrary expressions may be arbitrarily expensive. So being forced to retry a transaction may involve redoing something expensive. If a transaction touches many variables, the chances of hitting a retry go up. Likewise, if a transaction takes a long time to run, the chance goes up of some other thread making a change that triggers a retry. In the pathological worst case, you can have some transactional behemoth that never manages to commit, because some smaller, faster transaction keeps stealing its thunder. So keep an eye out for starvation and for the more general problem of retries being expensive.
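
For instance (a small sketch using the API above), a transaction that reads a whole list of TVars gets one log entry per variable, so a concurrent commit to any one of them sends the whole computation back to the start:

 -- Retry-prone: the log has an entry for every TVar in the list, so a
 -- concurrent write to any of them invalidates the log and the entire
 -- sum is recomputed.
 sumAll :: [TVar Int] -> STM Int
 sumAll tvs = fmap sum (mapM readTVar tvs)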

Cost may be a bit of a bummer, but there's also a Haskell-related silver lining behind all this. Because we have a purely functional language with an enforced separation between pure functions and side-effecting actions, STM is actually quite practical in Haskell. The number of things we need to track in a transaction log is limited to a handful of explicit TVars, rather than just about everything. If you are coming from another language, you may have a memory of STM as being nice, but wildly impractical. Not so in Haskell. Eminently viable.

Aside from making STM practical, this sort of separation is also good for general peace of mind. Suppose, for example, that we coded up a feature in our banking software to send our users a text message alert whenever their balance falls below a threshold. If we were in the middle of a complicated transaction, we might be tempted to slap that logic right in the middle of the transaction; however, the Haskell implementation makes this deliberately impossible. This can be a bit frustrating at first (new Haskellers are sometimes faced with the “how do I get this out of the monad” puzzle), but it saves us the greater danger of bombarding our users with spurious retry-induced text messages: a retried transaction would mean a re-sent message.
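
To illustrate (a sketch with hypothetical names, building on the Account type above, plus when from Control.Monad): we return the interesting fact from the transaction and do the IO afterwards, outside atomically:

 -- The STM type prevents us from texting mid-transaction; instead we
 -- report back what we learned and act on it afterwards, in IO.
 withdrawAndCheck :: Account -> Int -> STM Bool
 withdrawAndCheck acc amount = do
     withdraw acc amount
     bal <- readTVar acc
     return (bal < 100)  -- hypothetical alert threshold

 alertingWithdraw :: Account -> Int -> IO ()
 alertingWithdraw acc amount = do
     low <- atomically (withdrawAndCheck acc amount)
     when low (putStrLn "Balance alert!")  -- sent once, never per retry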

The guarantees that STM offers make it a great place to get started with Haskell concurrency. After all, why make software any buggier than it needs to be? If you do want to get started, have a look at Simon Peyton Jones' Beautiful Concurrency. It's a particularly good idea to do so, because there's some really interesting ground that we've not covered here: briefly, blocking (the retry function aborts the current transaction and causes it to be retried when appropriate) and choice (a orElse b tries a, and if that should retry, tries b, and if that should also retry, retries the whole expression). Other great STM resources are Simon Marlow's tutorial on parallelism and concurrency and the Real World Haskell chapter on STM. With the four resources combined, you'll see a nice range of examples from the usual bank-account one to concurrently shuffling windows between desktops.
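
As a small taste of those two combinators (a sketch in the style of the paper's examples, reusing our Account type):

 -- Blocking: if the balance can't cover the withdrawal, retry aborts
 -- the transaction and re-runs it when a TVar it read has changed.
 limitedWithdraw :: Account -> Int -> STM ()
 limitedWithdraw acc amount = do
     bal <- readTVar acc
     if amount > bal
         then retry
         else writeTVar acc (bal - amount)

 -- Choice: try account a first; if that blocks, try account b instead.
 withdrawEither :: Account -> Account -> Int -> STM ()
 withdrawEither a b amount =
     limitedWithdraw a amount `orElse` limitedWithdraw b amount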

Blogs

  • The Downfall of Imperative Programming (9 Apr)

    Take a hardcore C++ veteran with imperative programming in his bloodstream and loads of experience under his belt. Now give him a passion for concurrency, and what do you get? First, you get a keen awareness that the future is massively multicore. Second, you get a hard-won appreciation for how difficult concurrent programming can be; for all the jokes we make in the Haskell community about firing the missiles, the consequences of data races can sometimes be deadly. Third, you get the conviction that functional programming is the inevitable way forward.

    Bartosz Milewski sums up the situation thus: “Sooner or later you’ll have to face the multicore reality. You will be forced to learn functional methods to the extent to which your imperative language supports them. Despite that, data races will infest your code and leak into released products. So you might as well take a shortcut and embrace a full blown functional language now.”

    See what you make of his blog post if you have not done so already. There's quite a bit of buzz about this post, so you may also be interested in the programming reddit discussion around it.

  • Building A Concurrent Web Scraper With Haskell (10 Mar)

    Let's make a concurrent web scraper! This blog post by Aditya Bhargava presents a hands-on introduction to both arrows (via hxt) and concurrency (via parallel-io). Aditya builds from the bottom up, showing us little pieces of program that we might cobble together, culminating in a 52-line Haskell program that crawls web sites and fetches the images within their pages. The parallel-io library used in this tutorial provides a thread pool which minimises contention by guaranteeing a limit on the number of unblocked threads running at the same time. It uses lock-based concurrency with MVars under the hood; see the sketch below.
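
    For a rough idea of the library's surface (a sketch assuming the global-pool interface in Control.Concurrent.ParallelIO.Global; the URLs are stand-ins):

     import Control.Concurrent.ParallelIO.Global (parallel_, stopGlobalPool)

     -- Run a batch of IO actions on the bounded global thread pool;
     -- stopGlobalPool should be called once all parallel work is done.
     main :: IO ()
     main = do
         parallel_ [ putStrLn ("fetching " ++ url) | url <- urls ]
         stopGlobalPool
       where
         urls = ["one.png", "two.png", "three.png"]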

  • 0MQ and Haskell (6 Mar)

    Magnus Therning couldn't find any excuse to look into 0MQ, but “to hell with reason”: he ended up just poking around without any specific goal in mind. He found a nice tutorial based on Python and translated its mini examples into Haskell. Magnus wonders why the API for subscribe is String-based rather than ByteString-based. Also, he's finding that his client mysteriously dies after receiving a few messages. Any comments?

  • SIMD Support for the vector library (27 Mar)

    Single instruction, multiple data (SIMD) is the sort of thing you might be interested in if you're into data parallelism: hardware that can perform the same instruction on multiple data simultaneously. Geoffrey Mainland posted about his efforts to bring SIMD support to GHC, the bigger picture being that you ought to be able to write nice high-level Haskell and have it work as fast as the low-level Haskell or C that you might otherwise crank out. To try things out, Geoff benchmarks taking the dot product of two vectors in various Haskell and C versions. No happy ending yet, unfortunately: while the low-level Haskell version is competitive with C, the high-level one is not. Check out the post for a dissection of the results down to the Core and assembly level. Hopefully there will be better news in a follow-up post.

  • Adding SIMD Support to Data Parallel Haskell (18 Apr)

    Hopefully better news? Not as such, but perhaps something more interesting. In the previous post, we could make use of SIMD support by issuing explicit instructions from the vector library. OK, but what about people who are writing parallel code, say, using a parallel arrays framework? And what if you could get this SIMD support virtually for free — no syntax attached? Geoff makes this possible by extending the Data Parallel Haskell framework so that you only have to tweak a single import statement, and exploitation of SIMD instructions is automatic. See the post for some nice benchmarks and also a brief introduction to Data Parallel Haskell.

  • Work Efficient Higher-Order Vectorisation (24 Mar)

    Found on Manuel Chakravarty's tumblr: “Our new draft paper on Work Efficient Higher-Order Vectorisation introduces a novel representation for nested, irregular parallel arrays that enables the work-efficient SIMD-ification of nested data parallelism — i.e., nested parallelism is transformed into flat parallelism, while maintaining the work complexity of a naive pointer-based representation of nested arrays. This solves a long-standing problem that dates back to the original implementation of the language NESL.”

Talks, tutorials, and packages

  • stm-chans 1.3.1 (1 Mar)

    The stm-chans package offers a collection of channel types, similar to Control.Concurrent.STM.TChan but with additional features. This latest update by wren ng thornton takes advantage of optimisations in the newly released stm-2.3. It's highly recommended that all users bump their minimum stm-chans requirement to version 1.3.1.

  • accelerate 0.10.1 (12 Apr)

    Manuel Chakravarty has just released version 0.10.0.0 of Accelerate, an embedded language for GPU-accelerated array computations in Haskell that targets NVIDIA's CUDA framework and also has an experimental (and partial) OpenCL backend. A considerable amount of example code is in the companion package accelerate-examples. The main user-visible changes in this release are frontend bug fixes.

  • Parallel Functional Programming course

    Students at Chalmers and Gothenburg University are currently 6 lectures into a course on parallel functional programming. The course has so far covered parallelism with par/pseq, Strategies and monad-par, using ThreadScope, and skeletons as a means to structure parallel computations. The course page has lecture notes and exercises which could be of interest even if you aren't currently following the course.

  • Cloud Haskell (Fun in the Afternoon)

    Well-Typed's Duncan Coutts was at the recent Fun in the Afternoon (a termly seminar on Functional Programming in the UK). Duncan presented some of the motivation behind Cloud Haskell (“Erlang for Haskell”) and distributed programming, along with the Cloud Haskell design and our work on a new implementation to follow on from the initial prototype by Jeff Epstein. Our new implementation adds a swappable network transport layer. If you're happy with TCP/IP, don't wait for the new implementation; just cabal install remote and give Jeff's prototype a try.

Mailing lists

Concurrency

  • Transactional memory going mainstream with Intel Haswell (9 Feb)

    Ben was wondering if any STM experts would comment on this recent Ars Technica article on the Intel Haswell chip. Austin Seipp pointed us to a comment by Duncan Coutts in the Reddit discussion (unfortunately not: the new extension would sledgehammer all instructions between XBEGIN and XEND). Ryan Ingram suggests that maybe the extension could be used to optimise the existing implementation, perhaps by wrapping transaction commits with XBEGIN/XEND.

  • Behavior of -threaded in GHC 7.4.1? (14 Feb)

    Mike Craig has been debugging a recent issue with GHC 7.4.1 and the zeromq3-haskell library, which provides an FFI binding to libzmq. Unfortunately, code which used to run when compiled on GHC 7.0.4 dies with “operation on non-socket” when built with GHC 7.4.1. With the latter GHC, Mike can only run the code if he omits -threaded, or if he uses the -V0 flag to turn off the RTS clock and its associated signals. After more debugging, he tracked the problem down to an addFinalizer on a Socket tuple. The finalizer was being run prematurely, perhaps because the Socket type was being optimised away. Putting the finalizer on the Ptr () inside the tuple seems to solve the problem.

  • Question about concurrency, threads and GC (2 Mar)

    Paul Graphov is trying to implement a networked application that supports bidirectional conversations, ie. not just request/response, but also sending notifications to clients. Paul is particularly interested in STM, but he's stuck on a bit of a design problem. His thinking so far is that he'll need to start 3 threads for each client, one to read data from the socket, one that sends queued messages to that socket, and one for the main behaviour loop.

    Joey Adams noticed that this was the same sort of problem we reported in the previous digest. Joey had been grappling with making asynchronous I/O composable and safe. He wound up not using the stm-channelize package that he wrote, and recommends instead a 3-thread solution, using a thread each for receiving, sending, and coordination. Check out the small Haskell chat server that Joey wrote to illustrate the idea.

    Alexander V Vershilov suggests a data-driven behaviour based on conduits and STM channels. He's also provided an example chat server, which you can compare against Joey's version. The two examples take a similar approach, and could perhaps be combined to good effect.

  • "killThread" hangs! (ironic) (25 Feb)

    Ryan Newton is gathering information in preparation for a possible bug report. He's testing the new network transport layer in distributed-process (Cloud Haskell) and gets hangs in killThread. Strangely, the pattern for hanging goes: GHC 6.12.3 [OK], 7.0.2 [HANGS], 7.2.1 [HANGS], 7.4.1 [OK]. Any ideas? Simon Marlow suggests it may be a bug in the RTS asynchronous exception handling code, fixed with commit fa71e6c.

  • Synchronizations in memory allocation? (21 Mar)

    Following up on the recent scaling bottleneck thread, Ryan Newton wondered: “What is the reason for GHC managing all this pinned memory for foreign pointers itself rather than using an external C malloc/free implementation and thus keeping disjoint Haskell and C heaps?” Simon Marlow says it's basically because GHC's mallocForeignPtrBytes is much faster than malloc()/free().

    Ryan was asking because he is looking into how to do better on NUMA platforms. “We've got a NUMA-aware work-stealing scheduler now for monad-par, but it isn't really helping much yet. So we need to answer the question of how well our memory is being localized to socket-preferred physical addresses.” NUMA isn't something the GHC team have looked into for the RTS yet, though Simon has some ideas for improvements to the block allocator; more details in the thread.

  • Haskell for BigData (16 Mar)

    Andrei Varanovich observes that while Haskell has a lot to offer in the world of parallel/concurrent programming (from DPH to Cloud Haskell), it still lacks two important components for working with Big Data: integration with a distributed file system, such as the Hadoop distributed file system, and a data aggregation framework (e.g. MapReduce, but of course something much richer, this being Haskell and all).

    Andrei was interested in submitting a Google Summer of Code proposal to build a big data framework for Haskell on top of Cloud Haskell. I didn't see a proposal this year, but maybe next time? See the thread for technical suggestions, supportive comments, and pointers for a successful Haskell GSoC project.

Parallelism

  • Help wanted! Parallelism causes space leaks (23 Mar)

    Yavuz Yetim posted a small chunk of code using Strategies for parallelism. When he enables his parList rdeepseq strategy, though, he gets a stack overflow on smallish input (a 1 MB file), even if he allows GHC to use a 1 GB stack. Switching to parMap, parListChunk, and other strategies doesn't seem to help either.

  • Data.Array.Accelerate initialization timings (20 Feb)

    Paul Sujkov is finding that array initialisation in Data.Array.Accelerate takes 10x as long as either Data.Array or bare C++ CUDA array initialisation. Is there anything in particular Paul might be doing wrong? The accelerate package currently provides two backends: an interpreter (the reference implementation) and a CUDA backend generating code for CUDA-capable NVIDIA GPUs. Martin Dybdal comments that Paul should use Data.Array.Accelerate.use to hint that arrays should be transferred to the GPU, and Data.Array.Accelerate.CUDA.run to actually perform the transfer (see the sketch below). Manuel Chakravarty adds that the fromList function is really just meant for testing, or for initialising small arrays. For anything bigger, going from vanilla lists is a bad idea, so have a look at Data.Array.Accelerate.IO.
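
    A rough sketch of that use/run pattern, against accelerate's documented API (details vary between versions):

     import qualified Data.Array.Accelerate as A
     import qualified Data.Array.Accelerate.CUDA as CUDA

     -- use embeds an existing array in the computation; CUDA.run
     -- compiles the expression and performs the GPU transfer and launch.
     addOne :: A.Vector Float -> A.Vector Float
     addOne v = CUDA.run (A.map (+1) (A.use v))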

  • Reasons for Super-Linear Speedup (5 Mar)

    Burak Ekici has parallelised RSA decryption and encryption schemes using second generation strategies. He's getting a 10x performance improvement… on a quad-core CPU (with an 8MB cache). Is this just mismeasurement, or are there differences in how GHC handles serial and parallel computation, say with respect to cache usage? Bardur Arantsson replies that the usual explanation for this sort of thing is that the working data suddenly fits within the per-CPU L2 cache when split up.

Help and Feedback

If you'd like to make an announcement in the next Haskell Parallel Digest, then get in touch with me, Eric Kow, at parallel@well-typed.com. Please feel free to leave any comments and feedback!