Parallel Haskell Digest 8

Eric Kow – Friday, 02 March 2012

It's time for our next catch-up with the Parallel Haskell community. Did you have a chance to see Simon Peyton Jones's talk The Future is Parallel, and the Future of Parallel is Declarative? It's a good survey of some of the directions that parallel Haskell has been taking, and if you're somewhat new to this stuff, a great feel for the breadth of the community. You'll get a better idea why you see people in this digest muttering about arrays, others trying to make channels and transactions work together, and yet others talking up the 0MQ protocol. So check it out!

We at Well-Typed are writing the digest as part of our mission to push Haskell parallelism and concurrency out into the real world. We are very excited about these technologies and we want to do whatever it takes to make it more accessible to everybody. More news below on how we're doing in this Parallel GHC project.

News

GHC 7.4.1 released

Of particular interest to parallel Haskellers: better profiling flags, multicore profiling, vastly improved DPH, event logging, more convenient RTS flags (no more -rtsopts needed for common options like -N). See that new event logging stuff? Could be a good chance to play with the new spark profiling features from ThreadScope.

Summer School on Functional Programming for Parallel and Concurrent Applications (2 weeks)

Imagine this: two summer weeks in the Southern France on the Côte d'Azur, learning about Parallel Haskell. This would be the CEA-EDF-INRIA Summer School on Functional Programming for Parallel and Concurrent Applications. It takes place on 11-22 June 2012, Castle of Cadarache, Saint Paul Lez Durance, France.

No functional programming experience needed. The aim of the summer school is to give a thorough and application-oriented introduction to functional programming using the programming language Haskell. A special focus is on parallel and concurrent programming, highlighting the ways in which features such as strong typing and purity make it dramatically easier to write reliable parallel or concurrent code. The school is split into three different courses that highlight different aspects of functional programming. All courses consist of lectures and hands-on sessions where everyone can try out the language on several exercises.

Well-Typed's Andres Löh will be there, as well as Ralf Hinze, and Simon Marlow. The registration deadline is on 30 May, but I wouldn't procrastinate if I were you; the course sounds really tempting!

Jobs

PhD program at Portland State accepting applications for Fall 2012 (18 Jan)

There's quite a bit of functional programming research at PSU, and one of the professors (Andrew Black) is a co-author on the Cloud Haskell paper!

6 Funded Research Studentships at St Andrews

The School of Computer Science at the University of St Andrews has six prize PhD studentships available. These studentships are funded through the Scottish Informatics and Computer Science Alliance and provide both fees and maintenance (£13,500) per annum. They can be taken up by students of any nationality (including non-EU students). The deadline is today (1 March), so if you're interesting, it might not be too late to get in touch with Kevin Hammond.

Computer Science Lecturer Job at Kent

The School of Computing at the University of Kent is seeking to appoint a high flying academic who will spend the first 3 years in post developing as an independent researcher of internationally excellent quality in an area relevant to the School of Computing. This does not look Parallel Haskell specific, or even FP specific, but Simon Thompson did tweet that “the functional programmers at Kent would be particularly keen to recruit someone who would work with them!“

Parallel GHC Project Update

ThreadScope 0.2.1 has been released! This version contains the features we had demonstrated at the Haskell Implementor's Workshop in September 2011. Since our workshop prototype, we have greatly refined the spark histogram feature, tuning the visualisations so that they are easier to understand. We've also written a small tutorial to go with the new release. The ThreadScope Tour works through concrete examples on using ThreadScope to debug the performance of parallel programs. We'd love feedback you have about the tutorial, especially things you feel like you need a little more help with.

Along with the new ThreadScope and tutorial, we also have a new version of the ghc-events package which now provides an encoding of the meanings of events in state machines. This makes it possible to validate eventlogs, and doubles as an always up-to-date source of code as documentation.

We've made some progress on our work in developing a swappable transport layer for Cloud Haskell. We now have a prototype implementation “distributed-process” (intended to be the sucessor to “remote”, the current Clound Haskell implementation). For more details, see the distributed-process GitHub page, particularly the examples and the design document, which incorporates feedback on our initial proposal.

Finally a bit of partner news to wrap things up:

Tim (LANL) has been fine-tuning and profiling their physics simulation tool. One change in particular has been to switch to a “counter-based” RNG scheme which lends itself nicely to use in parallel or distributed settings. They are also currently having a look at Haskell-MPI and Cloud Haskell.

Finlay (Dragonfly) is working on a tricky performance problem, trying to speed up a program with multiple relatively small computations and relatively many synchronization points in between.

Toni (Telefonica) is benchmarking the maximal clique enumeration algorithm on a large number of randomly generated graphs. He is getting good speedups for up to 6 cores, with some stagnation around 8 cores. Current research is either finding a graph traversal strategy that works reasonably well with all graphs, or have one which somehow adapts to the properties of the graph.

Kazu (IIJ) has made several new releases of his libraries, and bug-fixes for a lot of third party libraries. He has also been participating in the discussion that lead to the development of Michael Snoyman's conduit library, and switched Mighttpd over to good effect.

Mark (VETT) is learning how to use Cloud Haskell. He is currently writing some example programs to test his understanding. Hopefully more from Mark when he's had more time to play with this.

Word of the month

Over the next few digests, we'll be switching our focus from parallelism to concurrency. We tend to stress the distinction because Haskell offers ways to write parallel programs without making explicit use of concurrency. Parallelism done right gets us faster programs. Concurrency on the other hand buys us… concurrency. It's not going away. If every multicore computer in existence were to vanish, we would want to solve concurrent problems. Whether the simultaneity is real or simulated, we would still want to do more than one thing at the same time – accept user input, display progress messages, serve multiple clients.

So let's dig in! We first got a taste of concurrency in second Parallel Haskell digest, where we introduced the notion of threads. As an abstraction, threads give us a way to express the separation of concerns between different jobs. But this isn't enough. Sometimes we need to undo the separation just enough to pass information from one thread to another.

This brings us to our word of the month: MVar. The humble MVar (pronounced “em-var”) is one of many solutions for this communication problem, a fairly low-level one by Haskell standards, but one that is still useful enough that you'll see it used very frequently. An MVar is like a burri… wait, wrong tutorial. Actually, it is helpful to think of an MVar as a box in the sense that it holds values and can either be full or empty. The MVar type takes a type variable, so an MVar Int might hold an integer , an MVar String a String, an MVar [String] a list of strings and so on.

   -- Control.Concurrent.MVar
   data MVar a
   newEmptyMVar :: IO (MVar a)
   takeMVar :: MVar a -> IO a
   putMVar  :: MVar a -> a -> IO ()

To give an idea how this might be used, below is a small program that fetches some URL in one thread while doing something else in the other. We fork off a Haskell thread that does the fetching and write to MVar to indicate what we've retrieved. In the main thread, we do our other work and then just wait until the page has been fetched.

 main = do
    m <- newEmptyMVar
    
    forkIO $ do
      r <- getURL "http://en.wikipedia.org/wiki/Shovel"
      putMVar m r
    
    doSomethingElse
    
    r <- takeMVar m
    putStr r

These MVar's may look a little familiar if you've used IORefs in Haskell. Here is a mini API for comparison:

   -- Data.IORef
   data IORef a
   newIORef   :: IO (IORef a)
   readIORef  :: IORef a -> IO a
   writeIORef :: IORef a -> a -> IO ()

So what exactly do MVar's buy us? Why not just use IORefs to share mutable variable across threads? The reason is that coordination between threads can get messy: we want to make sure that any value we pass from one thread to another is accounted for (and not accidentally overwritten before being consumed), that we don't try to consume values that are somehow out of date with respect to other threads (that updated values are received instead of an old value being read twice). Suppose we wanted to fetch a URL while doing something else at the same time. How do we know when we have successfully retrieved it?

 -- don't write this at home!
 inadvisableMain = do
    m <- newIORef "" -- default value? :-(
    
    forkIO $ do
      r <- getURL "http://en.wikipedia.org/wiki/Shovel"
      writeIORef m r -- are we overwriting something? :-(
    
    doSomethingElse
     	 	
    r <- readIORef m -- is there something to read? :-(
    putStr r

In the example above, we have no idea if the page at URL would have been fetched by the time we try to display its contents. What we are looking for is a synchronisation mechanism. We need a way to indicate that our shared values are ready to be used. For example, we might hit upon the idea of combining IORef with Maybe. Now we have a some extra bureaucracy to handle. If we read a value and get Nothing we would then know that there isn't yet a value ready to be read. One way or another we would have to account for this case, for example busy waiting until we get a Just. On the flip side, we want to make sure that when somebody has written a value, the intended recipient has read it before we write over it. This sort of bureaucracy would be best packaged up into helper functions, functions that look awfully like takeMVar and putMVar might. Notice the change in name even. Now we're not just reading, but taking, emptying the box to signal that it's been read; and we're not just writing, but putting, only writing when the box is empty. Throw in a little help from the runtime system so that we're doing something smarter than busy waiting and we'll have worked our way back to the MVar.

So the MVar combines references with locking to provide for synchronisation between threads. If you're coming from other languages, this should sound rather familiar. C programmers may have used mutexes (MVar ()) or semaphores (MVar Int) to protect shared data. Java programmers may have used synchronized methods and statements to prevent thread interference and memory inconsistency problems. The MVar is just a slightly nicer Haskell packaging to the same sort of idea. This means it suffers the same problems as its locky sisters. Sure, having the locks be implicit and putting them where they count (the data being shared) makes life a bit simpler, but at the end of the day locks are still locks. Take them in the wrong order, and you can still get deadlocks, races and all those subtle hard-to-reproduce bugs that keep programmers up at night.

What is the hapless foot-shooting programmer to do? The good news is that MVar's are only one of several mechanisms for dealing with concurrency. Safer mechanisms exist, albeit at a cost. MVar's present a compromise between performance and safety. If you are extra extra careful, you can on the one hand squeeze some serious performance out of atomicallyModifyIORef for concurrent data structures. If on the other hand, if you're willing to take a potential performance penalty in exchange for never worrying about a deadlock again, stay tuned for our next word of the month, “transaction”. For more about MVar's in the meantime, have a look at Edward Z. Yang's blog for an MVar overview as well as the updated API documentation, and finally the concurrency chapter from Real World Haskell.

Videos

It's Raining Haskell (30 minutes)

Charles from MSDN's Channel 9 had a chance to catch Simon Peyton Jones and John Hughes for a chat at the YOW conference in Sydney. As John observed during the interview, “Haskell is picking up more and more real users now, and that's very pleasing, but one of the reasons is because it offers a way of getting to grips with parallelism without suffering all the difficulties that you do with imperative programming languages, so I think that's a big opportunity for Haskell. I'd hope it won't just be a source of ideas for other languages, but that it will actually see a lot of real use itself.”

The Future is Parallel, and the Future of Parallel is Declarative (1 hour)

Simon wasn't at YOW just to give interviews on park benches. During his presentation, Simon gives us a sense of what a parallel future might look like. He skips past the usual FP-for-multicore propaganda, jumping straight to the prediction that “in the end, something declarative will win”. The bad news is that “there is no parallelism without tears”. If we want parallelism, we need to develop parallel algorithms, and understand the cost models behind the approaches we use. But one cost model does not fit all problems, so “no silver bullet; embrace diversity”. Simon then presents some of the different paradigms on offer, using Haskell as a single linguistic framework, but not specifically pushing the language. He covers task parallelism (threads, STM, Cloud Haskell), semi-implicit algorithm (par and friends) and data parallelism (Repa, DPH).

It's worth knowing that the YOW conference is “for developers by developers” and Simon has given the talk for a non-Haskell audience. So if you really just want an idea what all the parallel fuss is about, give his talk a try!

Real-time ray tracing demo with Repa

As posted by Ben Lippmeier: Real-time ray tracing demo written in Haskell with a pre-release of the Gloss-Field and Repa 3.0.0 parallel array libraries. Demo is running in real time in parallel on a four core machine with two hardware threads per core.

Barnes-Hut Gravity Simulation

More demos? OK! How about this gravity simulation using Gloss? A naive n-body simulation has quadratic runtime complexity, O(n²); but with the Barnes-Hut algorithm, we can approximate one with only an O(n log n) runtime. These simulations lend themselves well to parallelism; however, at present (2012-02), the sequential version of this program (using Data.Vector) still runs faster than its parallel equivalent (using DPH). The good news is after working on DPH vectorisation for 6 months, the asymptotic complexity of the DPH version now matches that of the sequential one. Hooray! Well OK, so there's still pesky constant factor involved. The sequential version is still 10 times faster, but if Ben can improve DPH enough to make the parallel version faster by August, he's buying everybody beer. See his blog post for more details, as well as Manuel Chakravarty's slides on vectorisation and the shared data problem.

Blogs and packages

Making OpenCL™ Simple with Haskell (Jan 2011)

Benedict R. Gaster published a set of slides (53 slides) presenting the results of his collaboration with Garrett Morris (AMD intenern and PSU PhD student). “It's a heterogeneous world” out there, with a modern platform including some CPUs, GPUs, and more. OpenCL is a framework for writing parallel programs on this mixture of devices. Benedict's slides introduce OpenCL, and talk about the low-level Haskell FFI to OpenCL C, and the use of monads and Template Haskell quasiquotation so that we can write something higher-level and more Haskelly. See the slides also for some comments about Haskell at AMD.

Propane: Functional synthesis of images and animations in Haskell

Following up on his quasicrystal code, Keegan McAllister has released Propane, a library for functional synthesis of images and animations.

Propane uses Repa for data-parallel array computations. That means it automatically uses multiple CPU cores for rendering, provided the program is compiled and run with threads enabled. That said, it's not yet been optimized for speed in other ways.

Black-Box Reverse Engineering ZMQ

ØMQ (ZeroMQ) and the ZMTP protocol provide a high performance messaging layer for concurrent/distributed applications. Unfortunately, for Xavier Lange, “the ZMQ spec is not fully cooked”. This he discovered when trying to use libzmq and find tha the data sent by it doesn't quite match the specification. So what's a ZMQ implementor to do? In this blog post, Xavier inserts a packet sniffer between his Haskell code and a pure C libzmq test app and teases apart the “real” protocol as implemented by the library. He'll be studying the (LGPL) code next to see if he got the right idea from his black box debugging efforts.

stm-channelize 0.1.1 (20 Jan)

This package arose from Joey Adam's frustrations with asynchronous IO in Haskell for a networking layer he was writing. The package turn I/O operations into STM transactions on channels.

stm-conduit-0.2.1 (9 Feb)

Clark Gaebel invites us to check out his stm-conduit package, which “introduces conduits to the wonderful world of concurrency”:

My package solves the common problem of constant bottleneck switching loaders have. This is when, for example, we stream XML from the disk and then parse the XML in one conduit pipeline. While it streams a file from the disk, the process is clearly IO bound, and while it parses the XML, the process is CPU bound. By putting each task on its own thread, the disk IO doesn't need to wait for the CPU to parse a document before loading the next file. By using stm-based conduits, we have full resource utilization.

Mailing lists

Parallelism

Help wanted! Optimizations and parallel execution in the IO for a small spellchecker (15 Feb)

Paul Sujkov wrote a simple spellchecker that checks for words that aren't in a dictionary file. He's used it to benchmark the various Haskell hash table implementations on hackage, and now wants to see if he can take things a step further with parallelisation. He's tried using monad-par and the parallel packages both, but seems to have performance trouble due to the IO part of the problem. Any tips for Paul?

SMP parallelism increasing GC time dramatically (9 Jan)

Mikolaj Konarski followed up on a discussion started by Tom Thorne a while ago. Tom was nagged (in stereo!) to ThreadScope his performance problems, which he would have done had ThreadScope not crashed. Ben Gamari thinks he may have experienced something similar. What about you? Has ThreadScope ever crashed on you?

Data Parallel Haskell on MacOS X (and GHC 7.4) (15-21 Jan, 9 Feb)

Mukesh Tiwari has been a series of problems getting dph to work on MacOS X Lion. He has gotten messages like “Couldn't figure out LLVM version” and “Error running clang!” and more recently, the fact that -fdph-par is no longer recognised by GHC.

Some solutions (Brandon Allbery and Carter Schonwald) seem to be to learn that the LLVM that comes with your Mac is not a full distribution, and that you would need to install one (eg. brew install llvm for Homebrew users).

As for GHC 7.4, the folks at UNSW have not had a chance to update the DPH wiki. In the meantime, try Ben Lippmeier's suggestion to use -package dph-lifted-vseg to select the backend, and also look at the dph-examples cabal file to see what flags to use when compiling.

Pruned Sparks in Original Strategies (17 Jan)

Burak Ekici is trying to parallelise a sampling of programs to see how first and second generation Strategies compare. In his understanding, the garbage collection mechanism in this older version had an issue which prevented sparks from being pruned. Yet he finds sparks being pruned anyway with the original version. Any reasons why this might be the case?

Speculative Parallelism (30 Jan)

Burak was also thinking about the idea that second generation strategies can exploit speculative parallelism where the original could not. How can he convince himself of this? Does it indicate speculative parallelism when all and the same number of created sparks are converted by both strategies?

Evaluating parallel computations in order of finishing (approximately) (6 Feb)

Victor Miller wants a way to evaluate a list of computations in parallel, returning them in (roughly) the order of completion. Does something of that sort exist? It does! Ryan Newton added that this is a common non-deterministic merge operation used in stream-processing frameworks. You might have an IO action like parCompletionOrder :: [a] -> IO [a], but one which returns immediately allowing the list to be lazily consumed. The function Control.Concurrent.Chan.getChanContents should be helpful in implementing such a function.

Merging two sorted list in parallel ( dph ) (29 Jan)

Mukesh Tiwari has two sorted lists of triples that he wants to merge in parallel. The triples have the form (i,j,val), and are sorted by their j and i elements. If both lists have an element with the same j/i, we take the sum of their val. Mukesh has implemented half of a merging algorithm involving binary search, but he's stuck on updating his DPH arrays. Is there a function that does something like the below?
```
updateWithFun :: (a -> a -> a) -> [:(Int,a):] -> [:a:] -> [:a:]
```

repa nested sequential warning (27 Jan)

Rick got some warnings in Repa about “performing nested parallel computation sequentially“ with the recommendation to use the force function. See the thread for some specific questions you might have about Repa and forcing arrays. The overall answer from Ben Lippmeier is to use Repa 3 if possible, which should resolve the problems Rick was facing and also provide some less confusing terminology (eg. renaming force to computeP)
how to write this with Control.Concurrent.Strategies ? (24 Jan)

Johannes Waldmann would like to be able to say to his students “mergesort already is an inherently parallel algorithm” and "you can execute it as such - you just need to annotate the source text". He has an implementation written in the par/pseq style but wants to know the “right” way to express it with Strategies. Patrick Maier posted an attempt which gets similar speedups to Johannes' original implementation. For comparison, Simon Marlow added a version using monad-par. Same problem, three techniques! This posting could be useful if you're just getting started learning about parallelism.

Also, Chris Brown observed that merge sort is heavily sequential at the merge point and does not parallelise very well. But Johannes is only really looking for mild parallelism, enough to keep 4 to 8 cores busy. For people wanting to get a lot of paralellism out of merge sort, Chris Brown suggests a technique using balanced binary trees, and Ryan Newton a parallel divide-and-conquer (during merge phase), as used in the cilkmerge function in the Cilk sorting code.

Concurrency

STM: "nested atomically" error (12 Jan)

Johan Brinch is getting a “Control.Concurrent.STM.atomically was nested” error and can't understand where it's coming from. There is some unsafe IO in there, but only in the form of debug messages. These innocuous-looking debugs may be the culprit, according to Andrew Coppin and Brandon Allberry. Their respective guesses are that (1) something to be evaluated that wouldn't be otherwise or (2) that something is being lazily evaluated as usual, but perhaps being forced within a transaction. Brandon adds in particular that the GHC IO system uses STM internally, which might explain nested transactions in the presence of unsafePerformIO. These sorts of shenanigans are normally prevented by the type system, but all bets are off once you go unsafe.

STM atomic blocks in IO functions (13 Jan)

Rob Stewart would like to know if he would still be guaranteed thread safety if he had some separate atomically transactions in an IO action, as opposed to one big block. Bryan O'Sullivan replied that “[i]f you want successive operations to see a consistent state, they must occur in the same atomically block.”

This confused Ketil Malde. Isn't the idea for state to be consistent on entry/exit of each block, that breaking a program into multiple transactions is fine so long as each transaction is semantically complete unit? Steffen Schuldenzucker offered a clarification, that the “consistent state” Bryan was talking about is being sure that no other thread has modified a value during the span of the block.

Rob had a follow-up question: is it possible to perform IO within an STM block? In a word, no. Daniel Waterworth explains that it would not make sense to allow this, because an STM transaction may be retried, which is typically not what you want to happen to your IO actions. If it's not possible to do with out, TwilightSTM or some sort of locking solution with a TMVar might be options. But usually you can factor the IO out so that it is not necessary to do within transactions.

How to make asynchronous I/O composable and safe? (14 Jan)

Joey Adams is “not happy with asynchronous I/O in Haskell. It's hard to reason about, and doesn't compose well”. He was grappling with several issues simultaneously (hopefully fodder for the next revision of Simon Marlow's Concurrent Haskell tutorial).

The setup is that Joey is writing a networking layer an application. Writing a simple client/server to test this layer turned out to be horrible on the server side (“an abomination”). The main problem seems to be that he is forced to use two threads per client (sending and receiving), because GHC does not provide a way to simultaneously wait for input and wait for an STM transaction to succeed. Issues that arise from this are
- synchronizing the threads (e.g. when one dies, kill the other one)
- multiple places where an exception can arise (which can lead to problems like exceptions interrupting atomic operations)
- sharing a connection handle between multiple threads (you may recognise this problem if you've seen garbled/interleaved output from your concurrent code, eg. “AnOonteh esre nsteenntceen.c e.”
Some of the discussion that came up went over using channels to ensure the use of whole messages only (avoiding the issue with exceptions) and help “separate [the] application thread from the complexities of connection management“ (Bardur Arantsson, Daniel Waterworth); the observation that a pair of input/output channels is essential a stream processor arrow (Peter Simons); the possibility of using 0MQ to handle all the nasty details and get a simple message-based interface; some general thoughts on dealing with IO (Peter Simons thinks that one central IO loop plus callbacks is better than spreading IO all over programs); and the prospect of a wait-free concurrency model based on vats (David Barbour); and the avoidance of asynchronous exceptions except as a last resort (David, Joey).

Joey eventually followed up by writing an stm-channelize package that creates an STM layer over a network connection.

Exceeding OS limits for simultaneous socket connections (30 Jan)

Rob Stewart has a dozen machines sending messages to each other in a master/slave setup. The machines are running a Haskell program that sends thousands of messages (lazy bytestrings) to a master node in short succession. He sometimes gets an “accept: resource exhausted (Too many open files)” message because the master node exceeds its limit for simultaneous socket connections. Is there a good way to deal with this problem? Surely, it's something the Haskell web frameworks have had to contend with?

Marc Weber suggests (A) to limit the number of simultaneous connections accepted by controlling the number of threads that can run simultaneously (see his mail for a sketch how), and (B) to use strict IO on the handle and close it himself, which Donn Cave points out will only reduce the size of the problem but not make it go away entirely. Best to stick with option (A).

Memory usage with TVar? (1 Feb)

Johan Brinch gets a surprising amount of memory usage from a tiny program that just cons'es a bunch of numbers onto a TVar [(Int,Int)]. Where is it all coming from? Jake MacArthur points us to a StackOverflow posting on the memory footprint of Haskell data types, apparently nothing STM specific so much as the result of having lots of cons cells, boxed Ints, and tuples.

Concurrency strategy for 2 threads and rare events (8 Feb)

JP Moresmau is looking for a good strategy to deal with a setup where he has one worker thread, and another user input thread ocassionally telling things to the worker thread. JP supposes he could just try a tryTakeMVar, but the use of MVar's seems a bit heavyweight to him, as the second thread is just blocking for user input most of the time. What's a good setup? Erik Hesselink suggests maybe a throwTo. Yves Parès wondered what JP means by overkill; MVars are both “teeny tiny” and convenient (they are!), perfect for the job. Edward Amsden suggested an IORef modified by atomicModifyIORef, cheaper than an MVar and usable when you don't have to lock multiple threads.

How to increase performance using concurrency for sequential producer-consumer problem (13 Feb)

Roel van Dijk sketches out a program for us: producer produces values and feeds them to a callback function (blocking until there are values to produce), converter does CPU-only work (and is the bottleneck of the program), consumer consumes values (important: in the order they were produced). A non-concurrent driver for these may look like producer (consumer . converter). OK now what if Roel wants to make use of concurrency, ideally spawing off a worker thread for each core in his system? The important thing here is that the consumer continues to get values in the order they come out of the producer.

Some suggestions: conduits and the newly released stm-conduit package (Clark Gamari), monad-coroutine (Mario Blažević), bounded STM channels (John Lato). John points out that while something like iteratee/conduit-stm would give you concurrency between the producer/converter/consumer, but would not directly help with the concurrency between converters.

New distributed-process (Cloud Haskell) backend/transport design and prototype

Duncan Coutts followed up on his design proposal for the Cloud Haskell transport design. As mentioned in the Parallel GHC news above, we've been working on doing some prototyping, and now have code for people to look at and play with. Check out the GitHub page and design document for distributed-process.

Round-robin scheduling and forkIO (2 Feb)

Andreas Voellmy wanted to check what happens when you forkIO a new thread, if it it executed in the same HEC as the original or in some new thread. He points to Runtime support for multicore Haskell, which says that threads within the run queue of a single HEC are scheduled in round-robin order; but does not talk about the threads as a whole. This can lead, Andreas suggests, to unfair scheduling of threads if you have the situation where an HEC has a single always-runnable thread, and another HEC has lots of threads (see his follow-up for example code illustrating the problem). Simon Marlow both that threads are executed in the current HEC and that the described situation can happen (“It wouldn't be too hard to do something more sensible here”). Simon suggests looking in rts/Schedule.c, particularly the function schedulePushWork ().

Synchronizations in memory allocation? (11 Feb)

Andreas has a program with several Haskell threads each reading a different TCP sockets. He's investigating a problem where the performance of his program starts to degrade after 20 cores. Andreas suspects a scaling problem in allocating memory and has written a micro-benchmark to explore the issue. The program forks some worker threads that just allocated some memory without trying to synchronise or share any data. He is surprised to find that the benchmark takes longer using mallocPlainForeignPtrBytes the more threads he forks. The benchmark makes for a good stress test — it allocates over 10 Gb/s — and as result of Andreas' investigations, Simon Marlow has modified the pinned memory allocator to “steal complete blocks from the nursery rather than allocating them from the global block allocator.” The changes will show up in GHC 7.4.2.

20 cores, eh? There's no pleasing some people.

StackOverflow and Reddit

Help and Feedback

If you'd like to make an announcement in the next Haskell Parallel Digest, then get in touch with me, Eric Kow, at parallel@well-typed.com. Please feel free to leave any comments and feedback!