Parallel Haskell Digest 4
Friday, 22 July 2011, by Nicolas Wu.
Filed under parallel, phdigest.
Hello Haskellers!
It's time for the fourth edition of the Parallel Haskell Digest, bringing you a summary of the news and discussions of parallelism and concurrency in Haskell. The digest is made possible by the Parallel GHC Project.
News
The Monad.Reader: special issue on parallelism and concurrency
In the words of The Monad Reader itself:
Whether you're an established academic or have only just started learning Haskell, if you have something to say, please consider writing an article for The Monad.Reader! Issue 19 will be a special issue focusing on articles related to parallelism and concurrency, construed broadly. The submission deadline for Issue 19 will be: Tuesday, August 16.
Threadscope Implementor's Summit
The threadscope implementor's summit was held this month at Microsoft Research, Cambridge. The summit brought together developers who are currently working with Threadscope, whether that be hacking on generating the events that are emitted by GHC for analysis in Threadscope, using the event trace that is produced for detailed profiling information, or working on improving Threadscope itself to provide better tools for parallel profile analysis.
The meeting was full of ideas, and covered topics such as: adding extensions to the current eventlog format to enable additional information to be tagged to events; improving the visualisation of information in threadscope; formalising the transitions of thread states into a finite state machine; and matching up executed code with corresponding source locations. With all this food for thought, we should expect plenty of interesting work in this area.
Word of the Month
In this issue we have two words of the month: par and pseq.
Haskell provides two annotations, par
and pseq
, that allow the programmer to give hints to the compiler about where there are opportunities to exploit parallelism. While these annotations are not typically used directly by programmers, it is useful to understand them because they are the underlying mechanism for higher level tools such as "parallel strategies" and the Eval monad.
The two combinators have the following signatures:
par :: a > b > b pseq :: a > b > b
While their signatures are the same, they are used to annotate different things. The par
combinator hints to the Haskell implementation that it might be beneficial to evaluate the first argument in parallel. However, since Haskell does not impose an evaluation order, we also need pseq
, which instructs the compiler to ensure that its first argument is evaluated before the second.
Let's take a look at an example inspired by Parallel Performance Tuning for Haskell by Jones, Marlow, and Singh, which illustrates this nicely. Suppose you're interested in the sum of two expensive computations. To keep things simple, we'll use a naive implementation of fib
(the point here isn't to have an efficient computation, I'm trying to show an expensive one):
fib :: Int > Int fib 0 = 0 fib 1 = 1 fib n = fib (n1) + fib (n2)
For a second expensive computation, we'll calculate the negafibonacci number, which works on negative numbers:
negafib :: Int > Int negafib 0 = 0 negafib (1) = 1 negafib n = nfib (n+2)  nfib (n+1)
The sum of these two can be calculated by the following sequential function:
sumfib :: Int > Int sumfib n = x + y where x = fib n y = negafib (n)
There's obvious room for improvement here when we have two cores: we simply calculate the expensive computations on separate cores. Annotating the code above is a fairly simple process. We first use par
to annotate the fact that x
must be calculated in parallel with the rest of the computation. Second, we ensure that y
gets computed before x + y
by annotating with pseq
. The result is as follows:
import Control.Parallel (par, pseq) psumfib :: Int > Int psumfib n = x `par` (y `pseq` x + y) where x = fib n y = negafib (n)
We can write a simple program that outputs the result of running this computation with the following main
function:
main :: IO () main = putStrLn . show . sumfib $ 37
We should hope for the parallel version to work twice as fast, since the two expensive functions should take about the same time to compute. Here's the output of compiling and running the sequential version of the program:
$ ghc rtsopts Main.hs [1 of 1] Compiling Main ( Main.hs, Main.o ) Linking Main ... $ time ./Main real 0m6.113s user 0m6.090s sys 0m0.010s
Now replacing sumfib
with psumfib
produces the following results:
$ ghc rtsopts threaded Main.hs [1 of 1] Compiling Main ( Main.hs, Main.o ) Linking Main ... $ time ./Main +RTS N2 real 0m3.402s user 0m6.660s sys 0m0.040s
This is obviously a very trivial example, but the point is that annotations provide a powerful way of expressing parallel algorithms. It's interesting to note that for this simple program, the timings for the parallel version on a single core performs as well as the single core version compiled without threading.
While annotations are a simple mechanism for expressing where parallelism might be exploited in a program, beware that there are a number of pitfalls to using this technique: all that glitters is not gold! The main difficulty in using par
and pseq
directly is that you really need to have a clear understanding of evaluation order. In particular, unless you understand what laziness is doing to the evaluation order, then you might find that the computations you're sparking off with par
might not occur when you expected they should.
Then there are all the general difficulties that you face with parallel programming, like getting the right granularity of work to do in parallel. Using par
is quite lightweight so can be used to exploit reasonably fine grained parallelism, but it is certainly not free. Finally, parallel performance suffers when there are too many garbage collections, so keeping this to a minimum by either using more efficient data structures or increasing available memory, becomes an important factor.
Nevertheless, it's well worth having a play with par
and pseq
. The next step after that is to look at parallel strategies. Strategies is a layer of abstraction on top of par
and pseq
. You might like to to read Seq no more: Better Strategies for Parallel Haskell, by Simon Marlow et al. which describes Strategies and the Eval monad. It's all available in the parallel library on Hackage. More recently, the Par
monad has also been introduced as yet another way of describing parallel evaluations. These key topics will no doubt feature in a future word of the month, so stay tuned!
Parallel GHC Project Update
The Parallel GHC Project is an MSRfunded project to push the realworld use of parallel Haskell. The aim is to demonstrate that parallel Haskell can be employed successfully in industrial projects. This month we're having a guest column from the team over at Los Alamos National Laboratory, one of the partners involved in the project (you can see the full details in report LAUR 110341). They have been working on writing Monte Carlo physics simulations in Haskell, which has given them high levels of parallelism, along with useful tools for abstraction. So, without further ado, over to Michael Buksas from LANL:
Our goal is to build highly efficient Monte Carlo physics simulations using parallel Haskell. We're focusing on SMP performance though some combination of explicit threading and pure parallel annotations.
The Monte Carlo approach involves randomly sampling the space of solutions to generate data which contributes to the solution. For these physical problems, our samples are the tracks of particles as they move through space, interacting with a physical material as they go. Data collected from each particle trajectory is then combined into information needed to compute the solution. For example, the detailed information about the particle's interaction with the material is collected into a collective effect on the material properties.
To date, we have a code base which includes two approaches to the problem. One is a specific and paralleltuned application code targeting relativistic neutrino transport in stellar atmospheres. The other is building a more general environment for creating specific applications, such as this one.
We recently presented to our colleagues in LANL some preliminary results on the parallel performance of the targeted application code.
To give a sense of the approach to parallelization in this code, consider these highlevel functions from an earlier serial version:
main :: IO () main = do (n, rest) < parseCL let tally = runMany infMesh simpleMat n writeTally "tally" tally runMany :: Mesh > Material > Word32 > RNG > Tally runMany msh mat ntot rng = let ps = genParticles ntot msh rng tallies = map (runParticle msh mat) $ ps in foldl' merge emptyTally tallies
And consider the following changes for the parallel version:
main :: IO () main = do (n,sz) < parseCL let tally = feed infMesh simpleMat n sz prand writeTally "tally" tally feed :: Mesh > Material > Word32 > Word32 > RNG > Tally feed msh mat ntot chunkSz rng  ntot <= chunkSz = runMany msh mat ntot rng  otherwise = t `par` (ts `pseq` (merge t ts)) where t = runMany msh mat chunkSz g1 ts = feed msh mat (ntot  chunkSz) chunkSz g2 (g1,g2) = split g
We've wrapped function runMany
in feed
, which partitions the collection of generated particles into groups of size chunkSz
, and issues these particles to runMany
in parallel.
With this simple change, we seeing upwards of 80% utilization of up to 8 cores, for a performance improvement greater than a factor of 6. We believe that performance can be further improved with different strategies for breaking down the work, and looking for additional parallelization opportunities in the collection of results.
Our other branch of development is focused on finding useful abstractions and highlevel functions to support programming a variety of Monte Carlo problems of this kind. We have identified a few such useful abstractions, and implemented them as type classes and type families.
For example, Space
is a general term for the physical space and imposed symmetries in which we can perform a simulation. We express this as follows:
class Space s where type Position s :: * type Direction s :: * stream :: s > Distance > s position :: s > Position s direction :: s > Direction s make :: Position s > Direction s > s
and implement specific spaces, such as one with the symmetry of the unit sphere:
instance Space Spherical1D where type Position Spherical1D = Radius type Direction Spherical1D = Normalized Vector2 stream (Vector2 x y) (Distance d) = Vector2 (x+d) y position s = Radius $ vmag s direction s = normalize s make (Radius pos) dir = pos * (normalized_value dir)
This allows the specific space data types to be used in a variety of contexts. Using ordinary parametric polymorphism is also effective:
  Stream a single particle: stream :: (p > (e,p))  ^ Function to produce each step. Comes from a model. > (e > Bool)  ^ Check for terminal events to stop streaming > p  ^ Initial particle > [(e, p)]  ^ Resulting list of events and particle states. stream stepper continue p = next p where next p = let (e, p') = stepper p in (e, p') : if continue e then next p' else []
The above is our highlevel routine function for generating a history from a single particle, recorded as a list of (event, particle) pairs, where the event and particle data types are provided for each problem.
Blogs, Papers, and Packages
Fun with parallel monad comprehensions (19 July 2011)
This blog post by Thomas Petricek featured in the Monad Reader 18, and covers some of the interesting things that can be achieved with monad comprehensions when viewed from a parallel perspective. Along the way, he deals with examples such as the parallel composition of parsers.
Parallelizing a nonogram solver (05 July 2011)
Jasper Van der Jeugt detailed his implementation of a parallel nonogram solver. Nonograms also go by the name of Paint Sudoku: the aim is to colour in a grid where a list of numbers is given for each row and column and these numbers indicate consecutive runs of filledin squares in the corresponding row or column. For large puzzles, grids that are 20x20, Jasper reports that on a dual core machine his a parallel algorithm reduces execution by 37.9% compared to its sequential counterpart.

Edward Z. Yang has written a series of posts discussing IVars, which are immutable variables which are a writeonce, readmany (these are particularly handy for communicating results from a child process to its parent). Edward's post outlines the difficulties involved in defining a monad for IVars.
Map reduce as a monad (05 July 2011)
Julian Porter wrote an article for The Monad Reader 18 about how MapReduce could be expressed as a monad. The MapReduce framework finds its roots in functional programming, and this is an interesting take on the problem.
Mailing list discussions
NVIDIA's CUDA and Haskell (5 July 2011)
Vasili Galchin was wondering whether or not there had been any efforts to build bridges between NVIDIA's CUDA and Haskell. Don Stewart was quick to respond with a number of links to active work in the area:
Trevor McDonell noted that the accelerate package was best accessed from the source repository on github, and that the CUDA bindings hadn't yet been tested or updated for the latest toolkit release.
Unbelievable parallel speedup (3 June 2011)
While reading Simon Marlow's tutorial on parallel and concurrent programming, John Ramsdell reported some remarkable (slightly superlinear!) performance gains for one of his programs. Thomas Schilling guessed that this was due to the large variance in the figures reported, but went on to describe how it might be possible to obvserve such performance boosts due to reduced local cache misses when using several cores. Without more information about the program in question, it's difficult to do any kind of diagnosis, but nevertheless, it's great to hear about good results from a happy Haskeller!
Automatic Reference Counting (2 July 2011)
After hearing about the new static analysis tools in Clang that does automatic reference counting (ARC), Thomas Davie was wondering if some compiler gurus might be able to comment on the applicability of this kind of analysis to Haskell, as an alternative to garbage collection. This led to an enlightening discussion about reference counting versus garbage collection.
Haskell on NUMA (16 June 2011)
Michael Lesniak was wondering what the state of parallel performance of Haskell on NonUniform Memory Access (NUMA) machines was like, since he's having problems and can't find much useful information online. Nobody seems to have answered this one, are there any suggestions?
Parallel compilation and execution? (26 May 2011)
Michael Rice was trying to figure out how to compile and run a simple program that outputs the result of a parallel fibonacci algorithm. After a quick reminder to use
pseq
rather thanseq
to force sequential evaluation, Daniel Fischer suggested that recompilation might be required, and that passingfforcerecomp
would be a good way to ensure that this occured.Michael was also keen to know whether Control.Parallel was comparable to OpenMP. Alex Mason gave a detailed reply and gave an example of parallel mergesort as a means of comparison.
parMap doesn't work fine (12 May 2011)
After just starting out with parallel computations in Haskell, Grigory Sarnitskiy ran into troubles making parMap work with lazy structures. To resolve these issues, Brandon Moore pointed to using
rdeepseq
, and Maciej Piechotka suggesteddeepseq
.efficient parallel foldMap for lists/sequences (17 June 2011)
Sebastian Fischer reposted his question about efficient parallel
foldMap
for lists to the parallel mailing list. In essence he was seeking an efficient implementation offoldMap
, where a list is folded into a single value before a map is applied to the result. Johannes Waldmann advised against using ordinary lists, and mentioned that he was usingData.Vector
instead. Additionally, he recommended switching to a sequential fold once a parallel fold had been used to a certain depth. Christopher Brown further confirmed that it was a good idea to spark off computations when the granularity is high enough to make it worthwhile, and also mentioned that it was best to spark computations that were evaluated to normal form.Wanted: parallel Haskell tutorial/talk/demonstration in Leipzig, Germany, October 7 (8 July 2011)
Johannes Waldmann is looking for volunteers who might be able to present at their local Haskell Workshop, and welcomes submissions on parallel and distributed computing using Haskell. The submission deadline is 20 August.
Stack Overflow
 How to write nested loop problem using parallel strategies in Haskell
 How to measure sequential and parallel runtimes of Haskell program
 Poor performance / lockup with STM
Help and Feedback
If you'd like to make an announcement in the next Haskell Parallel Digest, then get in touch with me, Nicolas Wu, at parallel@welltyped.com. Please feel free to leave any comments and feedback!