Increasing parallelism in a frequency counter

Hi everyone,

I've started playing with some parallel programming in Haskell, mostly by applying the first few examples from Simon Marlow's book to my own toy problem: https://github.com/apauley/parallel-frequency

I'm quite pleased with the speedup I have so far: the fastest parallel implementation is just over twice as fast as its sequential counterpart. But looking at the ThreadScope images from the generated event logs I can see that there are still significant portions of my program that run on only one core.

Is anyone interested to see if we can increase the parallelism in this frequency counter?

Kind regards,
Andreas

--
https://keybase.io/apauley
http://za.linkedin.com/in/apauley
http://www.meetup.com/lambda-luminaries
http://www.meetup.com/Bitcoins-Baby-Bitcoins
GPG Public Key: https://keybase.io/apauley/key.asc

Hi Andreas, all,

On 21/05/15 22:08, Andreas Pauley wrote:
Hi everyone,
I've started playing with some parallel programming in Haskell, mostly by applying the first few examples from Simon Marlow's book to my own toy problem: https://github.com/apauley/parallel-frequency
Nice.
I'm quite pleased with the speedup I have so far, the fastest parallel implementation is just over twice as fast as its sequential counterpart.
But looking at the ThreadScope images from the generated event logs I can see that there are still significant portions of my program that run on only one core.
Is anyone interested to see if we can increase the parallelism in this frequency counter?
If the problem is specified as taking the N most frequently occurring items in a list:

    f :: Ord a => Int -> [a] -> [(a, Int)]

then I cheated: I assume the input is pre-chunked into a sensible number of pieces:

    f' :: Ord a => Int -> [[a]] -> [(a, Int)]

with which it is much easier to get good parallel speedups.

Anyway, some general optimisation points, not all directly related to the parallel part of the problem:

0. Profile to see what is taking time (and space, which increases GC time). You can use +RTS -h without having to recompile for profiling.
1. Lists are memory inefficient (eg: Map keys): replace String with Text.
2. Lists and Text are bad at random access (eg: splitting at an index).
3. Use Data.Map.Strict to avoid building up (+) thunks.
4. (take n . sort) is possibly taking a lot of sequential time (see 0).
5. Perhaps try parallel reductions instead of linear folds:

       a * b * c * d * ... -> ((a * b) * (c * d)) * ...

   though the overhead didn't seem worth it in my tests.
6. Not optimisation related, but consider writing a defaultMain; then each executable source can be reduced to:

       import DefaultMain (defaultMain)
       import FrequencyImplementation (frequency)

       main :: IO ()
       main = defaultMain frequency

   You could even have all the implementations with the same module name but in different directories, and use hs-source-dirs: to select; then you would be able to use an identical Main module.

More on 2: a bit of a low-level hack, this. I read the whole input file as a ByteString and split it into an appropriate number of chunks, taking care not to split UTF-8 sequences, and also taking care not to split mid-word. Each output chunk references the original input without copying the data. Then I decodeUtf8 each ByteString chunk to a Text. This reduced peak memory usage (as reported by +RTS -h, graphed with hp2ps) from ~400MB of (:) to ~30MB of ARR_WORDS for the 11MB artamene.txt.

More on 4: so you have a large Map k Int containing frequencies, and you want the largest n frequency counts. There is an O(1) Map.splitRoot, which you can use to parallel divide and conquer: the largest n of the whole map are going to be among the (n * split count) combined largest n of each of the split maps. This didn't give me much speedup at all, though.

I put my implementation at http://code.mathr.co.uk/word-histogram

Thanks,
Claude

--
http://mathr.co.uk
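The pre-chunked f' above can be sketched roughly as follows. This is a minimal sketch in the spirit of Claude's points 3 and 5, not code from either linked repository; the helper name countChunk is my own, and it assumes the parallel, deepseq and containers packages:

```haskell
import Control.DeepSeq (NFData)
import Control.Parallel.Strategies (parMap, rdeepseq)
import Data.List (sortBy)
import Data.Ord (Down (..), comparing)
import qualified Data.Map.Strict as M

-- Count occurrences within a single chunk.  Data.Map.Strict keeps the
-- counts evaluated as they accumulate, avoiding a build-up of (+) thunks.
countChunk :: Ord a => [a] -> M.Map a Int
countChunk = M.fromListWith (+) . map (\x -> (x, 1))

-- The pre-chunked variant: the n most frequent items, given input already
-- split into a sensible number of pieces.  Each chunk is counted in
-- parallel (one spark per chunk), then the per-chunk maps are merged.
f' :: (NFData a, Ord a) => Int -> [[a]] -> [(a, Int)]
f' n chunks = take n (sortBy (comparing (Down . snd)) (M.toList merged))
  where
    perChunk = parMap rdeepseq countChunk chunks
    merged   = M.unionsWith (+) perChunk
```

To actually use multiple cores, compile with -threaded and run with +RTS -N.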
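The "more on 2" chunking trick can be sketched like this (my own simplified take, not Claude's actual code: it cuts only at ASCII spaces, which are never part of a multi-byte UTF-8 sequence, so chunks stay valid UTF-8 and words stay whole; the name chunkBytes is hypothetical):

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Split a ByteString into roughly n chunks, cutting only at spaces so
-- neither UTF-8 sequences nor words are broken.  Each chunk is a slice
-- of the original buffer; no byte data is copied.
chunkBytes :: Int -> B.ByteString -> [B.ByteString]
chunkBytes n bs = go bs
  where
    target = max 1 (B.length bs `div` max 1 n)
    go rest
      | B.length rest <= target = [rest | not (B.null rest)]
      | otherwise =
          case BC.elemIndex ' ' (B.drop target rest) of
            Nothing -> [rest]                       -- no safe cut point left
            Just i  ->
              let cut = target + i                  -- index of the space
              in B.take cut rest : go (B.drop (cut + 1) rest)

-- Decode each chunk to Text only after splitting.
decodeChunks :: Int -> B.ByteString -> [T.Text]
decodeChunks n = map TE.decodeUtf8 . chunkBytes n
```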
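The splitRoot divide-and-conquer for "more on 4" might look something like this. A sketch under my own naming (topN, parallelTopN), not the word-histogram code; splitRoot needs containers >= 0.5.4:

```haskell
import Control.DeepSeq (NFData)
import Control.Parallel.Strategies (parMap, rdeepseq)
import Data.List (sortBy)
import Data.Ord (Down (..), comparing)
import qualified Data.Map.Strict as M

-- Largest n entries by count from one (sub)map.
topN :: Int -> M.Map k Int -> [(k, Int)]
topN n = take n . sortBy (comparing (Down . snd)) . M.toList

-- Split the map at its root in O(1), take the top n of each piece in
-- parallel, then pick the overall top n from the combined candidates:
-- the true top n must be among the per-piece top n's.
parallelTopN :: NFData k => Int -> M.Map k Int -> [(k, Int)]
parallelTopN n m =
  let pieces     = M.splitRoot m
      candidates = concat (parMap rdeepseq (topN n) pieces)
  in take n (sortBy (comparing (Down . snd)) candidates)
```

For deeper parallelism one could recurse into the split pieces, though as noted above the speedup may not materialise.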

Hi Claude,
On 23 May 2015 at 17:12, Claude Heiland-Allen wrote:
Anyway, some general optimisation points, not all directly related to the parallel part of the problem:
Thank you so much for taking the time to implement your own version, and to explain everything you did. Your slowest sequential version is about 3 times faster than my fastest parallel version; I'm really impressed :-)

Thank you also for the general coding tips, I'll definitely do some cleanup with a defaultMain or similar.

I am busy looking at your implementation together with the optimisation points you mentioned. There is a lot for me to absorb, and I'll probably learn more from comparing your code (and the thinking behind it) with mine than I would from just reading a book or an article.

Is there a field in computer science similar to comparative literature? I find comparing different implementations of the same program very useful as a learning tool.

Kind regards,
Andreas

--
https://keybase.io/apauley
http://za.linkedin.com/in/apauley
http://www.meetup.com/lambda-luminaries
http://www.meetup.com/Bitcoins-Baby-Bitcoins
GPG Public Key: https://keybase.io/apauley/key.asc

On 24 May 2015, at 22:07, Andreas Pauley wrote:
I am busy looking at your implementation together with the optimisation points you mentioned. There is a lot for me to absorb, and I'll probably learn more from comparing your code (and the thinking behind it) with mine than I would have just reading a book or an article.
Is there a field in computer science similar to comparative literature? Because I find comparing different implementations of the same program very useful as a learning tool.
Look at exercism.io - it supports Haskell.

Andrew Butterfield
School of Computer Science & Statistics
Trinity College Dublin 2, Ireland
participants (3)
- Andreas Pauley
- Andrew Butterfield
- Claude Heiland-Allen