Weird multi-threading runtime behaviour of single-threaded program with GHC-7.0.3

Hello,

I'm experiencing an unexpected behaviour with GHC-7.0.3/x86_64's multithreading runtime when running some more demanding single-threaded computations. I've isolated the part of the larger application which exposes the problem down to the parsing of a ~8MiB JSON file which causes quite a bit of memory allocation (source at end of email). The program is compiled with

  ghc --make -O2 -threaded -eventlog -rtsopts parse_aeson.hs

and I ran it with and without the -N12 option on a 12-core machine (the problem is visible on a 4-core machine as well):

  $ time ./parse_aeson test.json
  0.000392s after readFile
  0.001016s parse was a success...
  6.050836s NFed parse-tree...
  6.050968s NFed parse-tree...
  no exception

  real    0m6.100s
  user    0m5.680s
  sys     0m0.419s

  $ time ./parse_aeson test.json +RTS -N12
  0.001249s after readFile
  0.001777s parse was a success...
  10.048607s NFed parse-tree...
  10.048738s NFed parse-tree...
  no exception

  real    0m10.125s
  user    0m57.496s
  sys     0m39.927s

When observing top(1)'s output, I saw that all HECs consumed a substantial amount of CPU cycles, so I loaded up ThreadScope to see what the remaining 11 HECs were doing all the time. I've put the resulting eventlog files and screenshots of ThreadScope's visualization online at

  http://www.narf.at/~hvr/Haskell/parse_aeson.N0.eventlog.png
  http://www.narf.at/~hvr/Haskell/parse_aeson.N0.eventlog.xz
  http://www.narf.at/~hvr/Haskell/parse_aeson.N12.eventlog.png
  http://www.narf.at/~hvr/Haskell/parse_aeson.N12.eventlog.xz

What's happening there? The actual processing work seems to be done in a single HEC... but what are the remaining 11 HECs doing exactly? Am I doing something wrong?

What does it mean when the GC bars in ThreadScope become green, as in the middle of the screenshot below?
  http://www.narf.at/~hvr/Haskell/parse_aeson.N4.eventlog.pdf

-------------------------------------------------------------------------
{-# LANGUAGE BangPatterns #-}
-- also available at http://www.narf.at/~hvr/Haskell/parse_aeson.hs
module Main where

import           Control.DeepSeq
import           Control.Exception
import           Data.Aeson
import           Data.Time
import           System.Environment
import qualified Data.Attoparsec.Lazy       as APL
import qualified Data.ByteString.Lazy.Char8 as LB

parseIt :: FilePath -> IO ()
parseIt fn = do
  t0 <- getCurrentTime

  let putStrLn' !s = do
        t0' <- getCurrentTime
        putStrLn $ show (t0' `diffUTCTime` t0) ++ " " ++ s

  json_text <- LB.readFile fn
  putStrLn' "after readFile"

  let Success p = decode json_text :: Result Value
  putStrLn' "parse was a success..."

  putStrLn' $ p `deepseq` "NFed parse-tree..."
  putStrLn' $ p `deepseq` "NFed parse-tree..."

decode :: FromJSON a => LB.ByteString -> Result a
decode = either Error fromJSON . APL.eitherResult . APL.parse json

main :: IO ()
main = do
  fns <- getArgs
  case fns of
    [fn] -> do
      res <- try $ parseIt fn
      case res of
        Left e   -> putStrLn $ "caught: " ++ show (e :: SomeException)
        Right () -> putStrLn "no exception"
    _ -> putStrLn "usage: parse_aeson <json-file>"

I found this email thanks to the Parallel Haskell Digest, thanks Eric & Nick! On 01/04/2011 11:37, Herbert Valerio Riedel wrote:
I'm experiencing an unexpected behaviour with GHC-7.0.3/x86_64's multithreading runtime when running some more demanding single-threaded computations.
I've isolated the part of the larger application which exposes the problem down to the parsing of a ~8MiB JSON file which causes quite a bit of memory allocations (source at end of email). The program is compiled with
ghc --make -O2 -threaded -eventlog -rtsopts parse_aeson.hs
and I ran it with and w/o the -N12 option on a 12-core machine (the problem is visible on a 4-core machine as well):
$ time ./parse_aeson test.json 0.000392s after readFile 0.001016s parse was a success... 6.050836s NFed parse-tree... 6.050968s NFed parse-tree... no exception
real 0m6.100s user 0m5.680s sys 0m0.419s
$ time ./parse_aeson test.json +RTS -N12 0.001249s after readFile 0.001777s parse was a success... 10.048607s NFed parse-tree... 10.048738s NFed parse-tree... no exception
real 0m10.125s user 0m57.496s sys 0m39.927s
When observing top(1)'s output, I saw that all HECs consumed a substantial amount of CPU cycles, so I loaded up ThreadScope to see what the remaining 11 HECs were doing all the time.
I've put the resulting eventlog files and screenshots of threadscope's visualization online at
http://www.narf.at/~hvr/Haskell/parse_aeson.N0.eventlog.png http://www.narf.at/~hvr/Haskell/parse_aeson.N0.eventlog.xz
http://www.narf.at/~hvr/Haskell/parse_aeson.N12.eventlog.png http://www.narf.at/~hvr/Haskell/parse_aeson.N12.eventlog.xz
What's happening there? The actual processing work seems to be done in a single HEC... but what are the remaining 11 HECs doing exactly? Am I doing something wrong?
The answer is, they're all doing GC. When you say -N, the parallel GC is turned on, so every GC requires the cooperation of all cores. When you're running parallel code this is a big win, but for sequential code it could well be a slowdown. You probably want to turn off parallel GC for the young generation with "+RTS -qg1". If that doesn't help, try turning it off entirely: "+RTS -qg". I have experimented with making this happen automatically, but so far haven't managed to find a heuristic that improves performance consistently.
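For reference, a minimal sketch of what the suggested invocations would look like, reusing the program and file names from the original post (the timings will of course differ per machine):

```shell
# Keep parallel GC for the old generation only; young-generation
# collections run sequentially (Simon's first suggestion):
./parse_aeson test.json +RTS -N12 -qg1

# If that doesn't help, disable parallel GC entirely:
./parse_aeson test.json +RTS -N12 -qg
```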
What does it mean when the GC bars in ThreadScope become green, as in the middle of the screenshot below?
The green bit is where the GC is traversing the heap, as opposed to the other administrative tasks that the GC does (such as waiting for the other HECs to stop, dealing with weak pointers, freeing memory, etc.). The green bars let you see when one core is doing most of the work during parallel GC, which can be an indication that your data structures are mainly linear (like lists) rather than tree-like.

Cheers,
Simon

Hello Simon, On Fri, 2011-06-17 at 09:05 +0100, Simon Marlow wrote:
What's happening there? The actual processing work seems to be done in a single HEC... but what are the remaining 11 HECs doing exactly? Am I doing something wrong?
The answer is, they're all doing GC. When you say -N, the parallel GC is turned on, so every GC requires the cooperation of all cores. When you're running parallel code this is a big win, but for sequential code it could well be a slowdown.
Speaking about cooperation of all cores... how much is the parallel GC affected by "multitasking noise" (is there a better name for it?) in the system?

There are two cases I'm thinking about:

a) Say I have an 8-core desktop workstation and run my GC-intensive (or massively parallel processing) Haskell program with "+RTS -N8", but I have a few desktop apps running and using up a bit of CPU time (on average just a few %). Does this already cause significant (i.e. measurable) synchronization delays due to 'non fully dedicated cores' in my Haskell program?

b) What about virtualized guests (e.g. with VMware, KVM, etc.)? Let's assume the host system has 16 cores, and I partition those into 2 guest VMs with 8 cores each assigned; will there be measurable/significant slow-downs due to synchronization delays in my "+RTS -N8" Haskell program?

-- hvr

On 21/06/2011 10:02, Herbert Valerio Riedel wrote:
Hello Simon,
On Fri, 2011-06-17 at 09:05 +0100, Simon Marlow wrote:
What's happening there? The actual processing work seems to be done in a single HEC... but what are the remaining 11 HECs doing exactly? Am I doing something wrong?
The answer is, they're all doing GC. When you say -N, the parallel GC is turned on, so every GC requires the cooperation of all cores. When you're running parallel code this is a big win, but for sequential code it could well be a slowdown.
Speaking about cooperation of all cores... how much is the parallel GC affected by "multitasking-noise" (is there a better name for it?) in the system?
There are two cases I'm thinking about:
a) Say I have an 8-core desktop workstation and run my GC-intensive (or massively parallel processing) Haskell program with "+RTS -N8", but I have a few desktop apps running and using up a bit of CPU time (on average just a few %).
Does this already cause significant (i.e. measurable) synchronization delays due to 'non fully dedicated cores' in my Haskell program?
Yes it can do, although since we started using 'yield' in the spinlock code the problem is less noticeable. I've been trying to address this problem with a new GC; for details see this paper:

  http://community.haskell.org/~simonmar/papers/local-gc.pdf

There are a couple of workarounds if you're badly affected:

 - Use a larger -A setting. This might also degrade performance due to more cache misses (try it and see). If your processor has lots of cache you might be able to go to -A1m or -A2m, which reduces the GC frequency without impacting cache behaviour.

 - Don't use all the cores - e.g. use -N7 on an 8-core machine.
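Sketching the two workarounds as concrete command lines, again reusing the hypothetical program/file names from earlier in the thread:

```shell
# Larger per-core allocation area: fewer (young-generation) GCs,
# at the possible cost of more cache misses:
./parse_aeson test.json +RTS -N8 -A1m

# Leave one core free for the OS and other processes:
./parse_aeson test.json +RTS -N7
```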
b) What about virtualized guests (e.g. with VMware, KVM, etc)? Let's assume the Host system has 16 cores, and I partition those into 2 Guest VMs with 8 cores each assigned; Will there be a measurable/significant slow-downs due to synchronization delays in my "+RTS -N8" Haskell program?
I haven't tried with a VM - it would be an interesting experiment!

Cheers,
Simon
participants (2):
 - Herbert Valerio Riedel
 - Simon Marlow