equivalent of EXPLAIN PLAN with GHC?

Hello, I'm quite new to Haskell, but experienced in other languages (C, Python, Ruby, SQL, etc). I am interested in Haskell because I've heard that the language is capable of lots of optimizations based on laziness, and I want to learn more about that. I dug in with Project Euler problem #1, and wrote: main = print (show (sum [x | x <- [3..999], x `mod` 3 == 0 || x `mod` 5 == 0])) So far so good, but I want to have some way of observing what optimizations GHC has performed. Most notably, in this example I want to know if the list was ever actually constructed in memory. The "sum" function only needs the elements one at a time, in order, so if Haskell and GHC are everything I've heard about them, I would fully expect the list construction to be optimized out. :) Unfortunately I was not able to see any way of examining ghc's output to determine whether it had performed this optimization. The C it produced with '-fvia-C -C' was totally unreadable -- it looked like something from the IOCCC. :( I couldn't find any way to match up any of the code it had generated with code I had written. My attempts to objdump the binaries was similarly unproductive. Is there any kind of intermediate form that a person can examine to see how their code is being optimized? Anything like EXPLAIN PLAN in SQL? It would make it much easier to understand the kinds of optimizations Haskell can perform. I'm not looking so much for profiling -- obvious this program is trivial and takes no time. I just want to better understand what kind of optimizations are possible given Haskell's language model. Thanks! Josh

Josh, In general you'll find the haskell-cafe (haskell-cafe@haskell.org) to be a more lively place for this type of discussion, but since we're here I might as well mention that memory use of a Haskell function is one of the hardest things to gain an understanding about.
main = print (show (sum [x | x <- [3..999], x `mod` 3 == 0 || x `mod` 5 == 0])) I want to know if the list was ever actually constructed in memory.
For such a simple program I suggest you test with lists of various lengths: ---------------- main with n == 999 ------------------- [tommd@Mavlo Test]$ ghc --make -O2 opt.hs [1 of 1] Compiling Main ( opt.hs, opt.o ) Linking opt ... [tommd@Mavlo Test]$ ./opt +RTS -sstderr ./opt +RTS -sstderr "233168" 94,604 bytes allocated in the heap 700 bytes copied during GC 17,144 bytes maximum residency (1 sample(s)) 23,816 bytes maximum slop 1 MB total memory in use (0 MB lost due to fragmentation) Generation 0: 0 collections, 0 parallel, 0.00s, 0.00s elapsed Generation 1: 1 collections, 0 parallel, 0.00s, 0.00s elapsed INIT time 0.00s ( 0.00s elapsed) MUT time 0.00s ( 0.00s elapsed) GC time 0.00s ( 0.00s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 0.00s ( 0.00s elapsed) %GC time 0.0% (14.0% elapsed) Alloc rate 47,325,662 bytes per MUT second Productivity 0.0% of total user, 0.0% of total elapsed ---------------------------------------------------------------- ---------------- main with n == 10000000 ------------------- [tommd@Mavlo Test]$ ./opt +RTS -sstderr ./opt +RTS -sstderr "23333341666668" 906,451,272 bytes allocated in the heap 128,992 bytes copied during GC 18,104 bytes maximum residency (1 sample(s)) 18,760 bytes maximum slop 1 MB total memory in use (0 MB lost due to fragmentation) Generation 0: 1742 collections, 0 parallel, 0.01s, 0.01s elapsed Generation 1: 1 collections, 0 parallel, 0.00s, 0.00s elapsed INIT time 0.00s ( 0.00s elapsed) MUT time 0.99s ( 1.02s elapsed) GC time 0.01s ( 0.01s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 1.00s ( 1.03s elapsed) %GC time 0.6% (1.0% elapsed) Alloc rate 912,980,911 bytes per MUT second Productivity 99.4% of total user, 96.7% of total elapsed ----------------------------------------------------------- So obviously there was a lot more data moving around but no significant difference in maximum memory use - a good sign!
The C it produced with '-fvia-C -C' was totally unreadable
Deprecating the via-C compilation path has been discussed - hopefully no one ever suggested it as a useful path for human inspection. Thomas

Hi Thomas, thanks for the reply!
Thomas DuBuisson
Josh,
In general you'll find the haskell-cafe (haskell-cafe <at> haskell.org) to be a more lively place for this type of discussion
Good to know, I just wasn't sure if it was appropriate for GHC-specific questions.
but since we're here I might as well mention that memory use of a Haskell function is one of the hardest things to gain an understanding about.
Hmm, that's unfortunate. :(
main = print (show (sum [x | x <- [3..999], x `mod` 3 == 0 || x `mod` 5 == 0])) I want to know if the list was ever actually constructed in memory.
For such a simple program I suggest you test with lists of various lengths:
Cool, thanks for the example.
94,604 bytes allocated in the heap
Is there any way I could find out what these 94kb of RAM were allocated for? This seems high -- my entire program's working set is <6kb. That's if you construct all my lists in full, including both the [3..999] list (which has 997 elements) and the list comprehension list (which has 466). This assumes 32-bit integers. Allow for some memory to format the string, and you still aren't within an order of magnitude of 94kb. I guess I'm holding Haskell to the standard of a compiled language rather than interpreted one, and assuming it doesn't need to allocate heap space for the code. Any idea how I could dig into this?
The C it produced with '-fvia-C -C' was totally unreadable
Deprecating the via-C compilation path has been discussed - hopefully no one ever suggested it as a useful path for human inspection.
I do still wonder if there isn't an intermediate form that *is* suitable for human inspection. Thanks, Josh

Hello Joshua, Sunday, August 2, 2009, 11:45:57 AM, you wrote:
94,604 bytes allocated in the heap
Is there any way I could find out what these 94kb of RAM were allocated for? This seems high -- my entire program's working set is <6kb.
as Don said, compiled code works on Int# (which is the same as C int) but probably not via registers. instead, each intermediate value allocated in heap, so there are total 90 bytes per iteration = about 22 integers -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

bulat.ziganshin:
Hello Joshua,
Sunday, August 2, 2009, 11:45:57 AM, you wrote:
94,604 bytes allocated in the heap
Is there any way I could find out what these 94kb of RAM were allocated for? This seems high -- my entire program's working set is <6kb.
as Don said, compiled code works on Int# (which is the same as C int) but probably not via registers. instead, each intermediate value allocated in heap, so there are total 90 bytes per iteration = about 22 integers
In my example, there's no heap allocation. In the version with build/foldr though, there will be heap nodes passed to sum. The fused one results in, 32,160 bytes allocated in the heap But there's no heap allocation in the filtering code, only once we have to construct the string to print. -- Don

joshua:
Hello, I'm quite new to Haskell, but experienced in other languages (C, Python, Ruby, SQL, etc). I am interested in Haskell because I've heard that the language is capable of lots of optimizations based on laziness, and I want to learn more about that.
I dug in with Project Euler problem #1, and wrote:
main = print (show (sum [x | x <- [3..999], x `mod` 3 == 0 || x `mod` 5 == 0]))
So far so good, but I want to have some way of observing what optimizations GHC has performed. Most notably, in this example I want to know if the list was ever actually constructed in memory. The "sum" function only needs the elements one at a time, in order, so if Haskell and GHC are everything I've heard about them, I would fully expect the list construction to be optimized out. :)
Unfortunately I was not able to see any way of examining ghc's output to determine whether it had performed this optimization. The C it produced with '-fvia-C -C' was totally unreadable -- it looked like something from the IOCCC. :( I couldn't find any way to match up any of the code it had generated with code I had written. My attempts to objdump the binaries was similarly unproductive.
Is there any kind of intermediate form that a person can examine to see how their code is being optimized? Anything like EXPLAIN PLAN in SQL? It would make it much easier to understand the kinds of optimizations Haskell can perform.
I'm not looking so much for profiling -- obvious this program is trivial and takes no time. I just want to better understand what kind of optimizations are possible given Haskell's language model.
So the optimization you're looking for here is fusion of some kind. GHC ships with build/foldr fusion, and there are libraries for an alternative system, stream fusion. GHC uses an intermediate representation called 'Core', which is a mini-Haskell, essentially, that is optimized repeatedly via type-preserving transformations. You can inspect this with a number of tools, including "ghc-core" (available on Hackage). Now, you're example uses a list comprehension (which is translated into an enumFromTo call, and a call to filter. It also uses a call to sum, which is a non-fusing left-fold under build/foldr fusion, but fuses under stream fusion. I'll desugar your code explicitly, and translate the calls from build/foldr to stream-fusible functions: $ cabal install uvector import Data.Array.Vector main = print . sumU . filterU (\x -> x `mod` 3 == 0 || x `mod` 5 == 0) $ enumFromToU 3 (999 :: Int) Running through ghc-core we see: 146 PreInlineUnconditionally 320 PostInlineUnconditionally 84 UnfoldingDone 18 RuleFired 7 +# 1 -# 1 <=# 2 ==#->case 1 ># 3 SC:$wfold0 1 fromIntegral/Int->Int 2 streamU/unstreamU 12 LetFloatFromLet 1 EtaReduction 210 BetaReduction 8 CaseOfCase 94 KnownBranch 4 CaseMerge 5 CaseElim 6 CaseIdentity 1 FillInCaseDefault 18 SimplifierDone Showing what transformations happened. Notably, 2 occurences of the "streamU/unstreamU" transformation, to remove intermediate structures. The final code looks like: $s$wfold :: Int# -> Int# $s$wfold = \ (sc_s19l :: Int#) -> case modInt# (-9223372036854775807) 3 of wild21_a14L { __DEFAULT -> case modInt# (-9223372036854775807) 5 of wild211_X159 { __DEFAULT -> $wfold sc_s19l (-9223372036854775806); 0 -> $wfold (+# sc_s19l (-9223372036854775807)) (-9223372036854775806) }; 0 -> $wfold (+# sc_s19l (-9223372036854775807)) (-9223372036854775806) } $wfold :: Int# -> Int# -> Int# $wfold = \ (ww_s189 :: Int#) (ww1_s18d :: Int#) -> case ># ww1_s18d 999 of wild_a15N { False -> case ww1_s18d of wild2_a14K { __DEFAULT -> case modInt# wild2_a14K 3 of wild21_a14L { __DEFAULT -> case modInt# wild2_a14K 5 of wild211_X159 { __DEFAULT -> $wfold ww_s189 (+# wild2_a14K 1); 0 -> $wfold (+# ww_s189 wild2_a14K) (+# wild2_a14K 1) }; 0 -> $wfold (+# ww_s189 wild2_a14K) (+# wild2_a14K 1) }; (-9223372036854775808) -> case modInt# (-9223372036854775808) 3 of wild21_a14N { __DEFAULT -> case lvl_r19G of wild211_a14x { False -> $s$wfold ww_s189; True -> $s$wfold (+# ww_s189 (-9223372036854775808)) }; 0 -> $s$wfold (+# ww_s189 (-9223372036854775808)) } }; True -> ww_s189 } Which might take a while to understand, but the key thing is the types at the top level are Int# -- they don't allocate [Int], but instead use unboxed, machine-level int values. Optimization succesful. There are papers and manuals describing "GHC Core" - the intermediate form - on haskell.org. For fusion, google for "deforestation" or "stream fusion". -- Don

dons:
Showing what transformations happened. Notably, 2 occurences of the "streamU/unstreamU" transformation, to remove intermediate structures.
The final code looks like:
$s$wfold :: Int# -> Int# $s$wfold = \ (sc_s19l :: Int#) -> case modInt# (-9223372036854775807) 3 of wild21_a14L { __DEFAULT -> case modInt# (-9223372036854775807) 5 of wild211_X159 { __DEFAULT -> $wfold sc_s19l (-9223372036854775806); 0 -> $wfold (+# sc_s19l (-9223372036854775807)) (-9223372036854775806) }; 0 -> $wfold (+# sc_s19l (-9223372036854775807)) (-9223372036854775806) } $wfold :: Int# -> Int# -> Int#
Oh, this is also a good illustration of why you should use `rem` instead of `mod` , if you're not expecting negative numbers: The core with filterU (\x -> x `rem` 3 == 0 || x `rem` 5 == 0) avoids all the checks for -9223372036854775806. $s$wfold :: Int# -> Int# $s$wfold = \ (sc_s18m :: Int#) -> $wfold sc_s18m (-9223372036854775806) $wfold :: Int# -> Int# -> Int# $wfold = \ (ww_s17v :: Int#) (ww1_s17z :: Int#) -> case ># ww1_s17z 999 of wild_a15d { False -> case ww1_s17z of wild2_a14K { __DEFAULT -> case remInt# wild2_a14K 3 of wild1_B1 { __DEFAULT -> case remInt# wild2_a14K 5 of wild21_Xp { __DEFAULT -> $wfold ww_s17v (+# wild2_a14K 1); 0 -> $wfold (+# ww_s17v wild2_a14K) (+# wild2_a14K 1) }; 0 -> $wfold (+# ww_s17v wild2_a14K) (+# wild2_a14K 1) }; (-9223372036854775808) -> $s$wfold ww_s17v }; True -> ww_s17v } Yes, that's better code. But remember, Prelude Test.QuickCheck> quickCheck (\x -> x `mod` 3 == x `rem` 3) Falsifiable, after 2 tests: -1 -- Don

Don Stewart
joshua:
Is there any kind of intermediate form that a person can examine to see how their code is being optimized? Anything like EXPLAIN PLAN in SQL? It would make it much easier to understand the kinds of optimizations Haskell can perform.
I'm not looking so much for profiling -- obvious this program is trivial and takes no time. I just want to better understand what kind of optimizations are possible given Haskell's language model.
So the optimization you're looking for here is fusion of some kind. GHC ships with build/foldr fusion, and there are libraries for an alternative system, stream fusion.
GHC uses an intermediate representation called 'Core', which is a mini-Haskell, essentially, that is optimized repeatedly via type-preserving transformations. You can inspect this with a number of tools, including "ghc-core" (available on Hackage).
Excellent Don, thanks a lot for taking the time to spell this out for me. I don't understand it all yet, but I know what to dig into next time I want to observe the compiler's behavior. Josh
participants (4)
-
Bulat Ziganshin
-
Don Stewart
-
Joshua Haberman
-
Thomas DuBuisson