
#8763: forM_ [1..N] does not get fused (10 times slower than go function) -------------------------------------+------------------------------------- Reporter: nh2 | Owner: (none) Type: bug | Status: new Priority: normal | Milestone: 8.6.1 Component: Compiler | Version: 7.6.3 Resolution: | Keywords: Operating System: Unknown/Multiple | Architecture: Type of failure: Runtime | Unknown/Multiple performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #7206 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by sgraf): Here is a smaller example that highlights the problem without vectors. The only difference between the two functions is the use of `[2,3..n]` instead of `[2..n]`, which desugar to different functions. This results in a difference in a huge difference in allocation as well as runtime: {{{ $ ./repro 2 +RTS -s # [2,3..n] () 960,056,856 bytes allocated in the heap 21,536 bytes copied during GC 44,576 bytes maximum residency (2 sample(s)) 29,152 bytes maximum slop 2 MB total memory in use (0 MB lost due to fragmentation) Tot time (elapsed) Avg pause Max pause Gen 0 918 colls, 0 par 0.005s 0.003s 0.0000s 0.0000s Gen 1 2 colls, 0 par 0.000s 0.000s 0.0001s 0.0002s INIT time 0.000s ( 0.000s elapsed) MUT time 0.123s ( 0.125s elapsed) GC time 0.005s ( 0.003s elapsed) EXIT time 0.000s ( 0.000s elapsed) Total time 0.129s ( 0.129s elapsed) %GC time 4.1% (2.5% elapsed) Alloc rate 7,778,808,106 bytes per MUT second Productivity 95.8% of total user, 97.4% of total elapsed }}} {{{ $ ./repro 1 +RTS -s # [2..n] () 56,872 bytes allocated in the heap 3,480 bytes copied during GC 44,576 bytes maximum residency (1 sample(s)) 25,056 bytes maximum slop 2 MB total memory in use (0 MB lost due to fragmentation) Tot time (elapsed) Avg pause Max pause Gen 0 0 colls, 0 par 0.000s 0.000s 0.0000s 0.0000s Gen 1 1 colls, 0 par 0.000s 0.000s 0.0001s 0.0001s INIT time 0.000s ( 0.000s elapsed) MUT time 0.048s ( 0.048s elapsed) GC time 0.000s ( 0.000s elapsed) EXIT time 0.000s ( 0.000s elapsed) Total time 0.048s ( 0.048s elapsed) %GC time 0.2% (0.2% elapsed) Alloc rate 1,188,432 bytes per MUT second Productivity 99.6% of total user, 99.6% of total elapsed }}} This happens in `ST`, but not in `IO`, so probably related to some hack. Also the difference vanishes when we allow the functions to inline. Here's some Core for `g` (the offending function): {{{ -- RHS size: {terms: 235, types: 242, coercions: 61, joins: 4/13} $wg $wg = \ @ s ww w -> let { $wc = <huge> } in case <# ww 3# of { __DEFAULT -> let { y' y' = -# ww 1# } in letrec { go_up go_up = \ x eta -> case ># x y' of { __DEFAULT -> $wc x ((go_up (+# x 1#)) `cast` Co:4) eta; 1# -> $wc x (lvl `cast` Co:4) eta }; } in $wc 2# ((go_up 3#) `cast` Co:4) w; 1# -> case <# ww 2# of { __DEFAULT -> $wc 2# (lvl `cast` Co:4) w; 1# -> (# w, () #) } } }}} From my understanding of join points, `$wc` is only nearly a join point, because `go_up` with its transitive tail call to `$wc` appears in argument position. It would be great if we could get rid of this! The `IO` variant (`g 40000000 >>= print`) doesn't have this weakness, it's all join points there. Hence my suspicion about some `IO` hack that let's GHC eta-expand stuff more aggresively, but I'm not sure how that's helping. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8763#comment:44 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler