Re: suboptimal ghc code generation in IO vs equivalent pure code case

I'm no expert on reading GHC's generated assembly. However, there may
be a line you've overlooked in explaining the difference, namely:
    movq $stg_upd_frame_info,-16(%rbp)
This appears only in the IO code, according to what you've pasted, and
it appears to be pushing an update frame (I think). Update frames are
used as part of lazy evaluation, to ensure that work only happens
once, barring very short race conditions. So, at a guess, I would say
that your IO-based code is creating extra updatable closures, which
isn't free.
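As a toy illustration of the sharing that update frames implement (a
made-up example, nothing to do with your code):

    import Debug.Trace (trace)

    shared :: Integer
    shared = trace "evaluated once" (sum [1 .. 1000000])

    main :: IO ()
    main = do
        print shared  -- forces the thunk: an update frame is pushed, and the
                      -- closure is overwritten with the result afterwards
        print shared  -- just reads the updated closure; no second trace line

Every updatable closure you allocate carries that push-and-overwrite
cost, even when it is only ever entered once.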
It's sometimes difficult to see the difference at the Core level. It
will probably be clearer at the STG level, because the expression
language is more disciplined there. But, for instance, your pure code
tail calls (++), whereas your IO code returns an unboxed tuple
containing the same sort of expression that appears in the pure tail
call. Complex expressions like that can't really be put in an unboxed
tuple at the STG level, so what happens is that the complex expression
is let-bound (a closure allocation), and the closure is returned in
the unboxed tuple. That is the source of the difference. A more
perspicuous picture would be something like:
Pure:

    False ->
      let {
        l1 = : ww_amuh []
        l2 = Data.Unicode.Internal.Normalization.decompose_$sdecompose
               ipv_smuv ipv1_smuD
      } in ++ l1 l2

IO:

    False ->
      case $sa1_sn0g ipv_smUT ipv1_smV6 ipv2_imWU
      of _ { (# ipv4_XmXv, ipv5_XmXx #) ->
        let {
          l1 = : sc_sn0b []
          l3 = ++ l1 ipv5_XmXx
        } in (# ipv4_XmXv, l3 #)
      }
I can't say for certain that that's the only thing making a
difference, but it might be one thing.
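One way to check is to dump the STG directly, e.g.:

    ghc -O2 -ddump-stg Decompose.hs

(file name assumed); the let binding for that closure should appear
explicitly in the False branch.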
-- Dan
On Mon, May 9, 2016 at 10:23 AM, Harendra Kumar wrote:
I have a loop which runs millions of times. For some reason I have to run it in the IO monad. I noticed that when I convert the code from pure to the IO monad, the generated assembly is in essence almost identical, except for one difference: a piece of code is placed in a separate block, and that makes a huge difference in performance (4-6x slower).
I want to understand what makes GHC generate code this way, and whether there is anything that can be done at the source level (or via a GHC option) to control it.
The pure code looks like this:
decomposeChars :: [Char] -> [Char]

decomposeChars [] = []
decomposeChars [x] =
    case NFD.isDecomposable x of
        True  -> decomposeChars (NFD.decomposeChar x)
        False -> [x]
decomposeChars (x : xs) = decomposeChars [x] ++ decomposeChars xs
The equivalent IO code is this:
decomposeStrIO :: [Char] -> IO [Char]

decomposeStrPtr !p = decomposeStrIO
    where
        decomposeStrIO [] = return []
        decomposeStrIO [x] = do
            res <- NFD.isDecomposable p x
            case res of
                True  -> decomposeStrIO (NFD.decomposeChar x)
                False -> return [x]
        decomposeStrIO (x : xs) = do
            s1 <- decomposeStrIO [x]
            s2 <- decomposeStrIO xs
            return (s1 ++ s2)
The difference is in how the code corresponding to the call to the (++) operation is generated. In the pure case the (++) call is inlined in the main loop:
_cn5N:
    movq $sat_sn2P_info,-48(%r12)
    movq %rax,-32(%r12)
    movq %rcx,-24(%r12)
    movq $:_con_info,-16(%r12)
    movq 16(%rbp),%rax
    movq %rax,-8(%r12)
    movq $GHC.Types.[]_closure+1,(%r12)
    leaq -48(%r12),%rsi
    leaq -14(%r12),%r14
    addq $40,%rbp
    jmp GHC.Base.++_info
In the IO monad version this code is placed in a separate block and a call is placed in the main loop:
The main loop call site:
_cn6A:
    movq $sat_sn3w_info,-24(%r12)
    movq 8(%rbp),%rax
    movq %rax,-8(%r12)
    movq %rbx,(%r12)
    leaq -24(%r12),%rbx
    addq $40,%rbp
    jmp *(%rbp)
The out-of-line block (the code that was in the main loop in the previous case has moved here; see label _cn5s below):
sat_sn3w_info:
_cn5p:
    leaq -16(%rbp),%rax
    cmpq %r15,%rax
    jb _cn5q
_cn5r:
    addq $24,%r12
    cmpq 856(%r13),%r12
    ja _cn5t
_cn5s:
    movq $stg_upd_frame_info,-16(%rbp)
    movq %rbx,-8(%rbp)
    movq 16(%rbx),%rax
    movq 24(%rbx),%rbx
    movq $:_con_info,-16(%r12)
    movq %rax,-8(%r12)
    movq $GHC.Types.[]_closure+1,(%r12)
    movq %rbx,%rsi
    leaq -14(%r12),%r14
    addq $-16,%rbp
    jmp GHC.Base.++_info
_cn5t:
    movq $24,904(%r13)
_cn5q:
    jmp *-16(%r13)
Except for this difference, the rest of the assembly looks pretty similar in both cases. The corresponding -ddump-simpl output for the pure case:
    False ->
      ++ @ Char
         (GHC.Types.: @ Char ww_amuh (GHC.Types.[] @ Char))
         (Data.Unicode.Internal.Normalization.decompose_$sdecompose
            ipv_smuv ipv1_smuD);
And for the IO monad version:
    False ->
      case $sa1_sn0g ipv_smUT ipv1_smV6 ipv2_imWU
      of _ [Occ=Dead] { (# ipv4_XmXv, ipv5_XmXx #) ->
        (# ipv4_XmXv,
           ++ @ Char (GHC.Types.: @ Char sc_sn0b (GHC.Types.[] @ Char))
              ipv5_XmXx #)
      };
The -ddump-simpl output is essentially the same, except for the differences due to the realworld token in the IO case. Why is the generated code different? I would appreciate it if someone could shed some light on the reason, or point me to the relevant GHC source where this happens.
I am using ghc-7.10.3 with the native code generator (no LLVM).
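For reference, the dumps above come from an invocation along these
lines (the file name is a placeholder):

    ghc -O2 -fforce-recomp -ddump-simpl -ddump-asm Decompose.hs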
Thanks, Harendra

Thanks Dan, that helped. I did notice and suspect the update frame and the
unboxed tuple, but given my limited knowledge about GHC/Core/STG/Cmm I was
not sure what was going on. In fact I thought that the intermediate tuple
should get optimized out, since it is required only because of the realworld
token, which is not real. But it might be difficult to see that at this
level?
What you are saying may be true for the current implementation, but can
we in theory eliminate the intermediate closure?

I observed that the same unboxed tuple results in inline code for the (:)
constructor. I rewrote the source a bit differently, which resulted in this
core:
    case ipv1_amWI of _ [Occ=Dead] {
      False ->
        case a1_smWm xs_aloV ipv_amWH
        of _ [Occ=Dead] { (# ipv2_XmXf, ipv3_XmXh #) ->
          (# ipv2_XmXf, GHC.Types.: @ Char x_aloU ipv3_XmXh #)
        };
      True ->
        case a1_smWm (NFD.decomposeChar x_aloU) ipv_amWH
        of _ [Occ=Dead] { (# ipv2_XmXf, ipv3_XmXh #) ->
          case a1_smWm xs_aloV ipv2_XmXf
          of _ [Occ=Dead] { (# ipv4_XmXk, ipv5_XmXm #) ->
            (# ipv4_XmXk, ++ @ Char ipv3_XmXh ipv5_XmXm #)
          }
        }
    }
We can see that both the True and the False cases return an unboxed
tuple. The only difference is (:) vs (++). But the generated assembly is
different for the two cases.

(++) has a closure as before:
sat_sn0Z_info:
_cn2c:
    leaq -16(%rbp),%rax
    cmpq %r15,%rax
    jb _cn2d
_cn2e:
    movq $stg_upd_frame_info,-16(%rbp)
    movq %rbx,-8(%rbp)
    movq 24(%rbx),%rsi
    movq 16(%rbx),%r14
    addq $-16,%rbp
    jmp GHC.Base.++_info
(:) is treated differently and generated inline, as you can see below:
_cn2I:
    movq $sat_sn0Z_info,-24(%r12)
    movq 8(%rbp),%rax
    movq %rax,-8(%r12)
    movq %rbx,(%r12)
    leaq -24(%r12),%rbx
    addq $16,%rbp
    jmp *(%rbp)
...
block_cn2u_info:
_cn2u:
    addq $24,%r12
    cmpq 856(%r13),%r12
    ja _cn2C
_cn2B:
    movq $:_con_info,-16(%r12)
    movq 8(%rbp),%rax
    movq %rax,-8(%r12)
    movq %rbx,(%r12)
    leaq -14(%r12),%rbx
    addq $24,%rbp
    jmp *(%rbp)
So why is that? Why can't we generate (++) inline, similar to (:)? How do
we make this decision? Is there a theoretical reason for this, or is it
just an implementation artifact?

Since the tuple is required only for passing the realworld token, ideally I
wouldn't want it to make any difference to the generated code. Otherwise I
would always have to worry about such performance side effects when using
ST/IO.
-harendra

On Tue, May 10, 2016 at 4:45 AM, Harendra Kumar wrote:
> Thanks Dan, that helped. I did notice and suspect the update frame and
> the unboxed tuple, but given my limited knowledge about GHC/Core/STG/Cmm
> I was not sure what was going on. In fact I thought that the intermediate
> tuple should get optimized out, since it is required only because of the
> realworld token, which is not real. But it might be difficult to see that
> at this level?
The token exists as far as the STG level is concerned, I think, because that is the only thing ensuring that things happen in the right order. And the closure must be built to have properly formed STG. So optimizing away the closure building would have to happen at a level lower than STG, and I guess there is no such optimization. I'm not sure how easy it would be to do.
> What you are saying may be true for the current implementation, but can
> we in theory eliminate the intermediate closure?
I don't know. I'm a bit skeptical that building this one closure is the only thing causing your code to be a factor of 5 slower. For instance, another difference in the core is that the recursive call corresponding to the result s2 happens before allocating the additional closure. That is the case statement that unpacks the unboxed tuple. And the whole loop happens this way, so it is actually building a structure corresponding to the entire output list in memory rather eagerly.

By contrast, your pure function is able to act in a streaming fashion, if consumed properly, where only enough of the result is built to keep driving the rest of the program. It probably runs in constant space, while your IO-based loop has a footprint linear in the size of the input list (in addition to having slightly more overhead per character because of the one extra thunk), because it is a more eager program.

And having an asymptotically larger memory footprint is more likely to explain a 5x slowdown than allocating one extra thunk for each list concatenation, I think. (Of course, it could also be some other factor, as well.)

You should probably run with +RTS -s (compiling with -rtsopts), and see if the IO version has a much larger maximum residency.

Anyhow, this is fundamental to writing the algorithm using IO. It's an algorithm that's a sequence of steps that happen in order, the number of steps is proportional to the input list, and part of those steps is putting stuff in memory for the results.
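Concretely, the residency check is something like (a generic invocation;
Main.hs is a placeholder):

    ghc -O2 -rtsopts Main.hs
    ./Main +RTS -s

Then compare the "maximum residency" figures between the two versions.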
> So why is that? Why can't we generate (++) inline, similar to (:)? How
> do we make this decision? Is there a theoretical reason for this, or is
> it just an implementation artifact?
The difference between these two is that (++) is a function call, and (:) is a constructor. Constructors are a special case, and don't need to have all the provisions that functions in general do. The best way to learn what the differences are is probably to read the paper about the STG machine.

Hope this is a more fruitful lead, but it may be that there's not a lot that can be done, without doing some baroque things to your algorithm, because of the necessity of its using IO.

-- Dan

You are right about the way the IO code is generated because of the ordering
semantics. The IO version builds the list entirely in a recursive fashion
before returning it, while the pure code builds it lazily. I wrote very
simple versions to keep things simple:
Pure version:
f [] = []
f (x : xs) = x : f xs
time 11.08 ms (10.86 ms .. 11.34 ms)
Measured for a million elements in the list
104,041,264 bytes allocated in the heap
16,728 bytes copied during GC
35,992 bytes maximum residency (1 sample(s))
21,352 bytes maximum slop
1 MB total memory in use (0 MB lost due to fragmentation)
IO version:
f [] = return []
f (x : xs) = do
rest <- f xs
return $ x : rest
time 79.66 ms (75.49 ms .. 82.55 ms)
208,654,560 bytes allocated in the heap
121,045,336 bytes copied during GC
27,679,344 bytes maximum residency (8 sample(s))
383,376 bytes maximum slop
66 MB total memory in use (0 MB lost due to fragmentation)
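(The timings come from a criterion-style harness, roughly the following
sketch, with the IO variant renamed to avoid a clash:)

    import Criterion.Main

    f :: [Int] -> [Int]
    f [] = []
    f (x : xs) = x : f xs

    fIO :: [Int] -> IO [Int]
    fIO [] = return []
    fIO (x : xs) = do
        rest <- fIO xs
        return (x : rest)

    main :: IO ()
    main = defaultMain
        [ bench "pure" $ nf f [1 .. 1000000]
        , bench "io"   $ nfIO (fIO [1 .. 1000000])
        ]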
Even though this is a small program that doesn't do much, and therefore
amplifies even small differences, I am not sure I can completely explain a
slowdown on the order of 7.5x by just the recursive vs lazy building of the
list. I am wondering if there is anything here worth further investigating
and improving.
-harendra

The difference seems to be entirely due to memory pressure. At list size
1000 both the pure and the IO versions perform equally. But as the size of
the list increases, the pure version scales linearly while the IO version
degrades super-linearly. Here are the execution times per list element, in
ns, as the list size increases:
Size of list    Pure    IO
1000             8.7    8.3
10000            8.7    18
100000           8.8    63
1000000          9.3    786
This seems to be due to increased GC activity in the IO case. The GC stats
for list size 1 million are:
IO case: %GC time 66.1% (61.1% elapsed)
Pure case: %GC time 2.6% (3.3% elapsed)
I am not sure if there is a way to write this code in the IO monad that
reduces this overhead.
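The only thing I can think of (an untested sketch, not something from the
discussion so far) is deferring the recursion with unsafeInterleaveIO,
which restores streaming at the cost of changing when the effects run:

    import System.IO.Unsafe (unsafeInterleaveIO)

    f :: [a] -> IO [a]
    f []       = return []
    f (x : xs) = do
        -- the recursive call is suspended until the tail is demanded,
        -- so the list is built lazily, as in the pure version
        rest <- unsafeInterleaveIO (f xs)
        return (x : rest)

That is only acceptable if the effects are benign, though, so it may not
apply to the real code.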
-harendra

Well, a few weeks ago Bertram Felgenhauer came up with a version of IO that
acts more like lazy ST. That could be just the thing. He placed it in the
public domain/CC0 and told me I could put it up on Hackage if I want. I'll
try to do that this week, but no promises. I could forward his email if you
just want to try it out.
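To give the flavor of the idea (this is just my own sketch, not Bertram's
actual code): every bind suspends the action it sequences, so effects run
only when their results are demanded, much like Control.Monad.ST.Lazy.

    import System.IO.Unsafe (unsafeInterleaveIO)

    newtype LazyIO a = LazyIO { runLazyIO :: IO a }

    instance Functor LazyIO where
        fmap f (LazyIO m) = LazyIO (fmap f m)

    instance Applicative LazyIO where
        pure = LazyIO . return
        LazyIO mf <*> LazyIO mx = LazyIO $ do
            f <- unsafeInterleaveIO mf
            x <- unsafeInterleaveIO mx
            return (f x)

    instance Monad LazyIO where
        return = pure
        LazyIO m >>= k = LazyIO $ do
            a <- unsafeInterleaveIO m
            runLazyIO (k a)

The real version presumably takes more care than this; it is only meant to
show why it helps with the list-building pattern above.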

On 15 May 2016 at 01:35, David Feuer wrote:
> Well, a few weeks ago Bertram Felgenhauer came up with a version of IO
> that acts more like lazy ST. That could be just the thing. He placed it
> in the public domain/CC0 and told me I could put it up on Hackage if I
> want. I'll try to do that this week, but no promises. I could forward
> his email if you just want to try it out.

That's exactly what I was thinking about. Can you please forward it to me? I will try it out.

Thanks,
Harendra