forLoop + strict State monad is much faster than foldl'

This is just a short notice that using foldl' (+) 0 [0..100000::Int] is over 10 times slower than using flip execState 0 $ forLoop (0 :: Int) (< n) (+1) $ \i -> do x <- get put $! x + i with `loopFor` as on https://github.com/nh2/loop. Even using an IORef is twice as fast as the pure foldl' (but still 5 times slower than strict State). The benchmark is at http://htmlpreview.github.io/?https://github.com/nh2/loop/blob/master/result.... (All benchmarks are linked from https://github.com/nh2/loop.) You can see there that the problem is gone when using Vector.foldl', but only for Int - for Word32 it persists. It seems that manual looping is beneficial even when dealing with prime examples of pure code that GHC ought to optimize well. Niklas

Interestingly, the discrimination against Word32 does not end here: When compiling with -fllvm (http://htmlpreview.github.io/?https://github.com/nh2/loop/blob/master/result...), we can see that the forLoop + strict State monad is completely compiled away to a no-op for Int, but not for Word32.

Em 29-04-2014 13:35, Niklas Hambüchen escreveu:
(http://htmlpreview.github.io/?https://github.com/nh2/loop/blob/master/result...),
Is it just or is criterion broken in some way? It considers a 1140% effect on the mean "moderate", so I would guess that the number is being incorrectly printed. One of the tests even has a "severe" 7280% effect! Also, for "sum32/sumIntStrictState" is reports microseconds on the charts but seconds on the table. Cheers, =) -- Felipe.

On 29/04/14 17:43, Felipe Lessa wrote:
Also, for "sum32/sumIntStrictState" is reports microseconds on the charts but seconds on the table.
I have noticed these things before as well, but I don't know what causes them and it only seems to happen sometimes :/ I filed it as https://github.com/bos/criterion/issues/50

Regardless, that report is garbage. It's not worth the paper it's printed
on! Maybe re-run it and upload new results?
On Tue, Apr 29, 2014 at 9:53 AM, Niklas Hambüchen
On 29/04/14 17:43, Felipe Lessa wrote:
Also, for "sum32/sumIntStrictState" is reports microseconds on the charts but seconds on the table.
I have noticed these things before as well, but I don't know what causes them and it only seems to happen sometimes :/
I filed it as https://github.com/bos/criterion/issues/50 _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

Em 29-04-2014 15:31, John Lato escreveu:
Regardless, that report is garbage. It's not worth the paper it's printed on! Maybe re-run it and upload new results?
That's very harsh, why is it garbage? Cheers, =) -- Felipe.

because the events are on millisecond and microsecond scales, yet the means
are being reported on the order of seconds.
This is a strong piece of evidence the the claims therein are fundamentally
wrong, confusing and misleading and wrong.
On Tue, Apr 29, 2014 at 2:34 PM, Felipe Lessa
Em 29-04-2014 15:31, John Lato escreveu:
Regardless, that report is garbage. It's not worth the paper it's printed on! Maybe re-run it and upload new results?
That's very harsh, why is it garbage?
Cheers, =)
-- Felipe.
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

I think we are dealing with a much more trivial, technical issue here: It looks like http://htmlpreview.github.io and Criterion's Javascript do not play well together. If I wget the raw file, https://raw.githubusercontent.com/nh2/loop/master/results/bench-foldl-and-io..., and view it in my browser, all the number reported are suddenly correct. Can you check if that works for you? On 29/04/14 19:55, Carter Schonwald wrote:
because the events are on millisecond and microsecond scales, yet the means are being reported on the order of seconds.
This is a strong piece of evidence the the claims therein are fundamentally wrong, confusing and misleading and wrong.

Look, when I serve it via a different service, the report looks all fine: https://rawgit.com/nh2/loop/master/results/bench-foldl-and-iorefs-are-slow.h... I should probably switch it to hosting it via Github pages once I figure out how to do that comfortably.
This is a strong piece of evidence the the claims therein are fundamentally wrong, confusing and misleading and wrong.
Now back to the claims!

Yes, this link is much better, thanks. (Haven't had a chance to look at the
claims yet).
On Apr 29, 2014 12:10 PM, "Niklas Hambüchen"
Look, when I serve it via a different service, the report looks all fine:
https://rawgit.com/nh2/loop/master/results/bench-foldl-and-iorefs-are-slow.h...
I should probably switch it to hosting it via Github pages once I figure out how to do that comfortably.
This is a strong piece of evidence the the claims therein are fundamentally wrong, confusing and misleading and wrong.
Now back to the claims!

Int has a tight loop for `EnumFromTo`
http://git.haskell.org/packages/base.git/blob/HEAD:/GHC/Enum.lhs#l500
On the other hand look at `EnumFromTo` for word32
http://hackage.haskell.org/package/base-4.7.0.0/docs/src/GHC-Word.html#Word3...
It calls `integralEnumFromTo`
Which is
723http://git.haskell.org/packages/base.git/blob/52c0b09036c36f1ed928663abb2f29...integralEnumFromTo
:: Integral a => a -> a -> [a]
724http://git.haskell.org/packages/base.git/blob/52c0b09036c36f1ed928663abb2f29...integralEnumFromTo
n m = map fromInteger [toInteger n .. toInteger m]
That is right you get converted to an Integer, then back again.
http://git.haskell.org/packages/base.git/blob/52c0b09036c36f1ed928663abb2f29...
This seems to be the problem with all loops that use `Word32` and
`EnumFromTo`.
It looks like you could add a better `EnumFromTo` for `Word32` and you
would close the gap between the loops that are using `EnumFromTo` and those
that are not.
Patrick
On Tue, Apr 29, 2014 at 11:35 AM, Niklas Hambüchen
Interestingly, the discrimination against Word32 does not end here:
When compiling with -fllvm ( http://htmlpreview.github.io/?https://github.com/nh2/loop/blob/master/result... ), we can see that the forLoop + strict State monad is completely compiled away to a no-op for Int, but not for Word32. _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
-- Patrick Wheeler Patrick.John.Wheeler@gmail.com Patrick.J.Wheeler@rice.edu Patrick.Wheeler@colorado.edu

On Tue, Apr 29, 2014 at 9:31 AM, Niklas Hambüchen
This is just a short notice that using
foldl' (+) 0 [0..100000::Int]
is over 10 times slower than using
flip execState 0 $ forLoop (0 :: Int) (< n) (+1) $ \i -> do x <- get put $! x + i
with `loopFor` as on https://github.com/nh2/loop.
So this is interesting. The forLoop code gets compiled into a nice loop in core over unboxed Ints. The foldl' function OTOH still goes via a list. I expect it's not foldl' itself that's slow, rather that it doesn't fuse with the list producers in current GHCs. This may be improved in the future. Especially as the vectorized foldl' does fuse.
Even using an IORef is twice as fast as the pure foldl' (but still 5 times slower than strict State).
The benchmark is at
http://htmlpreview.github.io/?https://github.com/nh2/loop/blob/master/result... .
(All benchmarks are linked from https://github.com/nh2/loop.)
You can see there that the problem is gone when using Vector.foldl', but only for Int - for Word32 it persists.
It seems that manual looping is beneficial even when dealing with prime examples of pure code that GHC ought to optimize well.
That's why I suggested using vector in the first place. But it seems that Vector.enumFromTo could use some optimizations to help with some common cases. John

On 30/04/14 02:16, John Lato wrote:
So this is interesting. The forLoop code gets compiled into a nice loop in core over unboxed Ints. The foldl' function OTOH still goes via a list. I expect it's not foldl' itself that's slow, rather that it doesn't fuse with the list producers in current GHCs. This may be improved in the future. Especially as the vectorized foldl' does fuse.
It bothers me especially as the comments around `enumFromTo` seem to explicitly talk about deforestation.

Hi, Am Dienstag, den 29.04.2014, 18:16 -0700 schrieb John Lato:
So this is interesting. The forLoop code gets compiled into a nice loop in core over unboxed Ints. The foldl' function OTOH still goes via a list. I expect it's not foldl' itself that's slow, rather that it doesn't fuse with the list producers in current GHCs. This may be improved in the future. Especially as the vectorized foldl' does fuse.
This should have improved in GHC 7.9, where I have fixed https://ghc.haskell.org/trac/ghc/ticket/7994 (making foldl a good consumer). Greetings, Joachim -- Joachim “nomeata” Breitner mail@joachim-breitner.de • http://www.joachim-breitner.de/ Jabber: nomeata@joachim-breitner.de • GPG-Key: 0xF0FBF51F Debian Developer: nomeata@debian.org

On Wednesday 30 April 2014, 13:42:36, Joachim Breitner wrote:
Hi,
Am Dienstag, den 29.04.2014, 18:16 -0700 schrieb John Lato:
So this is interesting. The forLoop code gets compiled into a nice loop in core over unboxed Ints. The foldl' function OTOH still goes via a list. I expect it's not foldl' itself that's slow, rather that it doesn't fuse with the list producers in current GHCs. This may be improved in the future. Especially as the vectorized foldl' does fuse.
This should have improved in GHC 7.9, where I have fixed https://ghc.haskell.org/trac/ghc/ticket/7994 (making foldl a good consumer).
Confirmed, foldl' (+) 0 [1 .. 5000000 :: Int] gets the very nice core
Rec {
Main.$wgo [Occ=LoopBreaker]
:: GHC.Prim.Int# -> GHC.Prim.Int# -> GHC.Prim.Int#
[GblId, Arity=2, Caf=NoCafRefs, Str=DmdType

Hi, Am Dienstag, den 29.04.2014, 17:31 +0100 schrieb Niklas Hambüchen:
This is just a short notice that using
foldl' (+) 0 [0..100000::Int]
is over 10 times slower than using
flip execState 0 $ forLoop (0 :: Int) (< n) (+1) $ \i -> do x <- get put $! x + i
with `loopFor` as on https://github.com/nh2/loop.
Did you look at the GHC Core generated? When starting to investigate performance issues, reading Core simply is required – otherwise it’s like search for a lost plain while refusing to look under the water surface. So can you compile your code with -ddump-simpl (and possibly flags -dsuppress-uniques and -dsuppress-module-prefixes that make it more readable) and compare the results: Any obvious differences, like unboxing to Int# used in one place and not in the other? If you need help deciphering core, paste the relevant bits in a mail to this thread and we can help you. Greetings, Joachim -- Joachim “nomeata” Breitner mail@joachim-breitner.de • http://www.joachim-breitner.de/ Jabber: nomeata@joachim-breitner.de • GPG-Key: 0xF0FBF51F Debian Developer: nomeata@debian.org

On 30/04/14 12:41, Joachim Breitner wrote:
Did you look at the GHC Core generated? When starting to investigate performance issues, reading Core simply is required – otherwise it’s like search for a lost plain while refusing to look under the water surface.
Of course. A short look with ghc-core on import Data.List main = print (foldl' (*) 0 [0..100000::Int]) reveals that the original list traversal is still present: $wlgo = \ (ww_s1HD :: Int#) (w_s1HF :: [Int]) -> case w_s1HF of _ { [] -> ww_s1HD; : x_asn xs_aso -> case x_asn of _ { I# y_asz -> $wlgo (*# ww_s1HD y_asz) xs_aso } } There is not much point investigating further until collecting the responses from this list - as it turned out, you already fixed the problem. Thanks! For performance-critical code, it seems sensible for me to stick to my `forLoop` for now, as it works quite well with the existing GHCs and much less brain seems to be necessary to get that compile down to fast code.

Hi, Am Mittwoch, den 30.04.2014, 15:13 +0100 schrieb Niklas Hambüchen:
For performance-critical code, it seems sensible for me to stick to my `forLoop` for now, as it works quite well with the existing GHCs and much less brain seems to be necessary to get that compile down to fast code.
I agree. I think one big problem is that RULES (e.g. list fusion) are great when they work, but it is very hard for the “normal” Haskell programmer to confirm that they work as expected, let alone predict when they work. I don’t have a good solution, though. Greetings, Joachim -- Joachim “nomeata” Breitner mail@joachim-breitner.de • http://www.joachim-breitner.de/ Jabber: nomeata@joachim-breitner.de • GPG-Key: 0xF0FBF51F Debian Developer: nomeata@debian.org

Maybe some means of asserting that a rule is applied? Hopefully by
effect (this list should not appear in the core, or whatnot) rather
than by naming the particular rule...
On Wed, Apr 30, 2014 at 7:34 AM, Joachim Breitner
Hi,
Am Mittwoch, den 30.04.2014, 15:13 +0100 schrieb Niklas Hambüchen:
For performance-critical code, it seems sensible for me to stick to my `forLoop` for now, as it works quite well with the existing GHCs and much less brain seems to be necessary to get that compile down to fast code.
I agree. I think one big problem is that RULES (e.g. list fusion) are great when they work, but it is very hard for the “normal” Haskell programmer to confirm that they work as expected, let alone predict when they work. I don’t have a good solution, though.
Greetings, Joachim
-- Joachim “nomeata” Breitner mail@joachim-breitner.de • http://www.joachim-breitner.de/ Jabber: nomeata@joachim-breitner.de • GPG-Key: 0xF0FBF51F Debian Developer: nomeata@debian.org
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

@Niklas
You can use the `foldM` form the FoldL package to achieve equal results as
the fastest loops in your current benchmark. foldM will allow you to deal
with your loops at a high level of abstraction though. See the following
post, by Gabriel Gonzalez, for an example:
http://www.haskellforall.com/2013/08/foldl-100-composable-streaming-and.html
I have added the bench mark to my fork of your repo, and made a pull
request:
https://github.com/Davorak/loop
It looks like the only reason that `foldM` does not preform well with
`Word32` is because of the naive implementation of `enumFromTo` for Word32
as explained in my other email in more detail.
Here is the Criterion report:
http://htmlpreview.github.io/?https://github.com/Davorak/loop/blob/master/re...
On Tue, Apr 29, 2014 at 11:31 AM, Niklas Hambüchen
This is just a short notice that using
foldl' (+) 0 [0..100000::Int]
is over 10 times slower than using
flip execState 0 $ forLoop (0 :: Int) (< n) (+1) $ \i -> do x <- get put $! x + i
with `loopFor` as on https://github.com/nh2/loop.
Even using an IORef is twice as fast as the pure foldl' (but still 5 times slower than strict State).
The benchmark is at
http://htmlpreview.github.io/?https://github.com/nh2/loop/blob/master/result... .
(All benchmarks are linked from https://github.com/nh2/loop.)
You can see there that the problem is gone when using Vector.foldl', but only for Int - for Word32 it persists.
It seems that manual looping is beneficial even when dealing with prime examples of pure code that GHC ought to optimize well.
Niklas _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
-- Patrick Wheeler Patrick.John.Wheeler@gmail.com Patrick.J.Wheeler@rice.edu Patrick.Wheeler@colorado.edu

Nice find, I merged your benchmark addition. Would you mind making a GHC issue from your `enumFromTo` findings? On 01/05/14 15:17, Patrick Wheeler wrote:
@Niklas
You can use the `foldM` form the FoldL package to achieve equal results as the fastest loops in your current benchmark. foldM will allow you to deal with your loops at a high level of abstraction though. See the following post, by Gabriel Gonzalez, for an example: http://www.haskellforall.com/2013/08/foldl-100-composable-streaming-and.html
I have added the bench mark to my fork of your repo, and made a pull request: https://github.com/Davorak/loop
It looks like the only reason that `foldM` does not preform well with `Word32` is because of the naive implementation of `enumFromTo` for Word32 as explained in my other email in more detail.
Here is the Criterion report: http://htmlpreview.github.io/?https://github.com/Davorak/loop/blob/master/re...
participants (8)
-
Carter Schonwald
-
Daniel Fischer
-
David Thomas
-
Felipe Lessa
-
Joachim Breitner
-
John Lato
-
Niklas Hambüchen
-
Patrick Wheeler