forLoop + strict State monad is much faster than foldl'

newer
Prevent replicateM_ from repeating...

older
ANN: hledger-0.23, hledger-web-0.23

Niklas Hambüchen

29 Apr 2014 29 Apr '14

4:31 p.m.

This is just a short notice that using foldl' (+) 0 [0..100000::Int] is over 10 times slower than using flip execState 0 $ forLoop (0 :: Int) (< n) (+1) $ \i -> do x <- get put $! x + i with `loopFor` as on https://github.com/nh2/loop. Even using an IORef is twice as fast as the pure foldl' (but still 5 times slower than strict State). The benchmark is at http://htmlpreview.github.io/?https://github.com/nh2/loop/blob/master/result.... (All benchmarks are linked from https://github.com/nh2/loop.) You can see there that the problem is gone when using Vector.foldl', but only for Int - for Word32 it persists. It seems that manual looping is beneficial even when dealing with prime examples of pure code that GHC ought to optimize well. Niklas

Show replies by date

Niklas Hambüchen

29 Apr 29 Apr

4:35 p.m.

New subject: forLoop + strict State monad is much faster than foldl'

Interestingly, the discrimination against Word32 does not end here: When compiling with -fllvm (http://htmlpreview.github.io/?https://github.com/nh2/loop/blob/master/result...), we can see that the forLoop + strict State monad is completely compiled away to a no-op for Int, but not for Word32.

Felipe Lessa

4:43 p.m.

New subject: forLoop + strict State monad is much faster than foldl'

Em 29-04-2014 13:35, Niklas Hambüchen escreveu:

...

(http://htmlpreview.github.io/?https://github.com/nh2/loop/blob/master/result...),

Is it just or is criterion broken in some way? It considers a 1140% effect on the mean "moderate", so I would guess that the number is being incorrectly printed. One of the tests even has a "severe" 7280% effect! Also, for "sum32/sumIntStrictState" is reports microseconds on the charts but seconds on the table. Cheers, =) -- Felipe.

Niklas Hambüchen

4:53 p.m.

New subject: forLoop + strict State monad is much faster than foldl'

On 29/04/14 17:43, Felipe Lessa wrote:

...

Also, for "sum32/sumIntStrictState" is reports microseconds on the charts but seconds on the table.

I have noticed these things before as well, but I don't know what causes them and it only seems to happen sometimes :/ I filed it as https://github.com/bos/criterion/issues/50

John Lato

6:31 p.m.

Regardless, that report is garbage. It's not worth the paper it's printed on! Maybe re-run it and upload new results? On Tue, Apr 29, 2014 at 9:53 AM, Niklas Hambüchen wrote:

...

On 29/04/14 17:43, Felipe Lessa wrote:

...
Also, for "sum32/sumIntStrictState" is reports microseconds on the charts but seconds on the table.

I have noticed these things before as well, but I don't know what causes them and it only seems to happen sometimes :/

I filed it as https://github.com/bos/criterion/issues/50 _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

Felipe Lessa

6:34 p.m.

New subject: forLoop + strict State monad is much faster than foldl'

Em 29-04-2014 15:31, John Lato escreveu:

...

Regardless, that report is garbage. It's not worth the paper it's printed on! Maybe re-run it and upload new results?

That's very harsh, why is it garbage? Cheers, =) -- Felipe.

Carter Schonwald

6:55 p.m.

because the events are on millisecond and microsecond scales, yet the means are being reported on the order of seconds. This is a strong piece of evidence the the claims therein are fundamentally wrong, confusing and misleading and wrong. On Tue, Apr 29, 2014 at 2:34 PM, Felipe Lessa wrote:

...

Em 29-04-2014 15:31, John Lato escreveu:

...
Regardless, that report is garbage. It's not worth the paper it's printed on! Maybe re-run it and upload new results?

That's very harsh, why is it garbage?

Cheers, =)

-- Felipe.

_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

Niklas Hambüchen

7 p.m.

New subject: forLoop + strict State monad is much faster than foldl'

I think we are dealing with a much more trivial, technical issue here: It looks like http://htmlpreview.github.io and Criterion's Javascript do not play well together. If I wget the raw file, https://raw.githubusercontent.com/nh2/loop/master/results/bench-foldl-and-io..., and view it in my browser, all the number reported are suddenly correct. Can you check if that works for you? On 29/04/14 19:55, Carter Schonwald wrote:

...

because the events are on millisecond and microsecond scales, yet the means are being reported on the order of seconds.

This is a strong piece of evidence the the claims therein are fundamentally wrong, confusing and misleading and wrong.

Niklas Hambüchen

7:10 p.m.

New subject: forLoop + strict State monad is much faster than foldl'

Look, when I serve it via a different service, the report looks all fine: https://rawgit.com/nh2/loop/master/results/bench-foldl-and-iorefs-are-slow.h... I should probably switch it to hosting it via Github pages once I figure out how to do that comfortably.

...

This is a strong piece of evidence the the claims therein are fundamentally wrong, confusing and misleading and wrong.

Now back to the claims!

John Lato

8:59 p.m.

Yes, this link is much better, thanks. (Haven't had a chance to look at the claims yet). On Apr 29, 2014 12:10 PM, "Niklas Hambüchen" wrote:

...

Look, when I serve it via a different service, the report looks all fine:

https://rawgit.com/nh2/loop/master/results/bench-foldl-and-iorefs-are-slow.h...

I should probably switch it to hosting it via Github pages once I figure out how to do that comfortably.

...
This is a strong piece of evidence the the claims therein are fundamentally wrong, confusing and misleading and wrong.

Now back to the claims!

Patrick Wheeler

1 May 1 May

2 p.m.

Int has a tight loop for `EnumFromTo` http://git.haskell.org/packages/base.git/blob/HEAD:/GHC/Enum.lhs#l500 On the other hand look at `EnumFromTo` for word32 http://hackage.haskell.org/package/base-4.7.0.0/docs/src/GHC-Word.html#Word3... It calls `integralEnumFromTo` Which is 723http://git.haskell.org/packages/base.git/blob/52c0b09036c36f1ed928663abb2f29...integralEnumFromTo :: Integral a => a -> a -> [a] 724http://git.haskell.org/packages/base.git/blob/52c0b09036c36f1ed928663abb2f29...integralEnumFromTo n m = map fromInteger [toInteger n .. toInteger m] That is right you get converted to an Integer, then back again. http://git.haskell.org/packages/base.git/blob/52c0b09036c36f1ed928663abb2f29... This seems to be the problem with all loops that use `Word32` and `EnumFromTo`. It looks like you could add a better `EnumFromTo` for `Word32` and you would close the gap between the loops that are using `EnumFromTo` and those that are not. Patrick On Tue, Apr 29, 2014 at 11:35 AM, Niklas Hambüchen wrote:

...

Interestingly, the discrimination against Word32 does not end here:

When compiling with -fllvm ( http://htmlpreview.github.io/?https://github.com/nh2/loop/blob/master/result... ), we can see that the forLoop + strict State monad is completely compiled away to a no-op for Int, but not for Word32. _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

-- Patrick Wheeler Patrick.John.Wheeler@gmail.com Patrick.J.Wheeler@rice.edu Patrick.Wheeler@colorado.edu

John Lato

30 Apr 30 Apr

1:16 a.m.

On Tue, Apr 29, 2014 at 9:31 AM, Niklas Hambüchen wrote:

...

This is just a short notice that using

foldl' (+) 0 [0..100000::Int]

is over 10 times slower than using

flip execState 0 $ forLoop (0 :: Int) (< n) (+1) $ \i -> do x <- get put $! x + i

with `loopFor` as on https://github.com/nh2/loop.

So this is interesting. The forLoop code gets compiled into a nice loop in core over unboxed Ints. The foldl' function OTOH still goes via a list. I expect it's not foldl' itself that's slow, rather that it doesn't fuse with the list producers in current GHCs. This may be improved in the future. Especially as the vectorized foldl' does fuse.

...

Even using an IORef is twice as fast as the pure foldl' (but still 5 times slower than strict State).

The benchmark is at

http://htmlpreview.github.io/?https://github.com/nh2/loop/blob/master/result... .

(All benchmarks are linked from https://github.com/nh2/loop.)

You can see there that the problem is gone when using Vector.foldl', but only for Int - for Word32 it persists.

It seems that manual looping is beneficial even when dealing with prime examples of pure code that GHC ought to optimize well.

That's why I suggested using vector in the first place. But it seems that Vector.enumFromTo could use some optimizations to help with some common cases. John

Niklas Hambüchen

2:04 a.m.

New subject: forLoop + strict State monad is much faster than foldl'

On 30/04/14 02:16, John Lato wrote:

...

So this is interesting. The forLoop code gets compiled into a nice loop in core over unboxed Ints. The foldl' function OTOH still goes via a list. I expect it's not foldl' itself that's slow, rather that it doesn't fuse with the list producers in current GHCs. This may be improved in the future. Especially as the vectorized foldl' does fuse.

It bothers me especially as the comments around `enumFromTo` seem to explicitly talk about deforestation.

Joachim Breitner

11:42 a.m.

New subject: forLoop + strict State monad is much faster than foldl'

Hi, Am Dienstag, den 29.04.2014, 18:16 -0700 schrieb John Lato:

...

So this is interesting. The forLoop code gets compiled into a nice loop in core over unboxed Ints. The foldl' function OTOH still goes via a list. I expect it's not foldl' itself that's slow, rather that it doesn't fuse with the list producers in current GHCs. This may be improved in the future. Especially as the vectorized foldl' does fuse.

This should have improved in GHC 7.9, where I have fixed https://ghc.haskell.org/trac/ghc/ticket/7994 (making foldl a good consumer). Greetings, Joachim -- Joachim “nomeata” Breitner mail@joachim-breitner.de • http://www.joachim-breitner.de/ Jabber: nomeata@joachim-breitner.de • GPG-Key: 0xF0FBF51F Debian Developer: nomeata@debian.org

Daniel Fischer

12:35 p.m.

On Wednesday 30 April 2014, 13:42:36, Joachim Breitner wrote:

...

Hi,

Am Dienstag, den 29.04.2014, 18:16 -0700 schrieb John Lato:

...
So this is interesting. The forLoop code gets compiled into a nice loop in core over unboxed Ints. The foldl' function OTOH still goes via a list. I expect it's not foldl' itself that's slow, rather that it doesn't fuse with the list producers in current GHCs. This may be improved in the future. Especially as the vectorized foldl' does fuse.

This should have improved in GHC 7.9, where I have fixed https://ghc.haskell.org/trac/ghc/ticket/7994 (making foldl a good consumer).

Confirmed, foldl' (+) 0 [1 .. 5000000 :: Int] gets the very nice core Rec { Main.$wgo [Occ=LoopBreaker] :: GHC.Prim.Int# -> GHC.Prim.Int# -> GHC.Prim.Int# [GblId, Arity=2, Caf=NoCafRefs, Str=DmdType ] Main.$wgo = \ (w :: GHC.Prim.Int#) (ww :: GHC.Prim.Int#) -> case w of wild { __DEFAULT -> Main.$wgo (GHC.Prim.+# wild 1) (GHC.Prim.+# ww wild); 5000000 -> GHC.Prim.+# ww 5000000 } end Rec } with 7.9.20140425. The core for Integer is naturally not quite as nice, but still the list is eliminated. Thanks, Joachim. Cheers, Daniel

Joachim Breitner

11:41 a.m.

New subject: forLoop + strict State monad is much faster than foldl'

Hi, Am Dienstag, den 29.04.2014, 17:31 +0100 schrieb Niklas Hambüchen:

...

This is just a short notice that using

foldl' (+) 0 [0..100000::Int]

is over 10 times slower than using

flip execState 0 $ forLoop (0 :: Int) (< n) (+1) $ \i -> do x <- get put $! x + i

with `loopFor` as on https://github.com/nh2/loop.

Did you look at the GHC Core generated? When starting to investigate performance issues, reading Core simply is required – otherwise it’s like search for a lost plain while refusing to look under the water surface. So can you compile your code with -ddump-simpl (and possibly flags -dsuppress-uniques and -dsuppress-module-prefixes that make it more readable) and compare the results: Any obvious differences, like unboxing to Int# used in one place and not in the other? If you need help deciphering core, paste the relevant bits in a mail to this thread and we can help you. Greetings, Joachim -- Joachim “nomeata” Breitner mail@joachim-breitner.de • http://www.joachim-breitner.de/ Jabber: nomeata@joachim-breitner.de • GPG-Key: 0xF0FBF51F Debian Developer: nomeata@debian.org

Niklas Hambüchen

2:13 p.m.

New subject: forLoop + strict State monad is much faster than foldl'

On 30/04/14 12:41, Joachim Breitner wrote:

...

Did you look at the GHC Core generated? When starting to investigate performance issues, reading Core simply is required – otherwise it’s like search for a lost plain while refusing to look under the water surface.

Of course. A short look with ghc-core on import Data.List main = print (foldl' (*) 0 [0..100000::Int]) reveals that the original list traversal is still present: $wlgo = \ (ww_s1HD :: Int#) (w_s1HF :: [Int]) -> case w_s1HF of _ { [] -> ww_s1HD; : x_asn xs_aso -> case x_asn of _ { I# y_asz -> $wlgo (*# ww_s1HD y_asz) xs_aso } } There is not much point investigating further until collecting the responses from this list - as it turned out, you already fixed the problem. Thanks! For performance-critical code, it seems sensible for me to stick to my `forLoop` for now, as it works quite well with the existing GHCs and much less brain seems to be necessary to get that compile down to fast code.

Joachim Breitner

2:34 p.m.

New subject: forLoop + strict State monad is much faster than foldl'

Hi, Am Mittwoch, den 30.04.2014, 15:13 +0100 schrieb Niklas Hambüchen:

...

For performance-critical code, it seems sensible for me to stick to my `forLoop` for now, as it works quite well with the existing GHCs and much less brain seems to be necessary to get that compile down to fast code.

I agree. I think one big problem is that RULES (e.g. list fusion) are great when they work, but it is very hard for the “normal” Haskell programmer to confirm that they work as expected, let alone predict when they work. I don’t have a good solution, though. Greetings, Joachim -- Joachim “nomeata” Breitner mail@joachim-breitner.de • http://www.joachim-breitner.de/ Jabber: nomeata@joachim-breitner.de • GPG-Key: 0xF0FBF51F Debian Developer: nomeata@debian.org

David Thomas

2:57 p.m.

Maybe some means of asserting that a rule is applied? Hopefully by effect (this list should not appear in the core, or whatnot) rather than by naming the particular rule... On Wed, Apr 30, 2014 at 7:34 AM, Joachim Breitner wrote:

...

Hi,

Am Mittwoch, den 30.04.2014, 15:13 +0100 schrieb Niklas Hambüchen:

...
For performance-critical code, it seems sensible for me to stick to my `forLoop` for now, as it works quite well with the existing GHCs and much less brain seems to be necessary to get that compile down to fast code.

I agree. I think one big problem is that RULES (e.g. list fusion) are great when they work, but it is very hard for the “normal” Haskell programmer to confirm that they work as expected, let alone predict when they work. I don’t have a good solution, though.

Greetings, Joachim

-- Joachim “nomeata” Breitner mail@joachim-breitner.de • http://www.joachim-breitner.de/ Jabber: nomeata@joachim-breitner.de • GPG-Key: 0xF0FBF51F Debian Developer: nomeata@debian.org

_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

Patrick Wheeler

1 May 1 May

2:17 p.m.

@Niklas You can use the `foldM` form the FoldL package to achieve equal results as the fastest loops in your current benchmark. foldM will allow you to deal with your loops at a high level of abstraction though. See the following post, by Gabriel Gonzalez, for an example: http://www.haskellforall.com/2013/08/foldl-100-composable-streaming-and.html I have added the bench mark to my fork of your repo, and made a pull request: https://github.com/Davorak/loop It looks like the only reason that `foldM` does not preform well with `Word32` is because of the naive implementation of `enumFromTo` for Word32 as explained in my other email in more detail. Here is the Criterion report: http://htmlpreview.github.io/?https://github.com/Davorak/loop/blob/master/re... On Tue, Apr 29, 2014 at 11:31 AM, Niklas Hambüchen wrote:

...

This is just a short notice that using

foldl' (+) 0 [0..100000::Int]

is over 10 times slower than using

flip execState 0 $ forLoop (0 :: Int) (< n) (+1) $ \i -> do x <- get put $! x + i

with `loopFor` as on https://github.com/nh2/loop.

Even using an IORef is twice as fast as the pure foldl' (but still 5 times slower than strict State).

The benchmark is at

http://htmlpreview.github.io/?https://github.com/nh2/loop/blob/master/result... .

(All benchmarks are linked from https://github.com/nh2/loop.)

You can see there that the problem is gone when using Vector.foldl', but only for Int - for Word32 it persists.

It seems that manual looping is beneficial even when dealing with prime examples of pure code that GHC ought to optimize well.

Niklas _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

-- Patrick Wheeler Patrick.John.Wheeler@gmail.com Patrick.J.Wheeler@rice.edu Patrick.Wheeler@colorado.edu

Niklas Hambüchen

5:40 p.m.

New subject: forLoop + strict State monad is much faster than foldl'

Nice find, I merged your benchmark addition. Would you mind making a GHC issue from your `enumFromTo` findings? On 01/05/14 15:17, Patrick Wheeler wrote:

...

@Niklas

You can use the `foldM` form the FoldL package to achieve equal results as the fastest loops in your current benchmark. foldM will allow you to deal with your loops at a high level of abstraction though. See the following post, by Gabriel Gonzalez, for an example: http://www.haskellforall.com/2013/08/foldl-100-composable-streaming-and.html

I have added the bench mark to my fork of your repo, and made a pull request: https://github.com/Davorak/loop

It looks like the only reason that `foldM` does not preform well with `Word32` is because of the naive implementation of `enumFromTo` for Word32 as explained in my other email in more detail.

Here is the Criterion report: http://htmlpreview.github.io/?https://github.com/Davorak/loop/blob/master/re...

4079

Age (days ago)

4081

Last active (days ago)

List overview

Download

20 comments

8 participants

participants (8)

Carter Schonwald
Daniel Fischer
David Thomas
Felipe Lessa
Joachim Breitner
John Lato
Niklas Hambüchen
Patrick Wheeler

forLoop + strict State monad is much faster than foldl'

tags

participants (8)