How to write fast for loops

As you probably now, `forM_ [1..n]` is incredibly slow in Haskell due to lack of list fusion [1]; a manually written loop is 10x faster. How do you people work around this? I keep defining myself something like loop :: (Monad m) => Int -> (Int -> m ()) -> m () loop bex f = go 0 where go !n | n == bex = return () | otherwise = f n >> go (n+1) Is there a function for this somewhere already? Or do you have another way to deal with this problem? Thanks! [1]: https://ghc.haskell.org/trac/ghc/ticket/8763

I usually use a manually written loop, but you can use Data.Vector for this
and it should fuse.
John L.
On Apr 26, 2014 2:41 PM, "Niklas Hambüchen"
As you probably now, `forM_ [1..n]` is incredibly slow in Haskell due to lack of list fusion [1]; a manually written loop is 10x faster.
How do you people work around this?
I keep defining myself something like
loop :: (Monad m) => Int -> (Int -> m ()) -> m () loop bex f = go 0 where go !n | n == bex = return () | otherwise = f n >> go (n+1)
Is there a function for this somewhere already? Or do you have another way to deal with this problem?
Thanks!
[1]: https://ghc.haskell.org/trac/ghc/ticket/8763 _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

Hey John, do you mean something along the lines of `V.forM_ (fromList [1..n])`? I guess GHC wouldn't mash away the list in this case and I'd have to use the new OverloadedLists with `V.forM_ [1..n]` (indeed that looks quite sweet). However, would that actually be fast? Judging from how list syntax is desugared, I guess that would use `V.enumFromTo` then, whose docs claim "WARNING: This operation can be very inefficient. If at all possible, use enumFromN instead". Does anybody know wonder if "vector literals" desugar enumFromN, or what "can be very inefficient" means? Niklas On Sun 27 Apr 2014 01:13:52 BST, John Lato wrote:
I usually use a manually written loop, but you can use Data.Vector for this and it should fuse.

Hi Niklas, No, I mean this:
V.forM_ (V.enumFromN 1 n)
I agree it's less elegant. And like Daniel points out with lists, you
definitely don't want the "enumFromN" to be shared. But one of the biggest
warts of Haskell IMHO is that it's still very difficult to get really
high-performance code without an intimate understanding of the optimizer.
John L.
On Sat, Apr 26, 2014 at 6:16 PM, Niklas Hambüchen
Hey John,
do you mean something along the lines of `V.forM_ (fromList [1..n])`? I guess GHC wouldn't mash away the list in this case and I'd have to use the new OverloadedLists with `V.forM_ [1..n]` (indeed that looks quite sweet). However, would that actually be fast? Judging from how list syntax is desugared, I guess that would use `V.enumFromTo` then, whose docs claim "WARNING: This operation can be very inefficient. If at all possible, use enumFromN instead".
Does anybody know wonder if "vector literals" desugar enumFromN, or what "can be very inefficient" means?
Niklas
On Sun 27 Apr 2014 01:13:52 BST, John Lato wrote:
I usually use a manually written loop, but you can use Data.Vector for this and it should fuse.

John, can't a careful use of the identity monad or the like prevent that
CSE? (i'm a bit fuzzy on that piece of ghc lore)
On Sat, Apr 26, 2014 at 10:28 PM, John Lato
Hi Niklas,
No, I mean this:
V.forM_ (V.enumFromN 1 n)
I agree it's less elegant. And like Daniel points out with lists, you definitely don't want the "enumFromN" to be shared. But one of the biggest warts of Haskell IMHO is that it's still very difficult to get really high-performance code without an intimate understanding of the optimizer.
John L.
On Sat, Apr 26, 2014 at 6:16 PM, Niklas Hambüchen
wrote: Hey John,
do you mean something along the lines of `V.forM_ (fromList [1..n])`? I guess GHC wouldn't mash away the list in this case and I'd have to use the new OverloadedLists with `V.forM_ [1..n]` (indeed that looks quite sweet). However, would that actually be fast? Judging from how list syntax is desugared, I guess that would use `V.enumFromTo` then, whose docs claim "WARNING: This operation can be very inefficient. If at all possible, use enumFromN instead".
Does anybody know wonder if "vector literals" desugar enumFromN, or what "can be very inefficient" means?
Niklas
On Sun 27 Apr 2014 01:13:52 BST, John Lato wrote:
I usually use a manually written loop, but you can use Data.Vector for this and it should fuse.
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

On 27/04/14 03:28, John Lato wrote:
No, I mean this:
V.forM_ (V.enumFromN 1 n)
Hey John, unfortunately that's not fast, and it eats all the RAMs. Here comes the sad comparison of looping in Haskell: - loop (maxBound :: Word32) 3 seconds, constant space - forM_ [0..maxBound :: Word32] 36 seconds, constant space - V.forM_ (V.enumFromTo 0 (maxBound :: Word32)) 37 seconds, constant space - V.forM_ (V.enumFromN 0 (fromIntegral (maxBound :: Word32))) linear space -> crashes All loops execute `return ()`. :(

On Apr 27, 2014 1:10 PM, "Niklas Hambüchen"
On 27/04/14 03:28, John Lato wrote:
No, I mean this:
V.forM_ (V.enumFromN 1 n)
Hey John,
unfortunately that's not fast, and it eats all the RAMs.
Here comes the sad comparison of looping in Haskell:
- loop (maxBound :: Word32) 3 seconds, constant space
- forM_ [0..maxBound :: Word32] 36 seconds, constant space
- V.forM_ (V.enumFromTo 0 (maxBound :: Word32)) 37 seconds, constant space
- V.forM_ (V.enumFromN 0 (fromIntegral (maxBound :: Word32))) linear space -> crashes
All loops execute `return ()`.
Hmm. Are you using regular vectors or unboxed? Also what sort of crash? Is it a stack space overflow or are you on a 32-bit platform?

On 27/04/14 23:18, John Lato wrote:> Hmm. Are you using regular vectors or unboxed? Also what sort of crash?
Is it a stack space overflow or are you on a 32-bit platform?
I'm using regular Data.Vector as V. It's just eating my 8GB and then I kill it (or let my ulimit do it). It can't be the stack space since I'm on 7.6 where stack space is limited by default. It's on 64 bit Linux.

On Sun, Apr 27, 2014 at 3:37 PM, Niklas Hambüchen
On 27/04/14 23:18, John Lato wrote:> Hmm. Are you using regular vectors or unboxed? Also what sort of crash?
Is it a stack space overflow or are you on a 32-bit platform?
I'm using regular Data.Vector as V.
Are unboxed vectors faster? My rule of thumb is to use them over Data.Vector whenever possible.
It's just eating my 8GB and then I kill it (or let my ulimit do it). It can't be the stack space since I'm on 7.6 where stack space is limited by default.
It's on 64 bit Linux.
I would expect it's because you never force the argument. With `enumFromTo` the argument is forced because it needs to be checked for termination, but `enumFromN` is probably building up a big chain of thunks. I guess for this case `enumFromN` has no benefit over `enumFromTo` because the intention is to create a single loop instead of actually allocating the vector, so the warning in the documentation doesn't necessarily apply.

On 28/04/14 00:22, John Lato wrote:
Are unboxed vectors faster? My rule of thumb is to use them over Data.Vector whenever possible.
I haven't checked yet, but should it matter? Because my goal is that the vector never be created *at all*, and boxed or not shouldn't make a difference on that!
I would expect it's because you never force the argument. With `enumFromTo` the argument is forced because it needs to be checked for termination, but `enumFromN` is probably building up a big chain of thunks. I guess for this case `enumFromN` has no benefit over `enumFromTo` because the intention is to create a single loop instead of actually allocating the vector, so the warning in the documentation doesn't necessarily apply.
Also haven't checked that yet, but I suspect that instead of something thunk-related, the thing plainly allocates the vector. Just to clarify: `V.enumFromTo` works much better than `V.enumFromN` because in contrast to the latter it doesn't actually try to create the fully sized vector.

Niklas,
just for fun, and seeing as your goal is to make something that works
elegeantly and efficiently with list syntax, how about making a new type
that's an instance of OverloadedLists but never actually allocates
anything, so you can basically desugar "X.forM_ [a..b]" to your loop
function?
Conrad.
On 28 April 2014 09:49, Niklas Hambüchen
On 28/04/14 00:22, John Lato wrote:
Are unboxed vectors faster? My rule of thumb is to use them over Data.Vector whenever possible.
I haven't checked yet, but should it matter? Because my goal is that the vector never be created *at all*, and boxed or not shouldn't make a difference on that!
I would expect it's because you never force the argument. With `enumFromTo` the argument is forced because it needs to be checked for termination, but `enumFromN` is probably building up a big chain of thunks. I guess for this case `enumFromN` has no benefit over `enumFromTo` because the intention is to create a single loop instead of actually allocating the vector, so the warning in the documentation doesn't necessarily apply.
Also haven't checked that yet, but I suspect that instead of something thunk-related, the thing plainly allocates the vector.
Just to clarify: `V.enumFromTo` works much better than `V.enumFromN` because in contrast to the latter it doesn't actually try to create the fully sized vector. _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

Hey Conrad! That sounds like a nice idea. I might have a look at it after I have actually figured out what the fastest/best way is (see the benchmarks I just posted). Niklas On 28/04/14 01:00, Conrad Parker wrote:
Niklas,
just for fun, and seeing as your goal is to make something that works elegeantly and efficiently with list syntax, how about making a new type that's an instance of OverloadedLists but never actually allocates anything, so you can basically desugar "X.forM_ [a..b]" to your loop function?
Conrad.

On Sun, Apr 27, 2014 at 4:49 PM, Niklas Hambüchen
On 28/04/14 00:22, John Lato wrote:
Are unboxed vectors faster? My rule of thumb is to use them over Data.Vector whenever possible.
I haven't checked yet, but should it matter? Because my goal is that the vector never be created *at all*, and boxed or not shouldn't make a difference on that!
It can make a difference in that, with unboxed vectors, the compiler can statically determine that it is able to use unboxed values, and therefore is more likely to do so. Having finally broken down and run some tests, I can report that on my system using V.enumFromTo with unboxed vectors results in the same performance as the hand-written loop.
I would expect it's because you never force the argument. With `enumFromTo` the argument is forced because it needs to be checked for termination, but `enumFromN` is probably building up a big chain of thunks. I guess for this case `enumFromN` has no benefit over `enumFromTo` because the intention is to create a single loop instead of actually allocating the vector, so the warning in the documentation doesn't necessarily apply.
Also haven't checked that yet, but I suspect that instead of something thunk-related, the thing plainly allocates the vector.
Just to clarify: `V.enumFromTo` works much better than `V.enumFromN` because in contrast to the latter it doesn't actually try to create the fully sized vector.
I believe that if you check this with ghc-7.6.3 and -O2, you will discover that my analysis is correct :) However, I like Conrad's suggestion, looks like an interesting library.

I've just uploaded my benchmarks to https://github.com/nh2/loop. Please take a look. There are some interesting results. The first thing I don't understand at all is: http://htmlpreview.github.io/?https://github.com/nh2/loop/blob/master/result... See how w32/loop and w32/unsafeLoop are equally fast. Then look at http://htmlpreview.github.io/?https://github.com/nh2/loop/blob/master/result... Here I run the same thing over the whole of Word32. See how `loop` is faster here than `unsafeLoop`? How does that make sense? Next thing: It seems that V.enumFromTo and V.fromList are actually the same: http://hackage.haskell.org/package/vector-0.10.9.1/docs/src/Data-Vector-Fusi... However in my benchmark, at least for `Int`, the performance is different - am I overlooking something? On 28/04/14 01:34, John Lato wrote:
It can make a difference in that, with unboxed vectors, the compiler can statically determine that it is able to use unboxed values, and therefore is more likely to do so. Having finally broken down and run some tests, I can report that on my system using V.enumFromTo with unboxed vectors results in the same performance as the hand-written loop.
I cannot see a difference between Vector.enumFromTo and Vector.Unboxed.enumFromTo in my benchmark. Vector.enumFromTo is as fast as the hand-written loop, but only for `Int`. For `Word32`, it is 5 times slower. Any idea why?
> I would expect it's because you never force the argument. With > `enumFromTo` the argument is forced because it needs to be checked for > termination, but `enumFromN` is probably building up a big chain of > thunks. I guess for this case `enumFromN` has no benefit over > `enumFromTo` because the intention is to create a single loop instead of > actually allocating the vector, so the warning in the documentation > doesn't necessarily apply.
Also haven't checked that yet, but I suspect that instead of something thunk-related, the thing plainly allocates the vector.
Just to clarify: `V.enumFromTo` works much better than `V.enumFromN` because in contrast to the latter it doesn't actually try to create the fully sized vector.
I believe that if you check this with ghc-7.6.3 and -O2, you will discover that my analysis is correct :)
Ok, I think I understand now what you mean.

On Sun, Apr 27, 2014 at 7:23 PM, Niklas Hambüchen
I've just uploaded my benchmarks to https://github.com/nh2/loop.
Please take a look. There are some interesting results.
The first thing I don't understand at all is:
http://htmlpreview.github.io/?https://github.com/nh2/loop/blob/master/result...
See how w32/loop and w32/unsafeLoop are equally fast. Then look at
http://htmlpreview.github.io/?https://github.com/nh2/loop/blob/master/result...
Here I run the same thing over the whole of Word32. See how `loop` is faster here than `unsafeLoop`? How does that make sense?
Huh? In the comments you wrote: -- Note that some types (e.g. Word32) have bounds checks even for -- `toEnum`. Doesn't that explain it? For Int, toEnum/fromEnum is a noop, but on Word32 it's not.
Next thing:
It seems that V.enumFromTo and V.fromList are actually the same:
http://hackage.haskell.org/package/vector-0.10.9.1/docs/src/Data-Vector-Fusi...
However in my benchmark, at least for `Int`, the performance is different - am I overlooking something?
Probably with V.fromList, the list gets floated out? Just guessing, check the core!
On 28/04/14 01:34, John Lato wrote:
It can make a difference in that, with unboxed vectors, the compiler can statically determine that it is able to use unboxed values, and therefore is more likely to do so. Having finally broken down and run some tests, I can report that on my system using V.enumFromTo with unboxed vectors results in the same performance as the hand-written loop.
I cannot see a difference between Vector.enumFromTo and Vector.Unboxed.enumFromTo in my benchmark.
Vector.enumFromTo is as fast as the hand-written loop, but only for `Int`. For `Word32`, it is 5 times slower. Any idea why?
Ahh, you made me look at the core again. I think this is related to your observation about V.enumFromTo being the same as V.fromList. With Word32 the generated core shows that this goes via a list representation instead of a nice loop. Which makes me suspect there's some RULE that applies to Stream.enumFromTo that is firing in the first case but not the second. And if I build both versions with -ddump-rule-firings, indeed I see that the Int version has Rule fired: enumFromTo<Int> [Stream] With nothing comparable for the Word32 version. I'd imagine if you grep for that in the Vector sources, you'd find something interesting. The EnumFromN version does not seem to suffer from this (but again it's necessary to evaluate the argument).

Has anyone written a blog post to the effect of "Haskell: The Slow Parts"?
Or a heuristic for reliably identifying them, maybe? It sounds as though I
could really benefit from it. I had no idea about forM.
On Sun, Apr 27, 2014 at 10:24 PM, John Lato
On Sun, Apr 27, 2014 at 7:23 PM, Niklas Hambüchen
wrote: I've just uploaded my benchmarks to https://github.com/nh2/loop.
Please take a look. There are some interesting results.
The first thing I don't understand at all is:
http://htmlpreview.github.io/?https://github.com/nh2/loop/blob/master/result...
See how w32/loop and w32/unsafeLoop are equally fast. Then look at
http://htmlpreview.github.io/?https://github.com/nh2/loop/blob/master/result...
Here I run the same thing over the whole of Word32. See how `loop` is faster here than `unsafeLoop`? How does that make sense?
Huh? In the comments you wrote:
-- Note that some types (e.g. Word32) have bounds checks even for -- `toEnum`.
Doesn't that explain it? For Int, toEnum/fromEnum is a noop, but on Word32 it's not.
Next thing:
It seems that V.enumFromTo and V.fromList are actually the same:
http://hackage.haskell.org/package/vector-0.10.9.1/docs/src/Data-Vector-Fusi...
However in my benchmark, at least for `Int`, the performance is different - am I overlooking something?
Probably with V.fromList, the list gets floated out? Just guessing, check the core!
On 28/04/14 01:34, John Lato wrote:
It can make a difference in that, with unboxed vectors, the compiler can statically determine that it is able to use unboxed values, and therefore is more likely to do so. Having finally broken down and run some tests, I can report that on my system using V.enumFromTo with unboxed vectors results in the same performance as the hand-written loop.
I cannot see a difference between Vector.enumFromTo and Vector.Unboxed.enumFromTo in my benchmark.
Vector.enumFromTo is as fast as the hand-written loop, but only for `Int`. For `Word32`, it is 5 times slower. Any idea why?
Ahh, you made me look at the core again. I think this is related to your observation about V.enumFromTo being the same as V.fromList. With Word32 the generated core shows that this goes via a list representation instead of a nice loop. Which makes me suspect there's some RULE that applies to Stream.enumFromTo that is firing in the first case but not the second. And if I build both versions with -ddump-rule-firings, indeed I see that the Int version has
Rule fired: enumFromTo<Int> [Stream]
With nothing comparable for the Word32 version. I'd imagine if you grep for that in the Vector sources, you'd find something interesting.
The EnumFromN version does not seem to suffer from this (but again it's necessary to evaluate the argument).
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

http://book.realworldhaskell.org/read/profiling-and-optimization.html
Would be a start
On Monday, 28 April 2014, Mark Thom
Has anyone written a blog post to the effect of "Haskell: The Slow Parts"? Or a heuristic for reliably identifying them, maybe? It sounds as though I could really benefit from it. I had no idea about forM.
On Sun, Apr 27, 2014 at 10:24 PM, John Lato
javascript:_e(%7B%7D,'cvml','jwlato@gmail.com'); wrote:
On Sun, Apr 27, 2014 at 7:23 PM, Niklas Hambüchen
javascript:_e(%7B%7D,'cvml','mail@nh2.me'); wrote:
I've just uploaded my benchmarks to https://github.com/nh2/loop.
Please take a look. There are some interesting results.
The first thing I don't understand at all is:
http://htmlpreview.github.io/?https://github.com/nh2/loop/blob/master/result...
See how w32/loop and w32/unsafeLoop are equally fast. Then look at
http://htmlpreview.github.io/?https://github.com/nh2/loop/blob/master/result...
Here I run the same thing over the whole of Word32. See how `loop` is faster here than `unsafeLoop`? How does that make sense?
Huh? In the comments you wrote:
-- Note that some types (e.g. Word32) have bounds checks even for
-- `toEnum`.
Doesn't that explain it? For Int, toEnum/fromEnum is a noop, but on Word32 it's not.
Next thing:
It seems that V.enumFromTo and V.fromList are actually the same:
http://hackage.haskell.org/package/vector-0.10.9.1/docs/src/Data-Vector-Fusi...
However in my benchmark, at least for `Int`, the performance is different - am I overlooking something?
Probably with V.fromList, the list gets floated out? Just guessing, check the core!
On 28/04/14 01:34, John Lato wrote:
It can make a difference in that, with unboxed vectors, the compiler can statically determine that it is able to use unboxed values, and therefore is more likely to do so. Having finally broken down and run some tests, I can report that on my system using V.enumFromTo with unboxed vectors results in the same performance as the hand-written loop.
I cannot see a difference between Vector.enumFromTo and Vector.Unboxed.enumFromTo in my benchmark.
Vector.enumFromTo is as fast as the hand-written loop, but only for `Int`. For `Word32`, it is 5 times slower. Any idea why?
Ahh, you made me look at the core again. I think this is related to your observation about V.enumFromTo being the same as V.fromList. With Word32 the generated core shows that this goes via a list representation instead of a nice loop. Which makes me suspect there's some RULE that applies to Stream.enumFromTo that is firing in the first case but not the second. And if I build both versions with -ddump-rule-firings, indeed I see that the Int version has
Rule fired: enumFromTo<Int> [Stream]
With nothing comparable for the Word32 version. I'd imagine if you grep for that in the Vector sources, you'd find something interesting.
The EnumFromN version does not seem to suffer from this (but again it's necessary to evaluate the argument).
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.orgjavascript:_e(%7B%7D,'cvml','Haskell-Cafe@haskell.org'); http://www.haskell.org/mailman/listinfo/haskell-cafe

After some evaluation, here come some results of my benchmarks, and the final function I've packaged: * It is tricky to write an Enum-based general loop function that is fast for all data types. The way overflow checks are implemented in different types makes e.g. Int and Word32 behave differently; Word32 is slower. * [a..b] is nice and can be just as fast as a manual loop - if you are lucky. Unfortunately, if you don't want to rely on luck for your program's performance, you can't have your [a..b] (at the time being). * Vector's equivalent of [a..b] *might* not be as luck dependent, but suffers from a factor 5 penalty, which is hopefully a but (see John's post). Current most appealing solution for fast loops ---------------------------------------------- Unfortunately, this seems to be the most robustly fast (across all types I have tested it with) loop: forLoop :: (Monad m) => a -> (a -> Bool) -> (a -> a) -> (a -> m ()) -> m () forLoop start cond inc f = go start where go !x | cond x = f x >> go (inc x) | otherwise = return () And you can use it as forLoop 0 (< n) (+1) $ \i -> ... Now, you probably don't like this, and neither do I, but it really looks like this is what you should use if you want to write high performance code :/ What also works is a `numLoop :: (Num a, Eq a, Monad m) => a -> a -> (a -> m ()) -> m ()` with start and end value like [a..b], but the `forLoop` aboves gives more flexibility and you can use either (+1) or `succ` with it depending on whether you want performance or overflow-safety. Does it really matter? ---------------------- One might think that in many cases the loop incrementing performance doesn't matter too much, since the loop body should dominate the time spent. That is true, but it is easy to get into cases where it doesn't. Some examples: * You want to write a test to make sure that reinterpret-casting a Word32 to Float and back gives you the same Word32 [1]. Using `forLoop` can make the difference if you have to wait 40 seconds for your test to pass or 3 seconds. (This is how I got into this topic.) * You want to implement something that does trivial operations on unboxed vectors, say dot product or matrix multiplication. `forLoop` can make a 5x performance difference. Shouldn't I write something this trivial ad-hoc? ------------------------------------------------ I've packed `forLoop` into https://github.com/nh2/loop and uploaded it to Hackage (http://hackage.haskell.org/package/loop). * Maintaining the fastest way to loop in one place frees one from thinking about it, and I plan to keep this updated with the fastest implementation known (contributions welcome). * It gives us a place to discuss alternatives and collect benchmarks for high-performance looping. Let's hope it becomes useless soon! [1]: http://stackoverflow.com/a/7002812/263061

On Tue, Apr 29, 2014 at 04:39:35AM +0100, Niklas Hambüchen wrote:
Current most appealing solution for fast loops ----------------------------------------------
Unfortunately, this seems to be the most robustly fast (across all types I have tested it with) loop:
forLoop :: (Monad m) => a -> (a -> Bool) -> (a -> a) -> (a -> m ()) -> m () forLoop start cond inc f = go start where go !x | cond x = f x >> go (inc x) | otherwise = return ()
Regarding the used stack space, wouldn't the following solution be even better? forLoop :: (Monad m) => a -> (a -> Bool) -> (a -> a) -> (a -> m ()) -> m () forLoop start cond inc f = go start (return ()) where go !x m | cond x = go (inc x, m >> f x) | otherwise = m Greetings, Daniel

Daniel Trstenjak wrote:
Regarding the used stack space, wouldn't the following solution be even better?
forLoop :: (Monad m) => a -> (a -> Bool) -> (a -> a) -> (a -> m ()) -> m () forLoop start cond inc f = go start (return ()) where go !x m | cond x = go (inc x, m >> f x) | otherwise = m
This version would build a thunk of shape (return () >> f x1 >> ... before any of the f xi calls get a chance to run; to make matters worse, evaluating that thunk *will* use the evaluation stack. The go !x | cond x = f x >> go (inc x) | otherwise = return () version is better. The guideline here is the strictness of (>>): in most monads, (>>) is strict in its first argument but lazy in its second one; furthermore, the second argument is used at most once in typical monad stacks. So the evaluation stack is not used up at all: Evaluating 'go x' will suspend the 'go (inc x)' call while 'f x' is being run; once 'f x' returns, the 'go (inc x)' is taken care of, which is now effectively a tail call. Laziness saves the day. Cheers, Bertram

On Tue, Apr 29, 2014 at 03:03:54PM +0200, Bertram Felgenhauer wrote:
This version would build a thunk of shape (return () >> f x1 >> ... before any of the f xi calls get a chance to run; to make matters worse, evaluating that thunk *will* use the evaluation stack.
Ok, I can see, that it has to build the whole thunk, because otherwise 'go' couldn't return anything. But why would the evaluation of the thunk use the stack? Why does the evaluation of '>>' in this case use the stack and in the other it doesn't?
The
go !x | cond x = f x >> go (inc x) | otherwise = return ()
version is better. The guideline here is the strictness of (>>): in most monads, (>>) is strict in its first argument but lazy in its second one; furthermore, the second argument is used at most once in typical monad stacks. So the evaluation stack is not used up at all: Evaluating 'go x' will suspend the 'go (inc x)' call while 'f x' is being run; once 'f x' returns, the 'go (inc x)' is taken care of, which is now effectively a tail call. Laziness saves the day.
I think that I was mostly irritated by 'mapM', which uses the stack, but that's because of it's use of '>>=', and it's bound value has to be kept on the stack. Thanks! Greetings, Daniel

On 28/04/14 05:24, John Lato wrote:
Doesn't that explain it? For Int, toEnum/fromEnum is a noop, but on Word32 it's not.
No, I'm not talking about the comparison between Word32 and Int: I'm looking at Word32 only, "w32/loop vs w32/unsafeLoop" in the one benchmark, and "loop vs unsafeLoop" in the other one. That's the same functions over the same data type, just that the second one iterates over more of that data type.
Ahh, you made me look at the core again. I think this is related to your observation about V.enumFromTo being the same as V.fromList. With Word32 the generated core shows that this goes via a list representation instead of a nice loop. Which makes me suspect there's some RULE that applies to Stream.enumFromTo that is firing in the first case but not the second. And if I build both versions with -ddump-rule-firings, indeed I see that the Int version has
Rule fired: enumFromTo<Int> [Stream]
With nothing comparable for the Word32 version. I'd imagine if you grep for that in the Vector sources, you'd find something interesting.
Nice observation. I filed this as https://github.com/haskell/vector/issues/21. Hopefully something comes out of it.

There has just been a fix for vector that improves on the bad performance of Word32 as compared to Int (https://github.com/haskell/vector/issues/21#issuecomment-44154305). The new benchmarks (https://rawgit.com/nh2/loop/master/results/bench-vector-bbd726d.html) now show Vector+Word32 on par, but functions in base, especially forM_, are still horrendously slow on Word32. On 28/04/14 05:24, John Lato wrote:
Ahh, you made me look at the core again. I think this is related to your observation about V.enumFromTo being the same as V.fromList. With Word32 the generated core shows that this goes via a list representation instead of a nice loop. Which makes me suspect there's some RULE that applies to Stream.enumFromTo that is firing in the first case but not the second. And if I build both versions with -ddump-rule-firings, indeed I see that the Int version has
Rule fired: enumFromTo<Int> [Stream]
With nothing comparable for the Word32 version. I'd imagine if you grep for that in the Vector sources, you'd find something interesting.
The EnumFromN version does not seem to suffer from this (but again it's necessary to evaluate the argument).

On Saturday 26 April 2014, 22:40:50, Niklas Hambüchen wrote:
As you probably now, `forM_ [1..n]` is incredibly slow in Haskell due to lack of list fusion [1];
It's not the lack of list fusion per se. If you have a forM_ [a .. b] $ \n -> do stuff where the type is Int, GHC is perfectly capable of eliminating the list and rewriting it to a loop (usually on par with a hand-written loop, although the core you get from a hand-written loop often is smaller and nicer to look at). The problem is that you use the same list multiple times, and GHC thinks "hey, let me re-use that", so it gets floated to the top-level, and the inner "loop" really uses the list, which apart from the `case`s on the list forces it to use boxed 'Int's instead of Int#. Then the boxed Ints need to be unboxed to be used, oops. If you manage to conceal the fact that the lists are the same from GHC, it eliminates the lists and you get fast code also with forM_. In your matmult example, using forM_ [k-k .. _SIZE-1] $ \j -> do ... does the trick for GHC-7.0 to 7.8. That is of course a brittle workaround, future GHCs might rewrite the k-k to 0 and share the list again.
a manually written loop is 10x faster.
How do you people work around this?
Do the idiomatic thing and write a loop ;)
Cheers, Daniel

Thanks for the insight! There's one thing I don't understand yet: On 27/04/14 02:32, Daniel Fischer wrote:
The problem is that you use the same list multiple times, and GHC thinks "hey, let me re-use that", so it gets floated to the top-level, and the inner "loop" really uses the list, which apart from the `case`s on the list forces it to use boxed 'Int's instead of Int#. Then the boxed Ints need to be unboxed to be used, oops.
When you say that the problem is that I "use the same list multiple times", do you mean that I use the actual syntactic expression `[0.._SIZE-1]` multiple times on multiple lines, and suggest that this should not happen if I use it in only one place? If so, how far does that go, and would GHC even notice that it is the same as a potential `[0.._511]` I might write elsewhere?
Do the idiomatic thing and write a loop ;)
Unfortunately, `forM_ [1..n]` is pretty much the most idiomatic and beautiful way I can come up with when ask for an iteration over 1 up to n! So this better be fixed :)

On Sunday 27 April 2014, 02:46:07, Niklas Hambüchen wrote:
Thanks for the insight! There's one thing I don't understand yet:
On 27/04/14 02:32, Daniel Fischer wrote:
The problem is that you use the same list multiple times, and GHC thinks "hey, let me re-use that", so it gets floated to the top-level, and the inner "loop" really uses the list, which apart from the `case`s on the list forces it to use boxed 'Int's instead of Int#. Then the boxed Ints need to be unboxed to be used, oops.
When you say that the problem is that I "use the same list multiple times", do you mean that I use the actual syntactic expression `[0.._SIZE-1]` multiple times on multiple lines, and suggest that this should not happen if I use it in only one place?
More or less. The `_SIZE - 1` is constant-folded to 511, so writing the list as [0 .. 511] would probably not prevent the sharing. And using the same list in a different function may or may not affect the produced code, I don't know GHC well enough to predict that. If you write it in a form that GHC doesn't recognise as the same value, GHC can and does optimise it as if the value were used only once.
If so, how far does that go, and would GHC even notice that it is the same as a potential `[0.._511]` I might write elsewhere?
Depends on "elsewhere". I've seen GHC produce multiple top-level values for the very same thing used in different functions in the same module, so if "elsewhere" means a different function and inlining doesn't cause GHC to see the two at the same time, it will probably not notice. But ask somebody who knows how GHC works if you want to be sure.
Do the idiomatic thing and write a loop ;)
Unfortunately, `forM_ [1..n]` is pretty much the most idiomatic and beautiful way I can come up with when ask for an iteration over 1 up to n!
Well, my idiom is looping, because.
So this better be fixed :)
Agreed.

On Sun, Apr 27, 2014 at 03:32:12AM +0200, Daniel Fischer wrote:
On Saturday 26 April 2014, 22:40:50, Niklas Hambüchen wrote:
As you probably now, `forM_ [1..n]` is incredibly slow in Haskell due to lack of list fusion [1];
It's not the lack of list fusion per se. If you have a
forM_ [a .. b] $ \n -> do stuff
where the type is Int, GHC is perfectly capable of eliminating the list and rewriting it to a loop (usually on par with a hand-written loop, although the core you get from a hand-written loop often is smaller and nicer to look at).
The problem is that you use the same list multiple times, and GHC thinks "hey, let me re-use that", so it gets floated to the top-level
If this is the case, does -fno-full-laziness help?

On Sunday 27 April 2014, 08:47:32, Tom Ellis wrote:
On Sun, Apr 27, 2014 at 03:32:12AM +0200, Daniel Fischer wrote:
The problem is that you use the same list multiple times, and GHC thinks "hey, let me re-use that", so it gets floated to the top-level
If this is the case, does -fno-full-laziness help?
Yes and no. It prevents the list from being floated to the top-level, but the overall code you get is slower than with the shared list. So it works as far as preventing unwanted sharing is concerned, but it inhibits optimisations at other places.
participants (11)
-
Bertram Felgenhauer
-
Carter Schonwald
-
Conrad Parker
-
Daniel Fischer
-
Daniel Trstenjak
-
Don Stewart
-
John Lato
-
Mark Thom
-
Niklas Hambüchen
-
Tom Ellis
-
Yitzchak Gale