When, if ever, does Haskell "calculate once"?

older
Question about FRP trigger based...

Travis Erdman

6 May 2010 6 May '10

6:35 p.m.

Two questions here, I think they might be related, perhaps even the same, but I am not sure, so I will ask both: Q1: Re the function f below, I like the first implementation as it's "cleaner", but is the second implementation necessary for performance purposes? -- g = some CPU intensive function -- alternate 1 f a b = c + (g b) where c = dosomethingelse a (g b) -- alternate 2 f a b = c + saveit where saveit = g b c = dosomethingelse a saveit Q2: Imagine that I might be creating a custom data structure. I need to access both the data in that structure, but also "features" (don't know proper comp sci term) of the data structure. For instance, consider a Haskell list. The stuff in the list is the data, whereas the length of the list is what I am calling a "feature" here. Some of my features might be quite a bit more computationally intensive than "length", and may be referenced multiple times in many different places in my code. For performance purposes, should I consider modifying my data structure to embed these "features" as part of the data to ensure that it is only calculated once? Or can I rely on Haskell to do that for me? For instance, do I need to create the equivalent of: data MyList a = MyList {mylist::[a], mylength::Int} Thanks again for all your generous advice.

Attachments:

attachment.html (text/html — 1.8 KB)

Show replies by date

Daniel Fischer

6 May 6 May

7:30 p.m.

On Thursday 06 May 2010 20:35:15, Travis Erdman wrote:

...

Two questions here, I think they might be related, perhaps even the same, but I am not sure, so I will ask both:

Q1: Re the function f below, I like the first implementation as it's "cleaner", but is the second implementation necessary for performance purposes?

-- g = some CPU intensive function

-- alternate 1 f a b = c + (g b) where c = dosomethingelse a (g b)

A Haskell implementation *might* calculate (g b) only once, but very probably it will calculate it twice.

...

-- alternate 2 f a b = c + saveit where saveit = g b c = dosomethingelse a saveit

A Haskell implementation is allowed to calculate saveit twice, but a decent one won't. If you want to share the result of a computation, give a name to it. One reason why sharing the common subexpression (g b) in the first snippet is not done is that it often is a terrible idea. If, for example, g b is a large list, sharing hogs a lot of memory. If dosomethingelse and (+) consume that list in a suitable manner, only a small portion of it need be in memory at any given time. Of course, if (g b) takes only very little memory (certainly if it is an Int, Word, ..., Integer [almost certainly]), automatic sharing would be a good idea. However, looking for common subexpressions to share depending on their type would be even more complicated than looking for all common subexpressions, so probably nobody implemented it.

...

Q2: Imagine that I might be creating a custom data structure. I need to access both the data in that structure, but also "features" (don't know proper comp sci term) of the data structure. For instance, consider a Haskell list. The stuff in the list is the data, whereas the length of the list is what I am calling a "feature" here. Some of my features might be quite a bit more computationally intensive than "length", and may be referenced multiple times in many different places in my code. For performance purposes, should I consider modifying my data structure to embed these "features" as part of the data to ensure that it is only calculated once?

Yes, unless these features eat too much memory (then recalculating might be cheaper). If you need the features only for the big total, a wrapper analogous to MyList below is a good choice data FeatureThing = FT { feature1, feature2 :: Int, feature3 :: Bool, ..., thing :: Thing } , otherwise incorporate the features directly in Thing (like the size is incorporated in Set/Map).

...

Or can I rely on Haskell to do that for me?

No, you can't rely on that.

...

For instance, do I need to create the equivalent of:

data MyList a = MyList {mylist::[a], mylength::Int}

Thanks again for all your generous advice.

Brent Yorgey

7:37 p.m.

On Thu, May 06, 2010 at 11:35:15AM -0700, Travis Erdman wrote:

...

Two questions here, I think they might be related, perhaps even the same, but I am not sure, so I will ask both:

Q1: Re the function f below, I like the first implementation as it's "cleaner", but is the second implementation necessary for performance purposes?

-- g = some CPU intensive function

-- alternate 1 f a b = c + (g b) where c = dosomethingelse a (g b) -- alternate 2 f a b = c + saveit where saveit = g b c = dosomethingelse a saveit

You need alternative 2. In general, GHC (and, I imagine, other Haskell compilers) do not do much common subexpression elimination, because in some cases it can be a *pessimization*. The classic example is length [1..1000000] + length [1..1000000] vs let a = [1..1000000] in length a + length a The first will execute in constant space, since each list will be lazily generated as needed by the length function and then the garbage collector will come along behind length and get rid of the nodes that have already been processed. However, in the second expression, the garbage collector cannot get rid of the nodes that are already processed by the first call to length, since the second call to length still needs the list. So the entire list [1..1000000] will end up being in memory at once. So, if you want to be sure that something is only computed once, you must introduce the necessary sharing yourself by giving it a name.

...

Q2: Imagine that I might be creating a custom data structure. I need to access both the data in that structure, but also "features" (don't know proper comp sci term) of the data structure. For instance, consider a Haskell list. The stuff in the list is the data, whereas the length of the list is what I am calling a "feature" here. Some of my features might be quite a bit more computationally intensive than "length", and may be referenced multiple times in many different places in my code. For performance purposes, should I consider modifying my data structure to embed these "features" as part of the data to ensure that it is only calculated once? Or can I rely on Haskell to do that for me? For instance, do I need to create the equivalent of:

data MyList a = MyList {mylist::[a], mylength::Int}

There's no magic going on here, if you call a function to compute some complicated feature of a data structure multiple places in your code, it will be computed multiple times, just like in any other language. Caching the features you need as in the above example is a good idea if the data structures won't change often, and you really do need the features many times. -Brent

Maciej Piechotka

8:49 p.m.

On Thu, 2010-05-06 at 15:37 -0400, Brent Yorgey wrote:

...

...
Two questions here, I think they might be related, perhaps even the same, but I am not sure, so I will ask both:

Q1: Re the function f below, I like the first implementation as it's "cleaner", but is the second implementation necessary for

On Thu, May 06, 2010 at 11:35:15AM -0700, Travis Erdman wrote: performance purposes?

...
-- g = some CPU intensive function

-- alternate 1 f a b = c + (g b) where c = dosomethingelse a (g b)

-- alternate 2 f a b = c + saveit where saveit = g b c = dosomethingelse a saveit

You need alternative 2. In general, GHC (and, I imagine, other Haskell compilers) do not do much common subexpression elimination, because in some cases it can be a *pessimization*. The classic example is

length [1..1000000] + length [1..1000000]

vs

let a = [1..1000000] in length a + length a

The first will execute in constant space, since each list will be lazily generated as needed by the length function and then the garbage collector will come along behind length and get rid of the nodes that have already been processed. However, in the second expression, the garbage collector cannot get rid of the nodes that are already processed by the first call to length, since the second call to length still needs the list. So the entire list [1..1000000] will end up being in memory at once.

Hmm:

...

cFoldr :: (a -> b -> b) -> (a -> c -> c) -> (b, c) -> [a] -> (b, c) cFoldr f g ~(b, c) [] = (b, c) cFoldr f g ~(b, c) (x:xs) = let (b', c') = cFoldr f g (b, c) xs in (f x b', g x c')

cFoldl' :: (b -> a -> b) -> (c -> a -> c) -> (b, c) -> [a] -> (b, c) cFoldl' f g bc xs0 = lgo bc xs0 where lgo (b, c) [] = (b, c) lgo (b, c) (x:xs) = let b' = f b x c' = g c x bc' = b' `seq` c' `seq` (b', c') in bc' `seq` lgo bc' xs

lengthFold :: Int -> a -> Int lengthFold !n _ = n + 1

size = 100000000

...

main = let a = [1..size] l = length a + length a in print $! l

200000000 8,031,190,480 bytes allocated in the heap 741,264 bytes copied during GC 1,920 bytes maximum residency (1 sample(s)) 28,216 bytes maximum slop 1 MB total memory in use (0 MB lost due to fragmentation) Generation 0: 15318 collections, 0 parallel, 0.14s, 0.18s elapsed Generation 1: 1 collections, 0 parallel, 0.00s, 0.00s elapsed INIT time 0.00s ( 0.00s elapsed) MUT time 7.15s ( 7.26s elapsed) GC time 0.14s ( 0.18s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 7.29s ( 7.44s elapsed) %GC time 1.9% (2.5% elapsed) Alloc rate 1,123,257,562 bytes per MUT second Productivity 98.1% of total user, 96.1% of total elapsed ./a.out +RTS -s 7.29s user 0.05s system 98% cpu 7.450 total

...

main = print $! length [1..size] + length [1..size]

200000000 16,062,318,576 bytes allocated in the heap 1,476,904 bytes copied during GC 2,000 bytes maximum residency (1 sample(s)) 27,992 bytes maximum slop 1 MB total memory in use (0 MB lost due to fragmentation) Generation 0: 30637 collections, 0 parallel, 0.29s, 0.37s elapsed Generation 1: 1 collections, 0 parallel, 0.00s, 0.00s elapsed INIT time 0.00s ( 0.00s elapsed) MUT time 15.57s ( 15.90s elapsed) GC time 0.29s ( 0.37s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 15.87s ( 16.27s elapsed) %GC time 1.8% (2.3% elapsed) Alloc rate 1,031,313,475 bytes per MUT second Productivity 98.2% of total user, 95.7% of total elapsed ./a.out +RTS -s 15.87s user 0.11s system 98% cpu 16.272 total

...

main = print $! uncurry (+) (cFoldl' lengthFold lengthFold (0, 0) [1..size])

200000000 13,652,979,960 bytes allocated in the heap 2,089,600 bytes copied during GC 2,056 bytes maximum residency (1 sample(s)) 27,984 bytes maximum slop 1 MB total memory in use (0 MB lost due to fragmentation) Generation 0: 26041 collections, 0 parallel, 0.30s, 0.39s elapsed Generation 1: 1 collections, 0 parallel, 0.00s, 0.00s elapsed INIT time 0.00s ( 0.00s elapsed) MUT time 21.92s ( 23.44s elapsed) GC time 0.30s ( 0.39s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 22.22s ( 23.84s elapsed) %GC time 1.3% (1.7% elapsed) Alloc rate 622,864,614 bytes per MUT second Productivity 98.7% of total user, 92.0% of total elapsed ./a.out +RTS -s 22.22s user 0.16s system 93% cpu 23.843 total All are compiled with optimizations. Regards

Daniel Fischer

9:46 p.m.

New subject: When, if ever, does Haskell "calculate once"?

On Thursday 06 May 2010 22:49:40, Maciej Piechotka wrote:

...

On Thu, 2010-05-06 at 15:37 -0400, Brent Yorgey wrote:

...
On Thu, May 06, 2010 at 11:35:15AM -0700, Travis Erdman wrote:

...
Two questions here, I think they might be related, perhaps even the

same, but I am not sure, so I will ask both:

...
Q1: Re the function f below, I like the first implementation as

it's "cleaner", but is the second implementation necessary for performance purposes?

...
-- g = some CPU intensive function

-- alternate 1 f a b = c + (g b) where c = dosomethingelse a (g b)

-- alternate 2 f a b = c + saveit where saveit = g b c = dosomethingelse a saveit

You need alternative 2. In general, GHC (and, I imagine, other Haskell compilers) do not do much common subexpression elimination, because in some cases it can be a *pessimization*. The classic example is

length [1..1000000] + length [1..1000000]

vs

let a = [1..1000000] in length a + length a

Not a particularly fortunate example, with optimisations turned on, GHC shares the common subexpression (length a), so that's calculated only once: share :: Int share = let a = [1 .. 100000000] in length a + length a Core: Share.share :: GHC.Types.Int GblId [Str: DmdType] Share.share = case GHC.List.$wlen @ GHC.Integer.Type.Integer Share.share_a 0 of ww_amc { __DEFAULT -> GHC.Types.I# (GHC.Prim.+# ww_amc ww_amc) } Consider let a = [1 .. 100000000] in length a + sum a vs length [1 .. 100000000] + sum [1 .. 100000000]

...

...
The first will execute in constant space, since each list will be lazily generated as needed by the length function and then the garbage collector will come along behind length and get rid of the nodes that have already been processed. However, in the second expression, the garbage collector cannot get rid of the nodes that are already processed by the first call to length, since the second call to length still needs the list. So the entire list [1..1000000] will end up being in memory at once.

Hmm:

...
cFoldr :: (a -> b -> b) -> (a -> c -> c) -> (b, c) -> [a] -> (b, c) cFoldr f g ~(b, c) [] = (b, c) cFoldr f g ~(b, c) (x:xs) = let (b', c') = cFoldr f g (b, c) xs in (f x b', g x c')

cFoldl' :: (b -> a -> b) -> (c -> a -> c) -> (b, c) -> [a] -> (b, c) cFoldl' f g bc xs0 = lgo bc xs0 where lgo (b, c) [] = (b, c) lgo (b, c) (x:xs) = let b' = f b x c' = g c x bc' = b' `seq` c' `seq` (b', c') in bc' `seq` lgo bc' xs

No. cFoldl' f g (b0,c0) xs0 = lgo b0 c0 xs0 where lgo b c [] = (b,c) lgo !b !c (x:xs) = lgo (f b x) (g c x) xs

...

...
lengthFold :: Int -> a -> Int lengthFold !n _ = n + 1

size = 100000000

main = let a = [1..size] l = length a + length a in print $! l

200000000 8,031,190,480 bytes allocated in the heap

64-bit system? I get 200000000 4,015,621,900 bytes allocated in the heap 187,840 bytes copied during GC 1,472 bytes maximum residency (1 sample(s)) 29,892 bytes maximum slop 1 MB total memory in use (0 MB lost due to fragmentation) Generation 0: 7659 collections, 0 parallel, 0.10s, 0.09s elapsed Generation 1: 1 collections, 0 parallel, 0.00s, 0.00s elapsed INIT time 0.00s ( 0.00s elapsed) MUT time 3.38s ( 3.41s elapsed) GC time 0.10s ( 0.09s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 3.48s ( 3.51s elapsed) %GC time 2.9% (2.7% elapsed) Alloc rate 1,187,979,304 bytes per MUT second Productivity 97.1% of total user, 96.4% of total elapsed

...

741,264 bytes copied during GC 1,920 bytes maximum residency (1 sample(s)) 28,216 bytes maximum slop 1 MB total memory in use (0 MB lost due to fragmentation)

Generation 0: 15318 collections, 0 parallel, 0.14s, 0.18s elapsed Generation 1: 1 collections, 0 parallel, 0.00s, 0.00s elapsed

INIT time 0.00s ( 0.00s elapsed) MUT time 7.15s ( 7.26s elapsed) GC time 0.14s ( 0.18s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 7.29s ( 7.44s elapsed)

%GC time 1.9% (2.5% elapsed)

Alloc rate 1,123,257,562 bytes per MUT second

Productivity 98.1% of total user, 96.1% of total elapsed

./a.out +RTS -s 7.29s user 0.05s system 98% cpu 7.450 total

...
main = print $! length [1..size] + length [1..size]

200000000 16,062,318,576 bytes allocated in the heap 1,476,904 bytes copied during GC 2,000 bytes maximum residency (1 sample(s)) 27,992 bytes maximum slop 1 MB total memory in use (0 MB lost due to fragmentation)

Generation 0: 30637 collections, 0 parallel, 0.29s, 0.37s elapsed Generation 1: 1 collections, 0 parallel, 0.00s, 0.00s elapsed

INIT time 0.00s ( 0.00s elapsed) MUT time 15.57s ( 15.90s elapsed) GC time 0.29s ( 0.37s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 15.87s ( 16.27s elapsed)

%GC time 1.8% (2.3% elapsed)

Alloc rate 1,031,313,475 bytes per MUT second

Productivity 98.2% of total user, 95.7% of total elapsed

./a.out +RTS -s 15.87s user 0.11s system 98% cpu 16.272 total

And that is strange, because I get the same figures for that one as for the first (times differ by a few hundredths of a second). Is that a difference between 32-bit code generator and 64-bit or between GHC versions (6.12.2 here, but 6.10.3 gives roughly the same results)?

...

...
main = print $! uncurry (+) (cFoldl' lengthFold lengthFold (0, 0) [1..size])

And that gives the same figures as the other two (plus/minus 0.05s).

...

All are compiled with optimizations.

All compiled with -O2.

Maciej Piechotka

7 May 7 May

1:15 a.m.

New subject: [Long, probably not-beginners anymore] Parallel folds and folds as arrows (was: Re: Re: When, if ever, does Haskell "calculate once"?)

On Thu, 2010-05-06 at 23:46 +0200, Daniel Fischer wrote:

...

Share.share :: GHC.Types.Int GblId [Str: DmdType] Share.share = case GHC.List.$wlen @ GHC.Integer.Type.Integer Share.share_a 0 of ww_amc { __DEFAULT -> GHC.Types.I# (GHC.Prim.+# ww_amc ww_amc) }

Hmm. What's the name of this form and how to get it?

...

No.

cFoldl' f g (b0,c0) xs0 = lgo b0 c0 xs0 where lgo b c [] = (b,c) lgo !b !c (x:xs) = lgo (f b x) (g c x) xs

Ok. Fixed (I tried fast rewrite from foldr')

...

64-bit system? I get

64 bit, GHC 6.12.2. % ghc -V The Glorious Glasgow Haskell Compilation System, version 6.12.2 % file a.out a.out: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.9, not stripped

...

And that is strange, because I get the same figures for that one as for the first (times differ by a few hundredths of a second).

Fixed or non-fixed version?

...

Is that a difference between 32-bit code generator and 64-bit or between GHC versions (6.12.2 here, but 6.10.3 gives roughly the same results)?

Hmm. Compiler and platform matches. Unless you use some other 64-bit platform - not x86-64 ;)

...

...
...
main = print $! uncurry (+) (cFoldl' lengthFold lengthFold (0, 0) [1..size])

And that gives the same figures as the other two (plus/minus 0.05s).

...
All are compiled with optimizations.

All compiled with -O2.

Hmm. Difference between -O1 and -O2 Fixed versions and with sum (and -O2):

...

main = let a = [1..size] l = length a + sum a in print $! l

Lot's of memory(Over 3 GiB). I voluntarily killed process

...

main = print $! length [1..size] + sum [1..size]

Lot's of memory(Over 3 GiB). I voluntarily killed process. So far as being inplace.

...

main = print $! uncurry (+) (cFoldl' lengthFold (+) (0, 0) [1..size])

5000000150000000 16,889,753,976 bytes allocated in the heap 3,356,480 bytes copied during GC 1,976 bytes maximum residency (1 sample(s)) 28,200 bytes maximum slop 1 MB total memory in use (0 MB lost due to fragmentation) Generation 0: 32216 collections, 0 parallel, 0.29s, 0.38s elapsed Generation 1: 1 collections, 0 parallel, 0.00s, 0.00s elapsed INIT time 0.00s ( 0.00s elapsed) MUT time 14.89s ( 15.12s elapsed) GC time 0.29s ( 0.38s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 15.18s ( 15.50s elapsed) %GC time 1.9% (2.4% elapsed) Alloc rate 1,134,094,110 bytes per MUT second Productivity 98.1% of total user, 96.1% of total elapsed ./a.out +RTS -s 15.18s user 0.11s system 98% cpu 15.503 total Lowered to size = 100000 (bigger causes stack overflow in first main):

...

main = let a = [1..size] l = length a + sum a in print $! l

5000150000 22,045,352 bytes allocated in the heap 18,781,768 bytes copied during GC 6,316,904 bytes maximum residency (4 sample(s)) 3,141,912 bytes maximum slop 17 MB total memory in use (0 MB lost due to fragmentation) Generation 0: 23 collections, 0 parallel, 0.05s, 0.05s elapsed Generation 1: 4 collections, 0 parallel, 0.03s, 0.04s elapsed INIT time 0.00s ( 0.00s elapsed) MUT time 0.02s ( 0.03s elapsed) GC time 0.08s ( 0.09s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 0.10s ( 0.12s elapsed) %GC time 78.4% (75.8% elapsed) Alloc rate 1,002,334,818 bytes per MUT second Productivity 20.6% of total user, 17.2% of total elapsed ./a.out +RTS -s 0.10s user 0.02s system 96% cpu 0.129 total

...

main = print $! length [1..size] + sum [1..size]

5000150000 30,077,024 bytes allocated in the heap 17,482,888 bytes copied during GC 5,602,600 bytes maximum residency (4 sample(s)) 3,144,312 bytes maximum slop 15 MB total memory in use (0 MB lost due to fragmentation) Generation 0: 38 collections, 0 parallel, 0.04s, 0.05s elapsed Generation 1: 4 collections, 0 parallel, 0.03s, 0.03s elapsed INIT time 0.00s ( 0.00s elapsed) MUT time 0.02s ( 0.03s elapsed) GC time 0.07s ( 0.08s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 0.09s ( 0.11s elapsed) %GC time 72.0% (70.0% elapsed) Alloc rate 1,156,986,613 bytes per MUT second Productivity 26.9% of total user, 22.9% of total elapsed ./a.out +RTS -s 0.09s user 0.02s system 97% cpu 0.116 total

...

main = print $! uncurry (+) (cFoldl' lengthFold (+) (0, 0) [1..size])

5000150000 17,128,128 bytes allocated in the heap 10,608 bytes copied during GC 2,072 bytes maximum residency (1 sample(s)) 28,024 bytes maximum slop 1 MB total memory in use (0 MB lost due to fragmentation) Generation 0: 32 collections, 0 parallel, 0.00s, 0.00s elapsed Generation 1: 1 collections, 0 parallel, 0.00s, 0.00s elapsed INIT time 0.00s ( 0.00s elapsed) MUT time 0.02s ( 0.02s elapsed) GC time 0.00s ( 0.00s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 0.02s ( 0.02s elapsed) %GC time 5.3% (3.9% elapsed) Alloc rate 951,721,286 bytes per MUT second Productivity 94.7% of total user, 90.2% of total elapsed ./a.out +RTS -s 0.02s user 0.00s system 96% cpu 0.024 total It seems that the best, at least for large inputs. ----------------------------------------------------------------------- On the other hand it seems to form an arrow[1]. First the result of test: 5000000150000000 12,800,063,440 bytes allocated in the heap 2,545,048 bytes copied during GC 1,968 bytes maximum residency (1 sample(s)) 28,216 bytes maximum slop 1 MB total memory in use (0 MB lost due to fragmentation) Generation 0: 24414 collections, 0 parallel, 0.24s, 0.29s elapsed Generation 1: 1 collections, 0 parallel, 0.00s, 0.00s elapsed INIT time 0.00s ( 0.00s elapsed) MUT time 7.18s ( 7.35s elapsed) GC time 0.24s ( 0.29s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 7.42s ( 7.64s elapsed) %GC time 3.3% (3.8% elapsed) Alloc rate 1,782,759,994 bytes per MUT second Productivity 96.7% of total user, 93.9% of total elapsed ./a.out +RTS -s 7.42s user 0.09s system 98% cpu 7.648 total (Yes - lower then the code above - result reproducable). Code:

...

{-# LANGUAGE BangPatterns #-} import Control.Arrow import Control.Category import Data.List import Prelude hiding (id, (.))

newtype FoldlArrow a b c = FoldlArrow (a -> b -> c)

instance Category (FoldlArrow a) where id = FoldlArrow $ \a !b -> b (FoldlArrow f) . (FoldlArrow g) = FoldlArrow $ \a !b -> let !c = g a b in f a c

instance Arrow (FoldlArrow a) where arr f = FoldlArrow $ const f first (FoldlArrow f) = FoldlArrow $ \a (!b, !c) -> (f a b, c) second (FoldlArrow f) = FoldlArrow $ \a (!b, !c) -> (b, f a c) (FoldlArrow f) *** (FoldlArrow g) = FoldlArrow $ \a (!b, !c) -> (f a b, g a c) (FoldlArrow f) &&& (FoldlArrow g) = FoldlArrow $ \a !b -> (f a b, g a b)

instance ArrowChoice (FoldlArrow a) where left (FoldlArrow f) = FoldlArrow left' where left' a (Left !l) = (Left $! f a l) left' a (Right !r) = (Right r) right (FoldlArrow f) = FoldlArrow right' where right' a (Left !l) = (Left l) right' a (Right !r) = (Right $! f a r) (FoldlArrow f) +++ (FoldlArrow g) = FoldlArrow choice where choice a (Left !l) = (Left $! f a l) choice a (Right !r) = (Right $! g a r) (FoldlArrow f) ||| (FoldlArrow g) = FoldlArrow choice where choice a (Left !l) = f a l choice a (Right !r) = g a r

instance ArrowApply (FoldlArrow a) where app = FoldlArrow $ \a (FoldlArrow !f, !b) -> f a b

doFoldl :: FoldlArrow a b b -> b -> [a] -> b doFoldl (FoldlArrow f) = foldl' (flip f)

element :: FoldlArrow a b a element = FoldlArrow const

lengthA :: FoldlArrow a Int Int lengthA = arr (+1)

sumA :: Num a => FoldlArrow a a a sumA = arr (uncurry (+)) . (element &&& id)

size = 100000000

main = print $! uncurry (+) $ doFoldl (lengthA *** sumA) (0, 0) [1..size]

On the other hand if I use arrow syntax:

...

myArr :: Num a => FoldlArrow a (a, a) (a, a) myArr = proc (l, s) -> do e <- element -< () returnA -< (l + 1, s + e)

It starts consuming memory as well. Somewhere is lazy passing of value but I cannot find where Regards PS. As it is probably out of scope and topic of beginners mailing list I'm CC'ing cafe (possibly beginners should be dropped). [1] It can be extended to work on other arrows as well - not only (->).

Daniel Fischer

2:15 p.m.

New subject: [Long, probably not-beginners anymore] Parallel folds and folds as arrows (was: Re: Re: When, if ever, does Haskell "calculate once"?)

On Friday 07 May 2010 03:15:19, Maciej Piechotka wrote:

...

On Thu, 2010-05-06 at 23:46 +0200, Daniel Fischer wrote:

...
Share.share :: GHC.Types.Int GblId [Str: DmdType] Share.share = case GHC.List.$wlen @ GHC.Integer.Type.Integer Share.share_a 0 of ww_amc { __DEFAULT -> GHC.Types.I# (GHC.Prim.+# ww_amc ww_amc) }

Hmm. What's the name of this form and how to get it?

It's GHC's intermediate language, named Core. It's sort of a slimmed down Haskell. You can get it by a) compiling with the -ddump-simpl flag (e.g. ghc -O2 -ddump-simpl --make prog > prog.core If you don't redirect, it's spat out to stdout, better to have it in a file for reading. Also, it's easier to read with syntax-highlighting, plain Haskell-highlighting already goes a long way.) b) using Don Stewart's ghc-core (http://hackage.haskell.org/package/ghc- core), e.g. ghc.core -f html -- -O2 Source.hs > Source.html Looking at the core, you can see what GHC really makes from your code.

...

...
No.

cFoldl' f g (b0,c0) xs0 = lgo b0 c0 xs0 where lgo b c [] = (b,c) lgo !b !c (x:xs) = lgo (f b x) (g c x) xs

Ok. Fixed (I tried fast rewrite from foldr')

Doesn't matter when compiled with optimisations (produces nearly identical core), but without optimisations the original constructs a new pair in each step. However, I inadvertently swapped the clauses of lgo (could also be fixed by putting the bang patterns in the first clause). That didn't matter as long as both accumulating functions were lengthFold, but it makes a big difference when one is (+).

...

...
64-bit system? I get

64 bit, GHC 6.12.2. % ghc -V The Glorious Glasgow Haskell Compilation System, version 6.12.2 % file a.out a.out: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.9, not stripped

...
And that is strange, because I get the same figures for that one as for the first (times differ by a few hundredths of a second).

Fixed or non-fixed version?

...
Is that a difference between 32-bit code generator and 64-bit or between GHC versions (6.12.2 here, but 6.10.3 gives roughly the same results)?

Hmm. Compiler and platform matches. Unless you use some other 64-bit platform - not x86-64 ;)

No, I have a 32-bit system (I got almost exactly half the allocation figures you got, so I suspected your Ints [and small Integers] were twice as large as mine).

...

...
...
...
main = print $! uncurry (+) (cFoldl' lengthFold lengthFold (0, 0) [1..size])

And that gives the same figures as the other two (plus/minus 0.05s).

...
All are compiled with optimizations.

All compiled with -O2.

Hmm. Difference between -O1 and -O2

No, they produce identical core for these.

...

Fixed versions and with sum (and -O2):

...
main = let a = [1..size] l = length a + sum a in print $! l

Lot's of memory(Over 3 GiB). I voluntarily killed process

Yes, of course.

...

...
main = print $! length [1..size] + sum [1..size]

Lot's of memory(Over 3 GiB). I voluntarily killed process.

So far as being inplace.

Yes. Actually, with optimisations turned on, GHC *does* share the list [1 .. size]. Oops. It's not shared with -O0 unless you give it a name.

...

...
main = print $! uncurry (+) (cFoldl' lengthFold (+) (0, 0) [1..size])

5000000150000000 16,889,753,976 bytes allocated in the heap 3,356,480 bytes copied during GC 1,976 bytes maximum residency (1 sample(s)) 28,200 bytes maximum slop 1 MB total memory in use (0 MB lost due to fragmentation)

Generation 0: 32216 collections, 0 parallel, 0.29s, 0.38s elapsed Generation 1: 1 collections, 0 parallel, 0.00s, 0.00s elapsed

INIT time 0.00s ( 0.00s elapsed) MUT time 14.89s ( 15.12s elapsed) GC time 0.29s ( 0.38s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 15.18s ( 15.50s elapsed)

%GC time 1.9% (2.4% elapsed)

Alloc rate 1,134,094,110 bytes per MUT second

Productivity 98.1% of total user, 96.1% of total elapsed

./a.out +RTS -s 15.18s user 0.11s system 98% cpu 15.503 total

With the really fixed cFoldl' (overflows of course): ./lsCFoldl +RTS -s 1087459712 4,015,621,900 bytes allocated in the heap 126,564 bytes copied during GC 1,460 bytes maximum residency (1 sample(s)) 29,892 bytes maximum slop 1 MB total memory in use (0 MB lost due to fragmentation) Generation 0: 7659 collections, 0 parallel, 0.07s, 0.05s elapsed Generation 1: 1 collections, 0 parallel, 0.00s, 0.00s elapsed INIT time 0.00s ( 0.00s elapsed) MUT time 1.50s ( 1.54s elapsed) GC time 0.07s ( 0.05s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 1.56s ( 1.59s elapsed) %GC time 4.3% (3.1% elapsed) Alloc rate 2,684,070,586 bytes per MUT second Productivity 95.7% of total user, 94.2% of total elapsed Misfixed: ./lsTFoldl +RTS -s 1087459712 6,400,058,168 bytes allocated in the heap 638,784 bytes copied during GC 1,496 bytes maximum residency (1 sample(s)) 29,892 bytes maximum slop 1 MB total memory in use (0 MB lost due to fragmentation) Generation 0: 12207 collections, 0 parallel, 0.04s, 0.08s elapsed Generation 1: 1 collections, 0 parallel, 0.00s, 0.00s elapsed INIT time 0.00s ( 0.00s elapsed) MUT time 2.64s ( 2.64s elapsed) GC time 0.04s ( 0.08s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 2.68s ( 2.72s elapsed) %GC time 1.3% (3.1% elapsed) Alloc rate 2,420,446,752 bytes per MUT second Productivity 98.7% of total user, 97.1% of total elapsed

...

-----------------------------------------------------------------------

On the other hand it seems to form an arrow[1]. First the result of test: 5000000150000000 12,800,063,440 bytes allocated in the heap 2,545,048 bytes copied during GC 1,968 bytes maximum residency (1 sample(s)) 28,216 bytes maximum slop 1 MB total memory in use (0 MB lost due to fragmentation)

Generation 0: 24414 collections, 0 parallel, 0.24s, 0.29s elapsed Generation 1: 1 collections, 0 parallel, 0.00s, 0.00s elapsed

INIT time 0.00s ( 0.00s elapsed) MUT time 7.18s ( 7.35s elapsed) GC time 0.24s ( 0.29s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 7.42s ( 7.64s elapsed)

%GC time 3.3% (3.8% elapsed)

Alloc rate 1,782,759,994 bytes per MUT second

Productivity 96.7% of total user, 93.9% of total elapsed

./a.out +RTS -s 7.42s user 0.09s system 98% cpu 7.648 total

(Yes - lower then the code above - result reproducable).

Code:

...
{-# LANGUAGE BangPatterns #-} import Control.Arrow <snip> main = print $! uncurry (+) $ doFoldl (lengthA *** sumA) (0, 0) [1..size]

Hmm. For me, that gives identical performance (and identical core for main) as the misfixed cFoldl', the really fixed cFoldl' does significantly better. It becomes equal (for this task at least) if I make FoldlArrows also strict in the first argument (replace all "\a" with "\ !a").

...

Regards

PS. As it is probably out of scope and topic of beginners mailing list I'm CC'ing cafe (possibly beginners should be dropped).

[1] It can be extended to work on other arrows as well - not only (->).

Daniel Fischer

2:37 p.m.

New subject: [Long, probably not-beginners anymore] Parallel folds and folds as arrows (was: Re: Re: When, if ever, does Haskell "calculate once"?)

On Friday 07 May 2010 16:15:41, Daniel Fischer wrote:

...

b) using Don Stewart's ghc-core (http://hackage.haskell.org/package/ghc- core), e.g. ghc.core -f html -- -O2 Source.hs > Source.html

And of course, the html backend of ghc-core was removed with version 0.5 :( If you want html output, $ cabal install --constraint="hscolour == 1.13" ghc-core-0.4.3

Chaddaï Fouché

10 May 10 May

1:21 p.m.

On Thu, May 6, 2010 at 9:37 PM, Brent Yorgey wrote:

...

...
data MyList a = MyList {mylist::[a], mylength::Int}

There's no magic going on here, if you call a function to compute some complicated feature of a data structure multiple places in your code, it will be computed multiple times, just like in any other language. Caching the features you need as in the above example is a good idea if the data structures won't change often, and you really do need the features many times.

Notice that if you don't need mylength anywhere in your code, it won't be computed (except if you made your smart constructor extra-strict) so it's way easier in Haskell to implement the "compute this only once but only if I really need it" scheme that you often see in other languages. -- Jedaï

5539

Age (days ago)

5543

Last active (days ago)

List overview

Download

8 comments

5 participants

participants (5)

Brent Yorgey
Chaddaï Fouché
Daniel Fischer
Maciej Piechotka
Travis Erdman