
On Thursday 06 May 2010 22:49:40, Maciej Piechotka wrote:
On Thu, 2010-05-06 at 15:37 -0400, Brent Yorgey wrote:
On Thu, May 06, 2010 at 11:35:15AM -0700, Travis Erdman wrote:
Two questions here, I think they might be related, perhaps even the
same, but I am not sure, so I will ask both:
Q1: Re the function f below, I like the first implementation as
it's "cleaner", but is the second implementation necessary for performance purposes?
-- g = some CPU intensive function
-- alternate 1 f a b = c + (g b) where c = dosomethingelse a (g b)
-- alternate 2 f a b = c + saveit where saveit = g b c = dosomethingelse a saveit
You need alternative 2. In general, GHC (and, I imagine, other Haskell compilers) do not do much common subexpression elimination, because in some cases it can be a *pessimization*. The classic example is
length [1..1000000] + length [1..1000000]
vs
let a = [1..1000000] in length a + length a
Not a particularly fortunate example, with optimisations turned on, GHC shares the common subexpression (length a), so that's calculated only once: share :: Int share = let a = [1 .. 100000000] in length a + length a Core: Share.share :: GHC.Types.Int GblId [Str: DmdType] Share.share = case GHC.List.$wlen @ GHC.Integer.Type.Integer Share.share_a 0 of ww_amc { __DEFAULT -> GHC.Types.I# (GHC.Prim.+# ww_amc ww_amc) } Consider let a = [1 .. 100000000] in length a + sum a vs length [1 .. 100000000] + sum [1 .. 100000000]
The first will execute in constant space, since each list will be lazily generated as needed by the length function and then the garbage collector will come along behind length and get rid of the nodes that have already been processed. However, in the second expression, the garbage collector cannot get rid of the nodes that are already processed by the first call to length, since the second call to length still needs the list. So the entire list [1..1000000] will end up being in memory at once.
Hmm:
cFoldr :: (a -> b -> b) -> (a -> c -> c) -> (b, c) -> [a] -> (b, c) cFoldr f g ~(b, c) [] = (b, c) cFoldr f g ~(b, c) (x:xs) = let (b', c') = cFoldr f g (b, c) xs in (f x b', g x c')
cFoldl' :: (b -> a -> b) -> (c -> a -> c) -> (b, c) -> [a] -> (b, c) cFoldl' f g bc xs0 = lgo bc xs0 where lgo (b, c) [] = (b, c) lgo (b, c) (x:xs) = let b' = f b x c' = g c x bc' = b' `seq` c' `seq` (b', c') in bc' `seq` lgo bc' xs
No. cFoldl' f g (b0,c0) xs0 = lgo b0 c0 xs0 where lgo b c [] = (b,c) lgo !b !c (x:xs) = lgo (f b x) (g c x) xs
lengthFold :: Int -> a -> Int lengthFold !n _ = n + 1
size = 100000000
main = let a = [1..size] l = length a + length a in print $! l
200000000 8,031,190,480 bytes allocated in the heap
64-bit system? I get 200000000 4,015,621,900 bytes allocated in the heap 187,840 bytes copied during GC 1,472 bytes maximum residency (1 sample(s)) 29,892 bytes maximum slop 1 MB total memory in use (0 MB lost due to fragmentation) Generation 0: 7659 collections, 0 parallel, 0.10s, 0.09s elapsed Generation 1: 1 collections, 0 parallel, 0.00s, 0.00s elapsed INIT time 0.00s ( 0.00s elapsed) MUT time 3.38s ( 3.41s elapsed) GC time 0.10s ( 0.09s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 3.48s ( 3.51s elapsed) %GC time 2.9% (2.7% elapsed) Alloc rate 1,187,979,304 bytes per MUT second Productivity 97.1% of total user, 96.4% of total elapsed
741,264 bytes copied during GC 1,920 bytes maximum residency (1 sample(s)) 28,216 bytes maximum slop 1 MB total memory in use (0 MB lost due to fragmentation)
Generation 0: 15318 collections, 0 parallel, 0.14s, 0.18s elapsed Generation 1: 1 collections, 0 parallel, 0.00s, 0.00s elapsed
INIT time 0.00s ( 0.00s elapsed) MUT time 7.15s ( 7.26s elapsed) GC time 0.14s ( 0.18s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 7.29s ( 7.44s elapsed)
%GC time 1.9% (2.5% elapsed)
Alloc rate 1,123,257,562 bytes per MUT second
Productivity 98.1% of total user, 96.1% of total elapsed
./a.out +RTS -s 7.29s user 0.05s system 98% cpu 7.450 total
main = print $! length [1..size] + length [1..size]
200000000 16,062,318,576 bytes allocated in the heap 1,476,904 bytes copied during GC 2,000 bytes maximum residency (1 sample(s)) 27,992 bytes maximum slop 1 MB total memory in use (0 MB lost due to fragmentation)
Generation 0: 30637 collections, 0 parallel, 0.29s, 0.37s elapsed Generation 1: 1 collections, 0 parallel, 0.00s, 0.00s elapsed
INIT time 0.00s ( 0.00s elapsed) MUT time 15.57s ( 15.90s elapsed) GC time 0.29s ( 0.37s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 15.87s ( 16.27s elapsed)
%GC time 1.8% (2.3% elapsed)
Alloc rate 1,031,313,475 bytes per MUT second
Productivity 98.2% of total user, 95.7% of total elapsed
./a.out +RTS -s 15.87s user 0.11s system 98% cpu 16.272 total
And that is strange, because I get the same figures for that one as for the first (times differ by a few hundredths of a second). Is that a difference between 32-bit code generator and 64-bit or between GHC versions (6.12.2 here, but 6.10.3 gives roughly the same results)?
main = print $! uncurry (+) (cFoldl' lengthFold lengthFold (0, 0) [1..size])
And that gives the same figures as the other two (plus/minus 0.05s).
All are compiled with optimizations.
All compiled with -O2.