Understanding GHC allocations

I'm trying to optimize the following program: http://github.com/feuerbach/particles/blob/303c8a17c9b732e22457b5409bdce4b75... Of course general suggestions are welcome (BTW I'm going to give a try to vector), but currently I'm concerned with two questions: 1. Heavy allocations in 'distance' function. Here is (part of) the profile: COST CENTRE MODULE %time %alloc ticks bytes d2 Main 9.0 22.0 290 600000000 d Main 8.6 65.9 278 1800000000 d1 Main 7.5 11.0 242 299700000
From reading core I got the impression that everything is strict & unboxed. Perhaps this is related to creating some closures? How to get rid of those allocations?
2. Again from reading the core I learned that although 'l' and other constants are inlined, their type is boxed Double. This makes sense since CAFs are evaluated on demand, but obviously in this particular case it does not make sense, so can I somehow make them unboxed? -- Roman I. Cheplyaka :: http://ro-che.info/ "Don't let school get in the way of your education." - Mark Twain

* Roman Cheplyaka
I'm trying to optimize the following program: http://github.com/feuerbach/particles/blob/303c8a17c9b732e22457b5409bdce4b75...
Of course general suggestions are welcome (BTW I'm going to give a try to vector), but currently I'm concerned with two questions:
1. Heavy allocations in 'distance' function. Here is (part of) the profile:
COST CENTRE MODULE %time %alloc ticks bytes
d2 Main 9.0 22.0 290 600000000 d Main 8.6 65.9 278 1800000000 d1 Main 7.5 11.0 242 299700000
From reading core I got the impression that everything is strict & unboxed. Perhaps this is related to creating some closures? How to get rid of those allocations?
2. Again from reading the core I learned that although 'l' and other constants are inlined, their type is boxed Double. This makes sense since CAFs are evaluated on demand, but obviously in this particular case it does not make sense, so can I somehow make them unboxed?
Forgot to mention, I'm using ghc 6.12.1, compiling with -O2. -- Roman I. Cheplyaka :: http://ro-che.info/ "Don't let school get in the way of your education." - Mark Twain

On Thursday 17 June 2010 11:43:09, Roman Cheplyaka wrote:
* Roman Cheplyaka
[2010-06-17 12:40:59+0300] I'm trying to optimize the following program: http://github.com/feuerbach/particles/blob/303c8a17c9b732e22457b5409bd ce4b7520be94a/run.hs
Of course general suggestions are welcome (BTW I'm going to give a try to vector), but currently I'm concerned with two questions:
1. Heavy allocations in 'distance' function. Here is (part of) the profile:
COST CENTRE MODULE %time %alloc ticks bytes
d2 Main 9.0 22.0 290 600000000 d Main 8.6 65.9 278 1800000000 d1 Main 7.5 11.0 242 299700000
I suspect the distance function is not what you intended, distance :: Double -> Double -> Double distance !x1 !x2 = {-# SCC "min" #-} min d1 d2 where d = {-# SCC "d" #-} x1 - x2 d1 = {-# SCC "d1" #-} abs d d2 = {-# SCC "d2" #-} abs $ l - d that would give distance 0.2 24.8 = 24.6, while the wrapping suggests that it should be 0.4, so in d2, it should be d1 instead of d. Either way, both d and d1 are <= 25, so the 'abs' in d2 is superfluous, removing that alone reduces the allocations drastically and the running time by ~40% (astonishingly, not in the profiling version, I suspect it's because profiling needs a few registers so that there aren't enough left for the loops on my box). Further, if you export only main from the module, you allow GHC to be more aggressive with optimising. On my box, that leads to more allocation again because there aren't enough registers, but things become a little faster. Also, a few more bangs here and there plus a couple of INLINE and UNPACK pragmas speed things up, the (on my box) fastest combination I've found is attached, it has the same semantics for distance as the original code, changing distance to what I believe it should be unfortunately slows it down significantly. On my box, I get a further big speedup by compiling with -O2 -fexcess-precision -fvia-C -optc-O3
From reading core I got the impression that everything is strict &
unboxed.
Not everything, there lurk a few boxed Doubles e.g. in average.
Perhaps this is related to creating some closures? How to get rid of those allocations?
Do you need to? Sometimes an allocating loop is faster than a non- allocating one (of course, if you have enough registers for the allocating loop to run entirely in registers, it'll be much faster still). IMO, the important criteria are time and resident memory, not allocation.
2. Again from reading the core I learned that although 'l' and other constants are inlined, their type is boxed Double. This makes sense since CAFs are evaluated on demand, but obviously in this particular case it does not make sense, so can I somehow make them unboxed?
Putting bangs in the loops where they are used likely uses the unboxed values; not exporting them too.
Forgot to mention, I'm using ghc 6.12.1, compiling with -O2.

* Daniel Fischer
On Thursday 17 June 2010 11:43:09, Roman Cheplyaka wrote:
* Roman Cheplyaka
[2010-06-17 12:40:59+0300] I'm trying to optimize the following program: http://github.com/feuerbach/particles/blob/303c8a17c9b732e22457b5409bd ce4b7520be94a/run.hs
Of course general suggestions are welcome (BTW I'm going to give a try to vector), but currently I'm concerned with two questions:
1. Heavy allocations in 'distance' function. Here is (part of) the profile:
COST CENTRE MODULE %time %alloc ticks bytes
d2 Main 9.0 22.0 290 600000000 d Main 8.6 65.9 278 1800000000 d1 Main 7.5 11.0 242 299700000
I suspect the distance function is not what you intended, distance 0.2 24.8 = 24.6, while the wrapping suggests that it should be 0.4, so in d2, it should be d1 instead of d.
Good catch! :)
Either way, both d and d1 are <= 25, so the 'abs' in d2 is superfluous,
Correct
removing that alone reduces the allocations drastically and the running time by ~40%
That's exactly what I'm asking about. 'abs' in C does not require any allocations, does it? So why does it require any allocations in Haskell, assuming we've got no lazyness, typeclass indirection (I assume 'abs' was specialized and inlined) or other high-level features in resulted low-level code?
Further, if you export only main from the module, you allow GHC to be more aggressive with optimising. On my box, that leads to more allocation again because there aren't enough registers, but things become a little faster.
Good idea indeed.
Perhaps this is related to creating some closures? How to get rid of those allocations?
Do you need to? Sometimes an allocating loop is faster than a non- allocating one (of course, if you have enough registers for the allocating loop to run entirely in registers, it'll be much faster still).
IMO, the important criteria are time and resident memory, not allocation.
Maybe, but what bothers me is that I can't answer myself where are those allocation from. What problem do they solve?
2. Again from reading the core I learned that although 'l' and other constants are inlined, their type is boxed Double. This makes sense since CAFs are evaluated on demand, but obviously in this particular case it does not make sense, so can I somehow make them unboxed?
Putting bangs in the loops where they are used likely uses the unboxed values; not exporting them too.
I'll play with this, thanks. -- Roman I. Cheplyaka :: http://ro-che.info/ "Don't let school get in the way of your education." - Mark Twain

-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On Jun 17, 2010, at 12:35 , Roman Cheplyaka wrote:
That's exactly what I'm asking about. 'abs' in C does not require any allocations, does it? So why does it require any allocations in Haskell, assuming we've got no lazyness, typeclass indirection (I assume 'abs' was specialized and inlined) or other high-level features in resulted low-level code?
Quite a few obvious specializations / rules are missing; you can propose them. Answering your question about the implementation of abs, my guess is that the current definition works for every type (I'm pretty sure C doesn't know what to do with a Data.Ratio). - -- brandon s. allbery [solaris,freebsd,perl,pugs,haskell] allbery@kf8nh.com system administrator [openafs,heimdal,too many hats] allbery@ece.cmu.edu electrical and computer engineering, carnegie mellon university KF8NH -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.10 (Darwin) iEYEARECAAYFAkwacCwACgkQIn7hlCsL25XS/wCgvZWVoxZLrIlNywg4ZLA6tEwW UpEAnjTLW5E5EHYcOIu0Eq1Rclh2wxfR =BFE2 -----END PGP SIGNATURE-----

On 17.06.10 12:40, Roman Cheplyaka wrote:
From reading core I got the impression that everything is strict& unboxed. Perhaps this is related to creating some closures? How to get rid of those allocations?
Yes, "distance" creates a closure of type @Double -> Double# -> Double@ which is obviously not necessary. I do not know why.
2. Again from reading the core I learned that although 'l' and other constants are inlined, their type is boxed Double. This makes sense since CAFs are evaluated on demand, but obviously in this particular case it does not make sense, so can I somehow make them unboxed Hmm, I learned from "-ddump-core" that "distance" function uses constant "25.0".
There is another way to optimize — make GHC use floating point "abs" processor instruction. Now it uses {{{ abs x | x >= 0.0 = x | otherwise = negateDouble x }}} http://hackage.haskell.org/packages/archive/base/latest/doc/html/src/GHC-Flo... -- Best regards, Roman Beslik.

On 17.06.10 19:12, Roman Beslik wrote:
On 17.06.10 12:40, Roman Cheplyaka wrote:
From reading core I got the impression that everything is strict& unboxed. Perhaps this is related to creating some closures? How to get rid of those allocations? Yes, "distance" creates a closure of type @Double -> Double# -> Double@ which is obviously not necessary. I do not know why. "-funfolding-use-threshold=7" removes that closure.
-- Best regards, Roman Beslik.
participants (4)
-
Brandon S. Allbery KF8NH
-
Daniel Fischer
-
Roman Beslik
-
Roman Cheplyaka