newtypes and optimization

Hi, I have a statically-sized-list library that I use for linear algebra stuff. It's got a vector type something like this:
data V a b = V a b
so that a 3D vector is
type Vec3 a = V a (V a (V a ()))
and I have type classes for operations on these things, like so:
class VZipWith a b c u v w | {-lots of fundeps-} where vzipWith :: (a -> b -> c) -> u -> v -> w
instance VZipWith a b c (V a ()) (V b ()) (V c ()) where vzipWith f (V x ()) (V y ()) = V (f x y) ()
instance VZipWith a b c (V a u) (V b v) (V c w) => VZipWith a b c (V a (V a u)) (V b (V b v)) (V c (V c w)) where vzipWith f (V x u) (V y v) = V (f x y) (vzipWith f u v)
so that vector addition is
vadd = vzipWith (+)
I put strictness annotations and INLINE pragmas all over the place, and GHC does wonders with it. Using Storable instances something like the following,
instance Storable a => Storable (V a ()) where sizeOf _ = sizeOf (undefined::a) peek p = peek (castPtr p) >>= \a -> return (V a ()) --etc
instance (Storable a, Storable v) => Storable (V a v) where sizeOf _ = sizeOf (undefined::a) + sizeOf (undefined::v) peek p = a <- peek (castPtr p) v <- peek (castPtr (p`plusPtr`sizeOf(undefined::a))) return (V a v)
GHC can turn a loop like this,
forM_ [0..n] $ \i -> do a <- peekElemOff aptr i b <- peekElemOff bptr i pokeElemOff cptr i (vadd a b)
into something as fast as C, using no heap. You look at the core and its nothing but readDoubleOffAddr#, +## and the like. I went so far as to generalize this to matrices with things like vector-matrix and matrix-matrix multiplication, determinants and all that and, when used in loops like above, it's consistently as fast or even faster than C. However when I do this:
newtype Quaternion = Q (Vec4 Double)
Everything is ruined. Functions like peek and vadd are no longer inlined, intermediate linked lists are created all over the place. The Quaternion Storable instance looks like this
instance Storable s => Storable (Quaternion s) where sizeOf _ = 4*sizeOf (undefined::s) peek p = peek (castPtr p :: Ptr (Vec4 s)) >>= \v -> return (Q v)
with strictness annotations and INLINEs for everything. I also tried automatic newtype deriving, with no luck. Why does a newtype defeat so much of the optimization? Thanks, Scott

On Wed, Dec 12, 2007 at 11:02:15AM -0700, Scott Dillard wrote:
with strictness annotations and INLINEs for everything. I also tried automatic newtype deriving, with no luck. Why does a newtype defeat so much of the optimization?
Thanks, Scott
(Not a GHC developer, but someone fairly familiar with how the Simons work) What version of GHC are you using? The implementation of newtypes was completely redone in the 6.7.x period. Do you have a fairly small complete working example? If so, link to or attach a tarball - will make their jobs much easier. Stefan

stefanor:
On Wed, Dec 12, 2007 at 11:02:15AM -0700, Scott Dillard wrote:
with strictness annotations and INLINEs for everything. I also tried automatic newtype deriving, with no luck. Why does a newtype defeat so much of the optimization?
Thanks, Scott
(Not a GHC developer, but someone fairly familiar with how the Simons work)
What version of GHC are you using? The implementation of newtypes was completely redone in the 6.7.x period.
Do you have a fairly small complete working example? If so, link to or attach a tarball - will make their jobs much easier.
Stefan
Yeah, this sounds like maybe a bug, or maybe something wrong. We need to investigate! An example please. -- Don

You can find an example at:
graphics.cs.ucdavis.edu/~sdillard/newtype-maybe-bug.tar.gz
Here's my session with it:
ghc --make -O2 -DFAST Test
time ./Test +RTS -tstderr
<
stefanor:
On Wed, Dec 12, 2007 at 11:02:15AM -0700, Scott Dillard wrote:
with strictness annotations and INLINEs for everything. I also tried automatic newtype deriving, with no luck. Why does a newtype defeat so much of the optimization?
Thanks, Scott
(Not a GHC developer, but someone fairly familiar with how the Simons work)
What version of GHC are you using? The implementation of newtypes was completely redone in the 6.7.x period.
Do you have a fairly small complete working example? If so, link to or attach a tarball - will make their jobs much easier.
Stefan
Yeah, this sounds like maybe a bug, or maybe something wrong. We need to investigate! An example please.
-- Don

| However when I do this: | | > newtype Quaternion = Q (Vec4 Double) | | Everything is ruined. Functions like peek and vadd are no longer inlined, | intermediate linked lists are created all over the place. The Quaternion | Storable instance looks like this Turns out this is a perf bug in 6.8 that I fixed a couple of weeks ago in the HEAD, but didn't merge. (Implication constraints aren't getting INLINE pragmas.) With the HEAD we get this, which should make you happy. The HEAD allocates only 9kbytes in both -DSLOW and -DFAST, whereas 6.8 allocates 21kbytes in -DFAST (and off the map for -DSLOW). I guess we should get this patch into the 6.8 branch. Simon $gpj --make -O2 -DFAST Test -o Test-fast [1 of 2] Compiling VecMath ( VecMath.hs, VecMath.o ) NOTE: Simplifier still going after 4 iterations; bailing out. Size = 7311 [2 of 2] Compiling Main ( Test.hs, Test.o ) Linking Test-fast ... bash-3.1$ rm -f *.o bash-3.1$ $gpj --make -O2 -DSLOW Test -o Test-slow [1 of 2] Compiling VecMath ( VecMath.hs, VecMath.o ) NOTE: Simplifier still going after 4 iterations; bailing out. Size = 7311 [2 of 2] Compiling Main ( Test.hs, Test.o ) Linking Test-slow ... bash-3.1$ time ./Test-fast +RTS -sstderr ./Test-fast +RTS -sstderr 9,432 bytes allocated in the heap 552 bytes copied during GC (scavenged) 0 bytes copied during GC (not scavenged) 32,768 bytes maximum residency (1 sample(s)) 1 collections in generation 0 ( 0.00s) 1 collections in generation 1 ( 0.00s) 1 Mb total memory in use INIT time 0.00s ( 0.00s elapsed) MUT time 5.88s ( 6.87s elapsed) GC time 0.00s ( 0.01s elapsed) EXIT time 0.00s ( 0.01s elapsed) Total time 5.88s ( 6.88s elapsed) %GC time 0.0% (0.1% elapsed) Alloc rate 1,605 bytes per MUT second Productivity 100.0% of total user, 85.4% of total elapsed real 0m6.973s user 0m5.880s sys 0m0.956s bash-3.1$ time ./Test-slow +RTS -sstderr ./Test-slow +RTS -sstderr 9,432 bytes allocated in the heap 552 bytes copied during GC (scavenged) 0 bytes copied during GC (not scavenged) 32,768 bytes maximum residency (1 sample(s)) 1 collections in generation 0 ( 0.00s) 1 collections in generation 1 ( 0.00s) 1 Mb total memory in use INIT time 0.00s ( 0.00s elapsed) MUT time 5.90s ( 6.83s elapsed) GC time 0.00s ( 0.03s elapsed) EXIT time 0.00s ( 0.03s elapsed) Total time 5.90s ( 6.86s elapsed) %GC time 0.0% (0.4% elapsed) Alloc rate 1,597 bytes per MUT second Productivity 100.0% of total user, 86.0% of total elapsed real 0m6.958s user 0m5.904s sys 0m1.004s bash-3.1$ ghc --make -O2 -DFAST Test -o Test-682 [1 of 2] Compiling VecMath ( VecMath.hs, VecMath.o ) [2 of 2] Compiling Main ( Test.hs, Test.o ) Linking Test-682 ... bash-3.1$ time ./Test-682 +RTS -sstderr ./Test-682 +RTS -sstderr 21,752 bytes allocated in the heap 552 bytes copied during GC (scavenged) 0 bytes copied during GC (not scavenged) 32,768 bytes maximum residency (1 sample(s)) 1 collections in generation 0 ( 0.00s) 1 collections in generation 1 ( 0.00s) 1 Mb total memory in use INIT time 0.00s ( 0.00s elapsed) MUT time 5.77s ( 6.69s elapsed) GC time 0.00s ( 0.00s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 5.77s ( 6.69s elapsed) %GC time 0.0% (0.0% elapsed) Alloc rate 3,770 bytes per MUT second Productivity 100.0% of total user, 86.2% of total elapsed real 0m6.787s user 0m5.768s sys 0m1.016s bash-3.1$
participants (4)
-
Don Stewart
-
Scott Dillard
-
Simon Peyton-Jones
-
Stefan O'Rear