 
            Hi, I have a statically-sized-list library that I use for linear algebra stuff. It's got a vector type something like this:
data V a b = V a b
so that a 3D vector is
type Vec3 a = V a (V a (V a ()))
and I have type classes for operations on these things, like so:
class VZipWith a b c u v w | {-lots of fundeps-} where vzipWith :: (a -> b -> c) -> u -> v -> w
instance VZipWith a b c (V a ()) (V b ()) (V c ()) where vzipWith f (V x ()) (V y ()) = V (f x y) ()
instance VZipWith a b c (V a u) (V b v) (V c w) => VZipWith a b c (V a (V a u)) (V b (V b v)) (V c (V c w)) where vzipWith f (V x u) (V y v) = V (f x y) (vzipWith f u v)
so that vector addition is
vadd = vzipWith (+)
I put strictness annotations and INLINE pragmas all over the place, and GHC does wonders with it. Using Storable instances something like the following,
instance Storable a => Storable (V a ()) where sizeOf _ = sizeOf (undefined::a) peek p = peek (castPtr p) >>= \a -> return (V a ()) --etc
instance (Storable a, Storable v) => Storable (V a v) where sizeOf _ = sizeOf (undefined::a) + sizeOf (undefined::v) peek p = a <- peek (castPtr p) v <- peek (castPtr (p`plusPtr`sizeOf(undefined::a))) return (V a v)
GHC can turn a loop like this,
forM_ [0..n] $ \i -> do a <- peekElemOff aptr i b <- peekElemOff bptr i pokeElemOff cptr i (vadd a b)
into something as fast as C, using no heap. You look at the core and its nothing but readDoubleOffAddr#, +## and the like. I went so far as to generalize this to matrices with things like vector-matrix and matrix-matrix multiplication, determinants and all that and, when used in loops like above, it's consistently as fast or even faster than C. However when I do this:
newtype Quaternion = Q (Vec4 Double)
Everything is ruined. Functions like peek and vadd are no longer inlined, intermediate linked lists are created all over the place. The Quaternion Storable instance looks like this
instance Storable s => Storable (Quaternion s) where sizeOf _ = 4*sizeOf (undefined::s) peek p = peek (castPtr p :: Ptr (Vec4 s)) >>= \v -> return (Q v)
with strictness annotations and INLINEs for everything. I also tried automatic newtype deriving, with no luck. Why does a newtype defeat so much of the optimization? Thanks, Scott