
Below are some notes on this for Simon PJ and Alberto. In general, GHC is doing very well here, with only one small wibble preventing the recursive version from running as fast as the C version.

Alberto Ruiz writes:

My experience is that Haskell allocation time is very fast and usually negligible in most non-trivial matrix computations.
A good thing about

    sum v = constant 1 (dim v) <.> v

is that a constant vector is efficiently created internally (not from an intermediate Haskell list), and the inner product will be computed by the possibly optimized BLAS version available on your machine.
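For instance, in a hypothetical GHCi session (the exact Show output may vary between hmatrix versions):

    > let v = fromList [1,2,3,4,5] :: Vector Double
    > constant 1 (dim v) <.> v
    15.0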
You can also write simple definitions like the following, which computes the average of the rows of a matrix as a simple vector-matrix product:

    mean m = w <> m
      where
        w = constant k r
        r = rows m
        k = 1 / fromIntegral r
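For example, with a small matrix built using hmatrix's (><) constructor (a made-up illustration, not from the thread):

    > let m = (2><3) [1,2,3,4,5,6] :: Matrix Double
    > mean m    -- averages the two rows: [2.5, 3.5, 4.5]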
I prefer high-level "index free" matrix operations to C-like code. Definitions are clearer, have fewer bugs, and are also more efficient if you use optimized numerical libs.
In any case, many algorithms can be solved by the recursive approach described by Tomas.
xj2106 writes:

After reading Don's blog post, I tried to make a test with both methods. The code is very simple, as follows:
    {-# LANGUAGE BangPatterns #-}
    module Main where

    import System.Environment (getArgs)
    import Numeric.LinearAlgebra

    -- sum via a BLAS inner product with an all-ones vector
    vsum1 :: Vector Double -> Double
    vsum1 v = constant 1 (dim v) <.> v

    -- sum via an explicit strict loop over the indices
    vsum2 :: Vector Double -> Double
    vsum2 v = go (d - 1) 0
      where
        d = dim v
        go :: Int -> Double -> Double
        go 0  s  = s + (v @> 0)
        go !j !s = go (j - 1) (s + (v @> j))

    mean :: Vector Double -> Double
    mean v = vsum v / fromIntegral (dim v)
      where vsum = vsum1   -- switch to vsum2 for the recursive version

    main :: IO ()
    main = do
        fn:nrow:ncol:_ <- getArgs
        print . mean . flatten =<< fromFile fn (read nrow, read ncol)
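To reproduce this, a data file can be generated with something like the sketch below. It assumes fromFile reads plain whitespace-separated ASCII numbers (how I understand hmatrix's loader to have behaved at the time); the file name "huge" matches the runs shown later.

    -- makehuge.hs (hypothetical helper): write 5,000,000 doubles, one per line
    main :: IO ()
    main = writeFile "huge" (unlines (map show [1 .. 5000000 :: Double]))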
Compile it with "-O2 -optc-O2", and run with a data set of length 5000000. The results are:
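Concretely, the compile and run steps look something like this (the module file name is my guess; "+RTS -sstderr" is what produces the GC statistics quoted below):

    $ ghc -O2 -optc-O2 --make Mean.hs -o vsum1
    $ ./vsum1 huge 5000000 1 +RTS -sstderr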
with "vsum1": 80,077,984 bytes allocated in the heap 2,208 bytes copied during GC (scavenged) 64 bytes copied during GC (not scavenged) 40,894,464 bytes maximum residency (2 sample(s)) %GC time 0.0% (0.0% elapsed) Alloc rate 35,235,448 bytes per MUT second ./vsum1 huge 5000000 1 2.25s user 0.09s system 99% cpu 2.348 total
This is reasonable: 5,000,000 doubles at 8 bytes each is 40 MB, so the 80 MB allocated is exactly two copies of the vector.
with "vsum2": 560,743,120 bytes allocated in the heap 19,160 bytes copied during GC (scavenged) 15,920 bytes copied during GC (not scavenged) 40,919,040 bytes maximum residency (2 sample(s)) %GC time 0.3% (0.3% elapsed) Alloc rate 222,110,261 bytes per MUT second ./mean2 huge 5000000 1 2.53s user 0.06s system 99% cpu 2.598 total
This is strange: a lot of extra heap usage. Probably because '@>' is not efficient? So it looks like the inner-product approach wins by a fair margin.
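One quick way to probe whether '@>' is the problem is to bypass it entirely. A minimal sketch, assuming Data.Packed.Internal.Vector exposes the dim and fptr accessors (as hmatrix's internal module did at the time):

    {-# LANGUAGE BangPatterns #-}
    import Data.Packed.Internal.Vector (Vector, dim, fptr)
    import Foreign.ForeignPtr (withForeignPtr)
    import Foreign.Ptr (Ptr)
    import Foreign.Storable (peekElemOff)
    import System.IO.Unsafe (unsafePerformIO)

    -- sum by raw pointer reads: no boxing through '@>', no bounds checks
    vsum3 :: Vector Double -> Double
    vsum3 v = unsafePerformIO $
        withForeignPtr (fptr v) $ \p -> go p (dim v - 1) 0
      where
        go :: Ptr Double -> Int -> Double -> IO Double
        go p !j !s
          | j < 0     = return s
          | otherwise = do x <- peekElemOff p j
                           go p (j - 1) (s + x)

If vsum3's allocation figures match vsum1's, the overhead lives in '@>' rather than in the recursion itself.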
Yes, I'd suspect that @> isn't fully unlifted, so you get some heap allocation on each index, each time around the loop? We'd have to look at how @> was defined to spot why.

I began with:

    $ cabal install hmatrix

which failed, due to missing linking against gfortran and gslcblas. Added those to the cabal file, and hmatrix was installed.

Looking at the core we get from vsum2, we see that it compiles quite well:

    $wgo_s1Nn :: Int# -> Double# -> Double#
    $wgo_s1Nn =
      \ (ww3_s1Mc :: Int#) (ww4_s1Mg :: Double#) ->
        case ww3_s1Mc of ds_X17R {
          __DEFAULT ->
            case Data.Packed.Internal.Vector.$wat
                   @ Double Foreign.Storable.$f9 ww_s1Ms ww1_s1Mu (I# ds_X17R)
            of { D# y_a1KQ ->
              $wgo_s1Nn (-# ds_X17R 1) (+## ww4_s1Mg y_a1KQ)
            };
          0 -> +## ww4_s1Mg ww2_s1M7
        };
    } in $wgo_s1Nn (-# ww_s1Ms 1) 0.0

But note the return value from $wat is boxed, only to be immediately unboxed. That looks to be the source of the heap allocations. Let's see if we can help that. Vector is defined as:

    data Vector t = V { dim  :: Int          -- ^ number of elements
                      , fptr :: ForeignPtr t -- ^ foreign pointer to the memory block
                      }

While a much better definition would be:

    data Vector t = V { dim  :: {-# UNPACK #-} !Int            -- ^ number of elements
                      , fptr :: {-# UNPACK #-} !(ForeignPtr t) -- ^ foreign pointer to the memory block
                      }

And we can add some inlining to at' and at. That might be enough for GHC to see through to the D# boxing. Now they're fully unfolded and specialised into our source program:

    True ->
      case unsafeDupablePerformIO
             @ Double
             ((\ (eta1_a1KA :: State# RealWorld) ->
                 case noDuplicate# eta1_a1KA of s'_a1KB { __DEFAULT ->
                 case readDoubleOffAddr# @ RealWorld ww1_s1NB ds_X184 s'_a1KB
                 of wild2_a1Lc { (# s2_a1Le, x_a1Lf #) ->
                 case touch# @ ForeignPtrContents ww2_s1NC s2_a1Le
                 of s_a1KS { __DEFAULT ->
    BAD ----->       (# s_a1KS, D# x_a1Lf #)
                 } } })
              `cast` (sym ((:CoIO) Double)
                      :: State# RealWorld -> (# State# RealWorld, Double #)
                           ~
                         IO Double))
      of wild11_a1LC { D# y_a1LE ->
      $wgo_s1Oz (-# ds_X184 1) (+## ww4_s1Nq y_a1LE)

But still the readDoubleOffAddr# result is boxed into D#, and then unboxed in the loop. A GHC optimiser flaw? Possibly the same as http://hackage.haskell.org/trac/ghc/ticket/2289. Simon PJ, does that seem right?

vsum1, meanwhile, is pretty much unoptimised calls into C:

    vsum1 :: Data.Packed.Internal.Vector.Vector Double -> Double
    vsum1 =
      \ (v_anC :: Data.Packed.Internal.Vector.Vector Double) ->
        Numeric.LinearAlgebra.Linear.dot
          @ Double
          Data.Packed.Internal.Matrix.$f2
          (Data.Packed.Internal.Matrix.$sconstantAux
             Data.Packed.Internal.Matrix.cconstantR
             lit
             (case v_anC of tpl_B2 {
                Data.Packed.Internal.Vector.V ipv_B3 ipv1_B4 -> ipv_B3 }))
          v_anC

Alberto, I would modify the other types in hmatrix similarly:

    data Matrix t = MC { rows :: {-# UNPACK #-} !Int, ... cols :: !Int, cdat :: !(Vector t) }
                  | MF { rows :: {-# UNPACK #-} !Int, ... cols :: !Int, fdat :: !(Vector t) }

ensuring we can keep these things in registers. But it would be wise to have some benchmarking too. Alberto, what do you think? Do you have some test data we could play with, to see if unboxing the Haskell skin over the underlying C calls helps with Haskell-side loops and iterative calls?

In general, though, hmatrix looks *very* well written. I'd have good confidence in this library for performance work.

-- Don
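For the benchmarking Don asks about, a minimal harness along the following lines would do. It uses the criterion package, which is my suggestion rather than anything named in the thread, and reuses the two summation functions from above:

    {-# LANGUAGE BangPatterns #-}
    module Main where

    import Criterion.Main
    import Numeric.LinearAlgebra

    vsum1, vsum2 :: Vector Double -> Double
    vsum1 v = constant 1 (dim v) <.> v     -- BLAS inner product
    vsum2 v = go (dim v - 1) 0             -- explicit Haskell loop
      where
        go 0  s  = s + (v @> 0)
        go !j !s = go (j - 1) (s + (v @> j))

    main :: IO ()
    main = do
        let v = constant 1 5000000 :: Vector Double
        defaultMain
          [ bench "vsum1 (BLAS dot)"     (whnf vsum1 v)
          , bench "vsum2 (Haskell loop)" (whnf vsum2 v)
          ]

Any improvement from the UNPACK and INLINE changes should then show up directly in the vsum2 timing.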