Re: [Haskell-cafe] "sum" in hmatrix and blas?

9 Jun 2008

      Problem solved. The Haskell loop now wins.
...
...
After reading don's blog, I tried to make a test with both
methods.  The code is very simple, as following
module Main where
import System
import Numeric.LinearAlgebra
vsum1 :: Vector Double -> Double
vsum1 v = constant 1 (dim v) <.> v
vsum2 :: Vector Double -> Double
vsum2 v = go (d - 1) 0
    where
      d = dim v
      go :: Int -> Double -> Double
      go 0 s = s + (v @> 0)
      go !j !s = go (j - 1) (s + (v @> j))
mean :: Vector Double -> Double
mean v = vsum v / fromIntegral (dim v)
    where vsum = vsum1
main :: IO ()
main = do
  fn:nrow:ncol:_ <- getArgs
  print . mean . flatten =<< fromFile fn (read nrow, read ncol)
Compile it with "-O2 -optc-O2", and run with a data set of
length 5000000.  The results are
with "vsum1":
 80,077,984 bytes allocated in the heap
      2,208 bytes copied during GC (scavenged)
         64 bytes copied during GC (not scavenged)
 40,894,464 bytes maximum residency (2 sample(s))
  %GC time       0.0%  (0.0% elapsed)
  Alloc rate    35,235,448 bytes per MUT second
./vsum1 huge 5000000 1  2.25s user 0.09s system 99% cpu 2.348 total
This is reasonable, exactly two copies of vector with size
of 40MB.
with "vsum2":
560,743,120 bytes allocated in the heap
     19,160 bytes copied during GC (scavenged)
     15,920 bytes copied during GC (not scavenged)
 40,919,040 bytes maximum residency (2 sample(s))
  %GC time       0.3%  (0.3% elapsed)
  Alloc rate    222,110,261 bytes per MUT second
./mean2 huge 5000000 1  2.53s user 0.06s system 99% cpu 2.598 total
This is strange.  A lot of extra usage of heap?  Probably
because '@>' is not efficient?  So it looks like the
inner-product approach wins with a fairly margin.
...
Looking at the core we get from vsum2, we see:
vsum2 compiles quite well
$wgo_s1Nn :: Int# -> Double# -> Double#
$wgo_s1Nn =
        \ (ww3_s1Mc :: Int#) (ww4_s1Mg :: Double#) ->
          case ww3_s1Mc of ds_X17R {
            __DEFAULT ->
              case Data.Packed.Internal.Vector.$wat
                     @ Double Foreign.Storable.$f9 ww_s1Ms ww1_s1Mu (I# ds_X17R)
              of { D# y_a1KQ ->
              $wgo_s1Nn (-# ds_X17R 1) (+## ww4_s1Mg y_a1KQ)
              };
            0 -> +## ww4_s1Mg ww2_s1M7
          };
    } in  $wgo_s1Nn (-# ww_s1Ms 1) 0.0
But note the return value from $wat is boxed, only to be immediately
unboxed. That looks to be the source of the heap allocations.
Let's see if we can help that.
Ah, of course. The problem is the unsafePerformIO used to wrap up the
'peek'. This blocks the optimisations somewhat, after we unpack the Vector type.. 

So if we rewrite 'safeRead' to not use unsafePerformIO, but instead:

    safeRead v = inlinePerformIO . withForeignPtr (fptr v)

    inlinePerformIO (IO m) = case m realWorld# of (# _, r #) -> r

Along with this change to Vector:

    data Vector t = V { dim  :: {-# UNPACK #-} !Int
                      , fptr :: {-# UNPACK #-} !(ForeignPtr t)
                      }

and some inlining for the bounds checks.
Following bytestrings, we see the performance go from, with a 180M input
file:

Goal:
    vsum1:
        $ ghc-core A.hs -optc-O2 -fvia-C  -fbang-patterns
        $ time ./A data 4000 2500
        0.9999727441161678

        ./A data 4000 2500  5.37s user 0.20s system 99% cpu 5.609 total
        160,081,136 bytes allocated in the heap
        155 Mb total memory in use

Before:
vsum2-old:

         $wgo_s1Ns =
        \ (ww3_s1Me :: Int#) (ww4_s1Mi :: Double#) ->
          case ww3_s1Me of ds_X17Y {
            __DEFAULT ->
              case Data.Packed.Internal.Vector.$wat
                     @ Double Foreign.Storable.$f9 ww_s1Mu ww1_s1Mw (I# ds_X17Y)
              of wild1_a1Hg { D# y_a1Hi ->
              $wgo_s1Ns (-# ds_X17Y 1) (+## ww4_s1Mi y_a1Hi)
              };
            0 -> +## ww4_s1Mi ww2_s1M9
          };
    } in  $wgo_s1Ns (-# ww_s1Mu 1) 0.0

    ./A data 4000 2500 +RTS -sstderr 
    0.999972744115876

    ./A data 4000 2500 +RTS -sstderr  6.04s user 0.15s system 99% cpu 6.203 total    
    1,121,416,400 bytes allocated in the heap
    78 Mb total memory in use

After:
    True ->
          case readDoubleOffAddr#
                 @ RealWorld ww1_s1Nr ds_X180 realWorld#
          of wild2_a1HS { (# s2_a1HU, x_a1HV #) ->
          case touch# @ ForeignPtrContents ww2_s1Ns s2_a1HU
          of s_a1HG { __DEFAULT ->
          $wgo_s1Ok (-# ds_X180 1) (+## ww4_s1Ng x_a1HV)
          }

    No needless boxing.

    $ time ./A data 4000 2500 +RTS -sstderr          
    ./A data 4000 2500 +RTS -sstderr 
    0.999972744115876

    ./A data 4000 2500 +RTS -sstderr  5.41s user 0.10s system 99% cpu 5.548 total
     80,071,072 bytes allocated in the heap

    78 Mb total memory in use

And the total runtime and allocation now is better than the fully 'C' version.
Yay.

Patch attached to the darcs version of hmatrix.
Similar changes can likely be done to the other data types in hmatrix,
for pretty much ideal code.

-- Don

Re: [Haskell-cafe] "sum" in hmatrix and blas?

Don Stewart