 
Hi,

On Tuesday, 11.12.2012, at 15:25 +0100, Richard Janis Beckert wrote:
> I would be very happy for some input on this, because I am pretty new
> to Haskell and I don't really know how to do proper profiling.
By looking at the core (-ddump-simpl) I found a few issues.

  neighProd vs l i = foldl' (\acc x -> acc * lukUp vs l x) 1 [i-2, i-1, i, i+1, i+2]

will, for every call to neighProd, create a list of boxed integers and fold over it. I expected this boxing and unboxing to matter, so I switched to

  neighProd vs l i = lukUp vs l (i-2) * lukUp vs l (i-1) * lukUp vs l i
                   * lukUp vs l (i+1) * lukUp vs l (i+2)

which just allocates five machine integers, fills them and then multiplies them. Surprisingly, I did not notice a speedup.

Reading more of the core I noticed that GHC did not figure out that the parameter i to neighProd is always required, and hence it is passed in boxed. Again, boxing and unboxing costs time. Here

  neighProd vs l !i = lukUp vs l (i-2) * lukUp vs l (i-1) * lukUp vs l i
                    * lukUp vs l (i+1) * lukUp vs l (i+2)

fares better.

There are more problems that I am not sure how to get rid of. For example, every call to div in lukUp checks whether l is zero, _and_ the indexing does bounds checking – the compiler cannot infer that just because you div'ed by l the index is guaranteed to be in bounds. Using V.unsafeIndex there gives another noticeable speedup.

But not every bang pattern that avoids boxing gives a noticeable speedup. E.g. the bang pattern in

  lukUp :: V.Vector Int -> Int -> Int -> Int
  lukUp vs l i = let !i' = i `mod` l
                 in  V.unsafeIndex vs i'

did, according to the core, prevent some boxing, but did not speed up the program. Allocating Ints that are unused before the next GC is very cheap.

You can save more time by not passing l around: the size of the vector is already present in vs and can cheaply be accessed via "V.length vs".

You can also avoid the five checks for a zero length by making that check yourself first – GHC detects that the remaining branch is the one where the length is neither 0 nor -1 and optimizes the checks away:

  neighProd vs (!i) = case V.length vs of
    0    -> undefined
    (-1) -> undefined
    _    -> lukUp vs (i-2) * lukUp vs (i-1) * lukUp vs i
          * lukUp vs (i+1) * lukUp vs (i+2)

The speedup is low; maybe later phases of the compiler did that already.

Oh, and you are creating a new byte array in every invocation of randVec. That is of course expensive. Using mutable arrays seems to be a more natural choice. Or is that really what you want to do? Judging from your code I would expect that you want to create a new array consisting only of values returned by neighProd; with the current code, when updating the second value you are including the _updated_ first entry in your product.

Disclaimer: I am compiling GHC 7.6.1-rc1 while testing this, so my measurements might be unreliable. Best try it out yourself. Also, this blind-stab optimization is /not/ best practice; I just enjoyed fiddling around.

Greetings,
Joachim

-- 
Joachim Breitner
  e-Mail: mail@joachim-breitner.de
  Homepage: http://www.joachim-breitner.de
  Jabber-ID: nomeata@joachim-breitner.de
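
P.S.: To show how the pieces fit together, here is a minimal sketch of the combined suggestions (dropping the l parameter, V.unsafeIndex, a bang pattern, and the explicit length check). I am guessing that V is Data.Vector.Unboxed and that these signatures match your code, so take it as an assumption rather than a drop-in replacement:

  {-# LANGUAGE BangPatterns #-}
  import qualified Data.Vector.Unboxed as V   -- assumption: unboxed vectors

  -- index modulo the vector length, without bounds checking
  lukUp :: V.Vector Int -> Int -> Int
  lukUp vs i = V.unsafeIndex vs (i `mod` V.length vs)

  -- matching on 0 and -1 first lets GHC drop the checks hidden in `mod`
  neighProd :: V.Vector Int -> Int -> Int
  neighProd vs (!i) = case V.length vs of
      0    -> undefined
      (-1) -> undefined
      _    -> lukUp vs (i-2) * lukUp vs (i-1) * lukUp vs i
            * lukUp vs (i+1) * lukUp vs (i+2)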
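
P.P.S.: If you indeed want every entry of the new array to be computed only from the _old_ values, building the result in one go avoids seeing already-updated neighbours. A sketch, using neighProd from the P.S. above; "step" is just a name I made up for your update function:

  -- build a fresh vector where position i is neighProd of the old vector
  step :: V.Vector Int -> V.Vector Int
  step vs = V.generate (V.length vs) (neighProd vs)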