Profiling slow multiplication of elements read from vector

Hello everybody!

For testing purposes, I punched down a small program which:

- puts 2^n elements into an immutable vector (fromList);
- generates a random index in the vector (using random-mersenne);
- reads the value at index i and at i+{-2,-1,1,2} and takes the product of these values (making sure the indices stay within the bounds of the vector using `mod 2^n');
- makes a new vector with the product replacing the element at the index;
- repeats the whole process 10^7 times, starting from the random index.

Now, after enforcing a minimum of strictness and profiling the program, I find that for some odd reason the program spends most of its time calculating the product of the read numbers! This is completely contrary to what I expected, namely that it would spend most of its time either (a) generating random numbers, (b) reading the vector, or (c) making a new vector. But it turns out that reading the vector, or indeed making the new one, takes only a fraction of the actual time. I have attached the program and the profile to this email.

Could it be that profiling assigns the cost centres to the wrong parts of the program? Did I overlook something crucial? Most likely the program could be implemented in a smarter way, but it is pretty close to what my actual program looks like, and it shows similar behaviour. (My actual code additionally has a Boltzmann-weighted probability condition on whether the new value gets written to the vector or not.)

I would be very happy for some input on this, because I am pretty new to Haskell and I don't really know how to do proper profiling.

Cheers
Janis

(This being my first email to this mailing list, I hope I adhere to all rules of conduct out there; please tell me if I should have written this email differently or more concisely.)
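For readers without the attachment, the structure described above might look roughly like this (a minimal sketch; the names `lukUp' and `neighProd' follow the later discussion, and the random index is supplied as a plain argument instead of coming from the RNG):

```haskell
import qualified Data.Vector as V

-- Look up index i in vs, wrapping with `mod l` so that out-of-range
-- neighbour indices stay within the vector (periodic boundary).
lukUp :: V.Vector Int -> Int -> Int -> Int
lukUp vs l i = vs V.! (i `mod` l)

-- Product of the element at i and its four neighbours i-2, i-1, i+1, i+2.
neighProd :: V.Vector Int -> Int -> Int -> Int
neighProd vs l i = product [lukUp vs l j | j <- [i-2, i-1, i, i+1, i+2]]

-- One step: build a new immutable vector with the product written
-- back at the (wrapped) index i.
step :: V.Vector Int -> Int -> V.Vector Int
step vs i = vs V.// [(i `mod` l, neighProd vs l i)]
  where l = V.length vs
```

Note that `V.//` copies the whole vector on every step, which is the part one might naively expect to dominate the profile.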

Hi,

On Tuesday, 11.12.2012, at 15:25 +0100, Richard Janis Beckert wrote:
I would be very happy for some input on this, because I am pretty new to Haskell and I don't really know how to do proper profiling.
by looking at the core (-ddump-simpl) I found a few issues.

    neighProd vs l i =
        foldl' (\acc x -> acc * lukUp vs l x) 1 [i-2, i-1, i, i+1, i+2]

will, for every call to neighProd, create a list of boxed integers and fold over it. I expected this boxing and unboxing to matter, so I switched to:

    neighProd vs l i = lukUp vs l (i-2) * lukUp vs l (i-1) * lukUp vs l i
                     * lukUp vs l (i+1) * lukUp vs l (i+2)

This just allocates five machine integers, fills them, and multiplies them. Surprisingly, I did not notice a speed-up.

Reading more core, I noticed that GHC did not figure out that the parameter i to neighProd is always required, and hence it is passed in as a boxed integer. Again, boxing and unboxing cost time. Here

    neighProd vs l !i = lukUp vs l (i-2) * lukUp vs l (i-1) * lukUp vs l i
                      * lukUp vs l (i+1) * lukUp vs l (i+2)

fares better.

There are more problems which I am not sure how to get rid of. For example, every call to div in lukUp checks whether l is zero, _and_ the indexing does bounds checking – the compiler cannot infer that, just because you div'ed by l, the bounds are guaranteed to be good. Using V.unsafeIndex there gives another noticeable speedup.

But not every bang pattern that avoids boxing gives a noticeable speedup. E.g. the bang pattern in

    lukUp :: V.Vector Int -> Int -> Int -> Int
    lukUp vs l i = let !i' = i `mod` l
                   in V.unsafeIndex vs i'

did, according to core, prevent some boxing, but did not speed up the program. Allocating Ints that are unused before the next GC is very cheap.

You can save more time by not passing l around; the size of the vector is already present in vs and can cheaply be accessed via "V.length vs".
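Putting those changes together, a sketch of the suggested version (unrolled product, strict index, and V.unsafeIndex, which is sound here only because `mod l` already keeps the index within bounds for non-empty vectors):

```haskell
{-# LANGUAGE BangPatterns #-}
import qualified Data.Vector as V

-- Strict wrapped lookup; unsafeIndex skips bounds checking, which is
-- sound because `mod l` guarantees 0 <= i' < l whenever l > 0.
lukUp :: V.Vector Int -> Int -> Int -> Int
lukUp vs l i = let !i' = i `mod` l in V.unsafeIndex vs i'

-- Unrolled neighbour product with a strict index argument, so GHC
-- can pass i unboxed instead of as a boxed integer.
neighProd :: V.Vector Int -> Int -> Int -> Int
neighProd vs l !i =
      lukUp vs l (i-2) * lukUp vs l (i-1) * lukUp vs l i
    * lukUp vs l (i+1) * lukUp vs l (i+2)
```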
You can prevent the five checks for whether the length is zero by making that check yourself first – GHC detects that it is in the branch where the length is not 0 (and not -1) and optimizes that out:

    neighProd vs (!i) = case V.length vs of
      0  -> undefined
      -1 -> undefined
      _  -> lukUp vs (i-2) * lukUp vs (i-1) * lukUp vs i
          * lukUp vs (i+1) * lukUp vs (i+2)

The speedup is low; maybe later phases of the compiler did that already.

Oh, and you are creating a new byte array in every invocation of randVec. That is of course expensive. Using mutable arrays seems to be a more natural choice. Or is that really what you want to do? Judging from your code, I would expect that you want to create a new array consisting only of values returned by neighProd; with the current code, when updating the second value you are including the _updated_ first entry in your product.

Disclaimer: I am compiling GHC 7.6.1-rc1 while testing this, so my measurements might be unreliable. Best try it out yourself. Also, this blind-stab optimization is /not/ best practice; I just enjoyed fiddling around.

Greetings,
Joachim

--
Joachim Breitner
e-Mail: mail@joachim-breitner.de
Homepage: http://www.joachim-breitner.de
Jabber-ID: nomeata@joachim-breitner.de
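A compilable version of that hoisted check might look as follows (a sketch: the length is read once from the vector itself, and `error` stands in for the `undefined` branch in the empty case):

```haskell
{-# LANGUAGE BangPatterns #-}
import qualified Data.Vector as V

-- Checking the length once up front puts the hot path in a branch
-- where GHC knows l /= 0, so the per-call zero checks for `mod`
-- can be optimized away there.
neighProd :: V.Vector Int -> Int -> Int
neighProd vs !i = case V.length vs of
  0 -> error "neighProd: empty vector"
  l -> let luk j = V.unsafeIndex vs (j `mod` l)
       in luk (i-2) * luk (i-1) * luk i * luk (i+1) * luk (i+2)
```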

Hey! Thanks a lot for your reply!
Disclaimer: I am compiling GHC 7.6.1-rc1 while testing this, so my measurements might be unreliable. Best try it out yourself.
Also, this blind-stab-optimization is /not/ best practice, I just enjoyed fiddling around.
What /is/ best practice with regard to performance optimization of code such as mine? What options other than blind stabbing do I have?

As for applying your suggestions: I have improved the speed of my overall program by about 15% (see the new profile I have obtained), but it still seems as if the program spends too much time reading from the vector and multiplying 5 integers --- or are the cost centres perhaps assigned wrongly?

I tried various combinations of the changes you suggest and had some very interesting results. It seems like there are some performance differences between GHC 7.4.2 (which I use due to cabal-dev) and GHC 7.6.*. Using my initial code, I get a runtime of around 15.7s.

On 12/11/12 at 10:43pm, Joachim Breitner wrote:
neighProd vs l !i = lukUp vs l (i-2) * lukUp vs l (i-1) * lukUp vs l (i) * lukUp vs l (i+1) * lukUp vs l (i+2)
Using that code rather than a map actually leads to slower results! I am getting an average of 18.0s for this. Interestingly, passing the index `!i' as an unboxed Int leads to a time increase in my map version --- 16.8s. However:
Using V.unsafeIndex there gives another noticeable speedup.
This helps a lot, giving me 14.8s and 15.1s for your suggestion; but combining it with
lukUp :: V.Vector Int -> Int -> Int -> Int
lukUp vs l i = let !i' = i `mod` l in V.unsafeIndex vs i'
did, according to core, prevent some boxing, but did not speed up the program. Allocating Ints that are unused before the next GC is very cheap.
gives me a time of about 12.8s with your suggestions (explicit calls to `lukUp', unsafeIndex'ing), which is the fastest result I got so far!
You can save more time by not passing l around; the size of the vector is already present in vs and can cheaply be accessed via "V.length vs".
I decided to pass l around explicitly in order to save on expensive length calculations. As it turns out, checking the bounds gives me a decrease in speed in all cases. Furthermore, using `mod' is actually a feature of my underlying model, as it enforces periodic boundary conditions (think of the vector as representing a circle). Therefore the boundary checking just happens incidentally.
Oh, and you are creating a new byte array in every invocation of randVec. That is of course expensive. Using mutable arrays seems to be a more natural choice. Or is that really what you want to do? Judging from your code I would expect that you want to create a new array consisting only of values returned by neighProd; with the current code, when updating the second value you are including the _updated_ first entry in your product.
I will try to use a mutable vector to see if I can speed the thing up further, since I do indeed want to update my vector one element at a time, taking into account each element's four nearest and next-to-nearest neighbours (as you can see, every iteration of the code acts on the updated vector). I initially decided to go with an immutable structure (a list at first, followed by a Data.Map structure) in order to work without side effects.

I'd be happy to get more input about how to increase the performance of my code!

Cheers
Janis
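An in-place version along those lines could be sketched as follows (hypothetical `updateAll`; the stream of random indices is abstracted into a list argument):

```haskell
import Control.Monad (forM_)
import Control.Monad.ST (runST)
import qualified Data.Vector as V
import qualified Data.Vector.Mutable as M

-- Run all updates in place on a single mutable copy inside ST,
-- instead of rebuilding an immutable vector on every step.
updateAll :: V.Vector Int -> [Int] -> V.Vector Int
updateAll vs0 ixs = runST $ do
  mv <- V.thaw vs0
  let l = M.length mv
      luk i = M.read mv (i `mod` l)   -- wrapped (periodic) lookup
  forM_ ixs $ \i -> do
    p <- product <$> mapM luk [i-2, i-1, i, i+1, i+2]
    M.write mv (i `mod` l) p          -- later steps see this update
  V.freeze mv
```

Because the vector is mutated in place, each product sees the updates made by earlier iterations, matching the sequential behaviour described above.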
participants (2)
-
Joachim Breitner
-
Richard Janis Beckert