
Wow, now performance is on par with Java ;) So the slow division was the main problem, that and the GC. Thanks!
From: daniel.is.fischer@googlemail.com
To: haskell-cafe@haskell.org
CC: bmaxa@hotmail.com
Subject: Re: [Haskell-cafe] Why is boxed mutable array so slow?
Date: Sat, 1 Dec 2012 21:12:29 +0100
On Saturday, 1 December 2012, 16:09:05, Branimir Maksimovic wrote:
All in all, even the unboxed array is about 10 times slower than the Java version. I don't understand why even the unboxed array is so slow.
It's not the unboxed arrays that are slow.
Your code has a couple of weak spots, and GHC's native code generator has a weakness that bites here.
On my box, I don't quite see a 10× difference to my translation to Java; it's a bit less than 7× (0.82s vs 0.12s) with the LLVM backend and 8× (0.93s) with the native code generator. (I don't want to bring my box to its knees by running something that takes 3GB+ of RAM, so I run the unboxed array part only.) That's in the same ballpark, though.
So what's the deal?
Main.main_$s$wa1 [Occ=LoopBreaker]
  :: GHC.Prim.Int#
     -> GHC.Prim.Int#
     -> GHC.Prim.State# GHC.Prim.RealWorld
     -> GHC.Types.Int
     -> GHC.Types.Int
     -> GHC.Types.Int
     -> ...
Your loops carry boxed Ints around, which is always a bad sign. In this case it doesn't hurt too much, however, since these values are neither read nor updated during the loop (they're the first and last index of the array and the number of elements). Additionally, the loops carry an IOUArray constructor around, which is unnecessary. Eliminating a couple of dead parameters,
init' a = do
    (_, n) <- getBounds a
    let init k
          | k > n     = return ()
          | otherwise = do
              let x = fromIntegral $ k + k `div` 3
              unsafeWrite a k x
              init (k+1)
    init 0
partial_sum a = do
    (_, n) <- getBounds a
    let ps i s
          | i > n     = return ()
          | otherwise = do
              k <- unsafeRead a i
              let l = s + k
              unsafeWrite a i l
              ps (i+1) l
    k <- unsafeRead a 0
    ps 1 k
brings the time for the native code generator down to 0.82s, and for the LLVM backend the time remains the same.
Next problem: you're using `div` for the division.
`div` does some checking and potentially a fixup step (not needed here, since everything is non-negative) after the machine division, because `div` is specified to satisfy

    a = (a `div` b) * b + (a `mod` b)

with 0 <= a `mod` b < abs b (for positive b).
That is in itself slower than the plain machine division you get with `quot`.
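To make the difference concrete (my own illustration, not part of the original mail): for non-negative operands `div` and `quot` agree, but for negative ones they don't, which is where the extra fixup comes in:

    -- quot/rem truncate toward zero (what the machine division gives you);
    -- div/mod round toward negative infinity, hence the possible fixup.
    main :: IO ()
    main = do
        print ((-7) `quot` 3, (-7) `rem` 3)  -- (-2,-1): truncated toward zero
        print ((-7) `div`  3, (-7) `mod` 3)  -- (-3, 2): rounded toward -infinity
        print (  7  `quot` 3,   7  `div` 3)  -- ( 2, 2): no difference when non-negative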
So let's see what we get with `quot`.
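Concretely, that's a one-token change in init' (the rest of the loop stays as above):

    let x = fromIntegral $ k + k `quot` 3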
0.65s with the native code generator, and 0.13s with the LLVM backend.
Whoops, what's that?
The problem is that division is a *very* slow operation. You can see this by manually replacing k `quot` 3 with

    (k * 2863311531) `shiftR` 33

(requires 64-bit Ints; the Java equivalent is k*2863311531L >> 33): with that change, the native backend, the LLVM backend and Java (as well as C) all take more or less the same time [well, the NCG is a bit slower than the other two: 0.14s vs 0.11s and 0.11s].
Java and LLVM know how to replace the division by the constant 3 with a multiplication, a couple of shifts and an addition (since we never have negative numbers here, a single multiplication and shift would suffice, but neither Java nor LLVM can do that on their own because non-negativity isn't guaranteed by the type). The native code generator doesn't - not yet.
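As a quick sanity check (my own sketch, not from the original mail), the multiply-and-shift form can be compared against `quot` 3 over the non-negative range used here:

    import Data.Bits (shiftR)

    -- Strength-reduced division by 3: valid for non-negative k as long as
    -- k * 2863311531 doesn't overflow a 64-bit Int (k up to roughly 3*10^9),
    -- because 2863311531 = (2^33 + 1) `div` 3.
    divBy3 :: Int -> Int
    divBy3 k = (k * 2863311531) `shiftR` 33

    main :: IO ()
    main = print $ all (\k -> divBy3 k == k `quot` 3)
                       ([0 .. 1000000] ++ [123456789, 2000000000])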
So the programme spends the majority of the time dividing. The array reads and writes are on par with Java's (and, for that matter, C's).
If you make the divisor a parameter instead of a compile-time constant, the NCG is not affected at all, and the LLVM backend gives you the same performance as the NCG (it can't optimise a division by a divisor it doesn't know). Java is at an advantage there: after a while the JIT sees that it might be a good idea to optimise the division, so its time only trebles.
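For concreteness, a hedged sketch of what "divisor as a parameter" means here (the name init'With and the Int64 element type are my assumptions, not taken from the original code):

    import Data.Array.Base (unsafeWrite)
    import Data.Array.IO   (IOUArray, getBounds, getElems, newArray)
    import Data.Int        (Int64)

    -- The divisor d is only known at run time, so no backend can turn the
    -- `quot` into a multiply-and-shift at compile time.
    init'With :: IOUArray Int Int64 -> Int -> IO ()
    init'With a d = do
        (_, n) <- getBounds a
        let go k
              | k > n     = return ()
              | otherwise = do
                  unsafeWrite a k (fromIntegral (k + k `quot` d))
                  go (k + 1)
        go 0

    main :: IO ()
    main = do
        arr <- newArray (0, 9) 0 :: IO (IOUArray Int Int64)
        init'With arr 3
        getElems arr >>= print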