Poor vector sort performance

Hi! I've got dramatically poor sort performance: about 2 us of time and 5 kB of allocation to sort an unboxed vector of 4 (four) Doubles. Code:

import Data.List
import Control.Monad
import Control.Monad.ST
import qualified Data.Vector.Unboxed as V
import qualified Data.Vector.Unboxed.Mutable as MV
import qualified Data.Vector.Algorithms.Intro as Intro
import Data.IORef

arr = V.fromList ([1,2] :: [Double])

foo x = V.head q
  where
    q = runST $ do
      res <- V.unsafeThaw $ V.concat
               [ V.map (\e -> e + x) arr
               , V.map (\e -> e - x) arr]
      Intro.sort res
      V.unsafeFreeze res

main = do
  ref <- newIORef 0
  forM_ [0..100000] $! \i -> do
    modifyIORef' ref (+(foo $ fromInteger i)) -- for foo not to optimize out
  readIORef ref >>= print

$ ghc -O2 sort.hs && time ./sort
[1 of 1] Compiling Main             ( sort.hs, sort.o )
Linking sort ...
-4.999949999e9

real    0m0.189s
user    0m0.184s
sys     0m0.000s

Does anybody know what's going on?

On 20-08-2014 21:24, Alexander Pakhomov wrote:
ref <- newIORef 0
forM_ [0..100000] $! \i -> do
  modifyIORef' ref (+(foo $ fromInteger i)) -- for foo not to optimize out
readIORef ref >>= print
ghc -O2 sort.hs && time ./sort
My first recommendation is to use criterion. It will give you a much better idea of the timing of your function, especially since it does not do much work.

Cheers,

-- Felipe.

1) The problem is not the measurement. The whole foo function is a bottleneck in my program.
2) Changing Vector.Algorithms.Intro.sort to List.sort improves performance 2-3x; removing the sort entirely improves it 30x.
C++ is 30x faster on this task. That's strange, because Data.Vector.Unboxed.Mutable should provide good performance.
And C++ spends all its time in vector allocation; with an on-stack dynamic array it is 400 times faster.
3) Anyway, the criterion measurement is 1.6 us (on a different machine).
Measurement code is:
import Data.List
import Control.Monad
import Control.Monad.ST
import qualified Data.Vector.Unboxed as V
import qualified Data.Vector.Unboxed.Mutable as MV
import qualified Data.Vector.Algorithms.Intro as Intro
import Data.IORef
import Criterion.Main
arr = V.fromList ([1,2] :: [Double])

foo x = V.head q
  where
    q = runST $ do
      res <- V.unsafeThaw $ V.concat
               [ V.map (\e -> e + x) arr
               , V.map (\e -> e - x) arr]
      Intro.sort res
      V.unsafeFreeze res

main = do
  ref <- newIORef 0
  defaultMain [
    bgroup "sort" [ bench "10" $ whnf foo 10]]
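As an aside on the allocation question: in this benchmark foo only ever observes the head of the sorted four-element vector, which is just the minimum of the four generated values, so a vector-free baseline is possible. A hypothetical sketch for comparison (fooMin is not part of the original program):

-- Hypothetical allocation-free baseline: for the four generated
-- values 1+x, 2+x, 1-x, 2-x, the head of the sorted vector is
-- simply their minimum.
fooMin :: Double -> Double
fooMin x = min (min (1 + x) (2 + x)) (min (1 - x) (2 - x))
-- e.g.: bgroup "sort" [ bench "10" $ whnf fooMin 10 ]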

I've added a few more variants to your test code (see below) and I'm not sure what to think. Especially if you look at Optimal.sort4ByOffset's source code, it makes no sense that it would take so much time.

So my hypothesis is that most of the time is spent on the `compare` calls. If you take a look at the generated core, sort4ByOffset is not inlined, let alone sort4ByIndex. Also, neither of these two has SPECIALIZE pragmas for Doubles (or anything else, for that matter). GHC is then calling the function with a reference to `compare`, which turns a few CPU cycles into a function call that constructs a value which is then immediately deconstructed.

I suggest that you add a few SPECIALIZE pragmas to vector-algorithms and check its performance again.

{-# LANGUAGE Rank2Types #-}

import Control.Monad
import Control.Monad.ST
import qualified Data.Vector.Unboxed as V
import qualified Data.Vector.Unboxed.Mutable as MV
import qualified Data.Vector.Algorithms.Heap as Heap
import qualified Data.Vector.Algorithms.Insertion as Insertion
import qualified Data.Vector.Algorithms.Intro as Intro
import qualified Data.Vector.Algorithms.Merge as Merge
import qualified Data.Vector.Algorithms.Optimal as Optimal
import qualified Data.Vector.Algorithms.Radix as Radix
import Data.IORef
import Criterion.Main

type Value = Double

arr = V.fromList ([1,2] :: [Value])

foo :: (forall s. V.MVector s Value -> ST s ()) -> Value -> Value
foo f x = V.head q
  where
    q = runST $ do
      res <- V.unsafeThaw $ V.concat
               [ V.map (\e -> e + x) arr
               , V.map (\e -> e - x) arr]
      f res
      V.unsafeFreeze res

main = do
  ref <- newIORef 0
  defaultMain
    [ b "--nothing--" (const $ return ())
    , b "--best--"    best
    , b "Heap"        Heap.sort
    , b "Insertion"   Insertion.sort
    , b "Intro"       Intro.sort
    , b "Merge"       Merge.sort
    , b "Optimal"     (\v -> Optimal.sort4ByOffset compare v 0)
    ]
  where
    b :: String -> (forall s. V.MVector s Value -> ST s ()) -> Benchmark
    b s f = bgroup s [ bench "10" $ whnf (foo f) 10 ]

    best :: forall s. V.MVector s Value -> ST s ()
    best res = do      -- [ 11, 12, -9, -8 ]
      MV.swap res 0 2  -- [ -9, 12, 11, -8 ]
      MV.swap res 1 3  -- [ -9, -8, 11, 12 ]

-- Felipe.
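For concreteness, such a pragma might look something like the following sketch; the instantiation to ST and unboxed Double vectors is an assumption based on sort4ByOffset's polymorphic signature, and it is untested:

-- Sketch only: a specialization of sort4ByOffset to ST and
-- unboxed Double vectors, assuming something like
--   import qualified Data.Vector.Unboxed.Mutable as MU
-- is in scope where the pragma is placed.
{-# SPECIALIZE sort4ByOffset
      :: Comparison Double -> MU.MVector s Double -> Int -> ST s () #-}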

On 21-08-2014 11:22, Felipe Lessa wrote:
I suggest that you add a few SPECIALIZE pragmas to vector-algorithms and check its performance again.
Scratch that, I remembered that one may just copy & paste :).

Here's the code and criterion results: https://gist.github.com/meteficha/1acef6dc1e1ed81b63ae

This is the relevant part:

benchmarking --nothing--/10
time                 151.5 ns   (151.2 ns .. 151.8 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 151.8 ns   (151.6 ns .. 152.1 ns)
std dev              868.8 ps   (732.4 ps .. 1.030 ns)

benchmarking --best--/10
time                 157.4 ns   (157.1 ns .. 157.7 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 157.1 ns   (156.6 ns .. 157.4 ns)
std dev              1.297 ns   (786.7 ps .. 2.366 ns)

benchmarking Optimal/10
time                 1.173 μs   (1.168 μs .. 1.180 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 1.194 μs   (1.187 μs .. 1.204 μs)
std dev              27.45 ns   (22.18 ns .. 36.34 ns)
variance introduced by outliers: 29% (moderately inflated)

benchmarking Optimal'/10
time                 157.7 ns   (157.5 ns .. 158.0 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 157.8 ns   (157.5 ns .. 158.2 ns)
std dev              1.178 ns   (872.4 ps .. 1.965 ns)

Optimal', which is a copy of Optimal, takes almost the same time as --best--, which does no comparisons at all!

So vector-algorithms really needs either to force INLINE (leading to code size bloat) or to sprinkle SPECIALIZE pragmas everywhere. I suggest contacting the package's maintainer to see what their thoughts are on this matter.

Cheers,

-- Felipe.
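The copy & paste trick amounts to handing GHC a monomorphic copy of the sorting network. A minimal sketch of what such a hand-specialized copy might look like (sort4Double and cswap are illustrative names, not the gist's actual code; the five comparators are the classic optimal network for 4 elements):

import Control.Monad (when)
import Control.Monad.ST (ST)
import qualified Data.Vector.Unboxed.Mutable as MV

-- Hand-specialized 4-element sorting network for Double, in the
-- spirit of Optimal.sort4ByOffset at offset 0; nothing is left
-- polymorphic, so there is nothing for GHC to fail to specialize.
sort4Double :: MV.MVector s Double -> ST s ()
sort4Double v = do
    cswap 0 1
    cswap 2 3
    cswap 0 2
    cswap 1 3
    cswap 1 2
  where
    -- compare-and-swap: ensure v[i] <= v[j]
    cswap i j = do
      a <- MV.unsafeRead v i
      b <- MV.unsafeRead v j
      when (b < a) $ do
        MV.unsafeWrite v i b
        MV.unsafeWrite v j a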

Hi Felipe,
Which GHC version do you use?
With 7.6.3 I get all the same results, BUT Optimal'/10 comes in at 260 ns. That's better than 1200 ns
but still bad. Also, there are INLINE pragmas everywhere in vector-algorithms.
It seems to be a compiler problem.
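One way to check whether those INLINE pragmas actually fire is to look at the simplified Core. A small sketch using standard GHC dump flags (reading the output is left as an exercise):

-- Add at the top of the benchmark module, or pass the flags on
-- the command line as: ghc -O2 -ddump-simpl -dsuppress-all sort.hs
{-# OPTIONS_GHC -ddump-simpl -dsuppress-all #-}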

On Thu, Aug 21, 2014 at 3:51 PM, Alexander Pakhomov wrote:

Which GHC version do you use? With 7.6.3 I get all the same results, BUT Optimal'/10 comes in at 260 ns. That's better than 1200 ns but still bad. Also, there are INLINE pragmas everywhere in vector-algorithms. It seems to be a compiler problem.
There's a lot of sensitivity in the compiler to exactly how and which rules are specified. I already brought up this regression with Dan a few weeks ago, and we didn't come up with a good way to deal with it. I don't think it's quite fair to call it a compiler problem, so much as just a change that caused a previously-good-but-fragile behaviour to now not be so good.

To be more specific...
GHC used to inline the code you're exercising, which allows it to
specialize to the comparison function, which allows it to keep all the
numbers unboxed. But the current version of GHC doesn't do this, and if I
force it to inline, the compiler will just die with a 'simplifier ticks
exhausted' message.
I've been thinking about ways to avoid this situation, and have some ideas
(one of which would work for your use case). But I guess in the meantime I
should turn off use of these functions.
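For reference, the panic message itself suggests raising the simplifier tick limit; a hedged sketch of that workaround (untested here, and it may just postpone the blow-up rather than fix it):

-- Possible workaround for "simplifier ticks exhausted" when
-- forcing INLINE (GHC's default tick factor is 100):
{-# OPTIONS_GHC -fsimpl-tick-factor=200 #-}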
participants (4)
- Alexander Pakhomov
- Bryan O'Sullivan
- Dan Doel
- Felipe Lessa