Are there performant mutable Arrays in Haskell?

Hey there,

I am trying to write a hash algorithm that is in fact working, but as you might have guessed, the problem is the performance :) At the moment I am 40 times slower than the same implementation in C.

My problem is that I need mutable arrays, which are the heart of that hash. The algorithm is round-based, and within each round I need the values at certain positions in that array. At the moment I work with an unboxed ST array at the crucial part:

--- snip ---
test list = runSTUArray $ do
    a_arr <- newListArray (0, 1752) list
    let round i = do
            ain <- readArray a_arr (i - n)
            at0 <- readArray a_arr (i - t0)
            at1 <- readArray a_arr (i - t1)
            at2 <- readArray a_arr (i - t2)
            at3 <- readArray a_arr (i - t3)
            at4 <- readArray a_arr (i - t4)
            let li = ls $ (i - n) `mod` 16
                ri = rs $ (i - n) `mod` 16
            writeArray a_arr i $
                step4 li . step3 ri . step2 at1 at2 at3 at4 $ step1 ain at0 i
    mapM_ round [n .. n + t - 1]
    return a_arr
--- snap ---

I also played around with peekElemOff and pokeElemOff, but this isn't much faster. My Haskell code takes ~30 sec, while the C implementation needs only ~1 sec.

Does someone have some hints about what I can do better, or can someone just lead me in the right direction? :)

Thanks a lot in advance

-- Matthias

Hello Matthias,

Thursday, March 19, 2009, 2:16:30 PM, you wrote the message above.

1. use ghc -O2
2. use unsafeRead/unsafeWrite
3. check that all calculations (step*, li/ri) are strict
--
Best regards,
Bulat                          mailto:Bulat.Ziganshin@gmail.com
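To make hints 2 and 3 concrete, here is a minimal sketch (an illustration, not Matthias's algorithm: the recurrence, the xor/rotateL combiner, the Word32 element type, and the index distance 16 are all made up, since the real step functions aren't shown in the thread):

{-# LANGUAGE BangPatterns #-}
-- Sketch of hints 2 and 3 on a made-up recurrence: unsafeRead and
-- unsafeWrite (Data.Array.Base) skip the bounds checks that
-- readArray/writeArray perform, and the bang on x makes the
-- strictness explicit (a write to an unboxed array forces its
-- element in any case, so no thunk ever reaches the array).
import Control.Monad (forM_)
import Data.Array.Base (unsafeRead, unsafeWrite)
import Data.Array.ST (newListArray, runSTUArray)
import Data.Array.Unboxed (UArray)
import Data.Bits (rotateL, xor)
import Data.Word (Word32)

expand :: [Word32] -> UArray Int Word32  -- list needs 1753 elements
expand list = runSTUArray $ do
    arr <- newListArray (0, 1752) list
    forM_ [16 .. 1752] $ \i -> do
        a <- unsafeRead arr (i - 16)  -- offsets equal indices here,
        b <- unsafeRead arr (i - 1)   -- because the array is 0-based
        let !x = (a `xor` b) `rotateL` (i `mod` 16)
        unsafeWrite arr i x
    return arr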

Thx for your hints, I played around with them and the performance gets slightly better. But the major boost is still missing :)

I noticed that one real bottleneck seems to be the conversion of the array back into a list. The interesting part is, if I use the elems function (Data.Array.Base), the performance is about 4x better than with my own function. So I thought I'd write my own version of elems (one that just converts part of the array to a list), and I fall back to the same performance as my first approach.

To make a long story short, here is the library code:

elems arr = case bounds arr of
    (_l, _u) -> [unsafeAt arr i | i <- [0 .. numElements arr - 1]]

And my version:

boundedElems arr = case bounds arr of
    (_l, _u) -> [unsafeAt arr i | i <- [1737 .. 1752]]

Is there a reason why the library version is 4 times faster than mine?
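One variant worth trying here, as a sketch (my suggestion, not something proposed in the thread; the Word32 element type is assumed): reuse the library's elems and slice its result, so the traversal code stays the library's and only the 1737..1752 window from boundedElems changes:

-- Reuse elems and slice its result. Because the list is lazy, the
-- dropped prefix is just cons cells holding unevaluated unsafeAt
-- thunks; none of the first 1737 elements are actually read.
import Data.Array.Unboxed (UArray, elems)
import Data.Word (Word32)

boundedElems :: UArray Int Word32 -> [Word32]
boundedElems = take 16 . drop 1737 . elems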

"Brettschneider, Matthias"
Thx for your hints, I played around with them and the performance gets slightly better. But the major boost is still missing :)
I noticed, that one real bottleneck seems to be the conversion of the array back into a list. The interesting part is, if I use the elems function (Data.Array.Base) the performance is about 4x better then with my own function. So I thought, I write my own version of elems, (that just converts a part of the array to a list) and I fall back into the same performance as my first approach.
To make a long story short, here is the library code: elems arr = case bounds arr of (_l, _u) -> [unsafeAt arr i | i <- [0 .. numElements arr - 1]
And my version: boundedElems arr = case bounds arr of (_l, _u) -> [unsafeAt arr i | i <- [1737 .. 1752]]
Is there a reason, why the library version is 4 times faster, than mine?
Uhhhmmm... I've got no idea. But as the list is constructed lazily, you shouldn't expect a real speedup from being lazy manually, anyway. You might want to try the stream-fusion implementation of lists, or get rid of lists altogether.

--
(c) this sig last receiving data processing entity. Inspect headers for copyright history. All rights reserved. Copying, hiring, renting, performance and/or quoting of this signature prohibited.

barsoap:

> But as the list is constructed lazily, you shouldn't expect a real
> speedup from being lazy manually, anyway. You might want to try the
> stream-fusion implementation of lists, or get rid of lists altogether.
Yes, avoid constructing the list if you're not doing that in C.
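A sketch of what avoiding the list could look like (my example; the xor combiner, the Word32 element type, and the 1737..1752 window from boundedElems stand in for whatever the hash actually does with the final words):

{-# LANGUAGE BangPatterns #-}
-- Consume the array with a strict accumulator loop instead of
-- materializing an intermediate list.
import Data.Array.Base (unsafeAt)
import Data.Array.Unboxed (UArray)
import Data.Bits (xor)
import Data.Word (Word32)

foldTail :: UArray Int Word32 -> Word32
foldTail arr = go 1737 0
  where
    go !i !acc
      | i > 1752  = acc
      | otherwise = go (i + 1) (acc `xor` unsafeAt arr i)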

"Brettschneider, Matthias"
Thx for your hints, I played around with them and the performance gets slightly better. But the major boost is still missing :)
I noticed, that one real bottleneck seems to be the conversion of the array back into a list. The interesting part is, if I use the elems function (Data.Array.Base) the performance is about 4x better then with my own function. So I thought, I write my own version of elems, (that just converts a part of the array to a list) and I fall back into the same performance as my first approach.
To make a long story short, here is the library code: elems arr = case bounds arr of (_l, _u) -> [unsafeAt arr i | i <- [0 .. numElements arr - 1]
And my version: boundedElems arr = case bounds arr of (_l, _u) -> [unsafeAt arr i | i <- [1737 .. 1752]]
Is there a reason, why the library version is 4 times faster, than mine?
There shouldn't be any reason. Try putting

{-# INLINE boundedElems #-}

on it to make it inline; it might be faster.

Xiao-Yong Jin wrote:

> There shouldn't be any reason. Try putting
>
> {-# INLINE boundedElems #-}
>
> on it to make it inline; it might be faster.
Ah, OK, I did that, but it makes no difference. :) If I got it right, there exists a C translation for the library version, while for my version there isn't.

Brettschneider:

> I am trying to write a hash algorithm that is in fact working, but as
> you might have guessed, the problem is the performance :) At the moment
> I am 40 times slower than the same implementation in C.
You should be able to translate the C directly into approximately the same kind of code using STUArrays. Make sure to compile with -O2, and use unsafeRead/unsafeWrite where you also do unchecked reads/writes in C.

An example: http://shootout.alioth.debian.org/gp4/benchmark.php?test=nsievebits&lang=ghc&id=1

-- Don
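In the spirit of the linked benchmark, a toy version of the "translate the C directly" approach (my own sketch, not the shootout entry): one mutable unboxed array, tight loops, unchecked reads and writes, compiled with -O2. One subtlety: unsafeRead/unsafeWrite take raw 0-based offsets rather than Ix indices, which is why the code below subtracts 2; with Matthias's (0, 1752) bounds the two coincide.

-- A C-style sieve loop against an STUArray (illustrative only).
import Control.Monad (forM_, when)
import Data.Array.Base (unsafeRead, unsafeWrite)
import Data.Array.ST (newArray, runSTUArray)
import Data.Array.Unboxed (UArray)

sieve :: Int -> UArray Int Bool  -- True means "i is prime"
sieve m = runSTUArray $ do
    arr <- newArray (2, m) True
    forM_ [2 .. m] $ \i -> do
        alive <- unsafeRead arr (i - 2)       -- offset, not Ix index
        when alive $
            forM_ [i + i, i + i + i .. m] $ \j ->
                unsafeWrite arr (j - 2) False -- mark multiples of i
    return arr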
participants (5)
- Achim Schneider
- Brettschneider, Matthias
- Bulat Ziganshin
- Don Stewart
- Xiao-Yong Jin