Performance of concurrent array access

Hi, I'm getting my feet wet writing concurrent programs in Haskell with GHC for multicore machines. As a first step I decided to write a program that reads and writes concurrently to an IOArray, with no synchronization whatsoever. I'm doing this to establish a baseline to compare against the performance of other data structures that do use appropriate synchronization mechanisms. I ran into some surprising results, namely that in many cases I am not getting any speedup at all. Here are the details...

I wrote a couple of variations on a single program. The main idea is that I wrote a DirectAddressTable data structure, which is simply a wrapper around an IOArray providing insert and lookup methods:

-- file DirectAddressTable.hs
module DirectAddressTable
  ( DAT
  , newDAT
  , lookupDAT
  , insertDAT
  , getAssocsDAT
  ) where

import Data.Array.IO
import Data.Array.MArray
import Data.Int (Int32)

data DAT = DAT (IOArray Int32 Char)

-- Create a fixed-size array; missing keys have value '-'.
newDAT :: Int32 -> IO DAT
newDAT n = do
  a <- newArray (0, n - 1) '-'
  return (DAT a)

-- Look up an item.
lookupDAT :: DAT -> Int32 -> IO (Maybe Char)
lookupDAT (DAT a) i = do
  c <- readArray a i
  return (if c == '-' then Nothing else Just c)

-- Insert an item.
insertDAT :: DAT -> Int32 -> Char -> IO ()
insertDAT (DAT a) i v = writeArray a i v

-- Get all associations (excluding missing items, i.e. those whose value is '-').
getAssocsDAT :: DAT -> IO [(Int32, Char)]
getAssocsDAT (DAT a) = do
  assocs <- getAssocs a
  return [ (k, c) | (k, c) <- assocs, c /= '-' ]

I then have a main program that initializes a new table and forks some threads, each thread writing and reading some fixed number of values to the just-initialized table. The overall number of elements to write is fixed. The number of threads is taken from a command-line argument, and the elements to process are divided evenly among the threads.

-- file DirectTableTest.hs
import DirectAddressTable
import Data.Int (Int32)
import Control.Concurrent
import Control.Parallel
import System.Environment

main = do
  args <- getArgs
  let numThreads = read (args !! 0)
  vs <- sequence (replicate numThreads newEmptyMVar)
  a  <- newDAT arraySize
  sequence_ [ forkIO (doLotsOfStuff numThreads i a >>= putMVar v)
            | (i, v) <- zip [1..] vs ]
  sequence_ [ takeMVar v >>= \a -> getAssocsDAT a >>= \xs -> print (last xs)
            | v <- vs ]

doLotsOfStuff :: Int -> Int -> DAT -> IO DAT
doLotsOfStuff numThreads i a = do
  let p j c = insertDAT a j c >> lookupDAT a j >>= \v -> v `pseq` return ()
  sequence_ [ p j c | (j, c) <- bunchOfKeys i ]
  return a
  where
    bunchOfKeys i = take numElems $ zip cyclicIndices $ drop i cyclicChars
    numElems      = numberOfElems `div` numThreads
    cyclicIndices = cycle [0 .. highestIndex]
    cyclicChars   = cycle chars
    chars         = ['a' .. 'z']

-- Parameters
arraySize :: Int32
arraySize = 100

highestIndex  = arraySize - 1
numberOfElems = 10 * 1000 * 1000

I compiled this with "ghc --make -rtsopts -threaded -fforce-recomp -O2 DirectTableTest.hs". Running "time ./DirectTableTest 1 +RTS -N1" takes about 1.4 seconds, and running "time ./DirectTableTest 2 +RTS -N2" takes about 2.0 seconds! Using one more core than worker threads is a little better: "time ./DirectTableTest 1 +RTS -N2" and "time ./DirectTableTest 2 +RTS -N3" both take about 1.4 seconds. Running with the "-N2 -s" options shows that productivity is 95.4% and GC is 4.3%. Looking at a run of the program with ThreadScope I don't see anything too alarming.
Each HEC yields once per ms, when a GC occurs. Running with 4 cores gives a time of about 1.2 seconds, which is at least a little better than 1 core. More cores don't improve on this.

I found that changing the array type used in the implementation of DirectAddressTable from IOArray to IOUArray fixes this problem. With this change, "time ./DirectTableTest 1 +RTS -N1" takes about 1.4 seconds, whereas "time ./DirectTableTest 2 +RTS -N2" takes about 1.0 seconds. Increasing to 4 cores gives a run time of 0.55 seconds. Running with "-s" shows a GC time of 3.9%. Under ThreadScope I can see that both threads yield every 0.4 ms, more frequently than in the previous program.

Finally, I tried one more variation. Instead of having the threads work on the same shared array, I had each thread work on its own array. This scales nicely (as you would expect), more or less like the second program, with either IOArray or IOUArray implementing the DirectAddressTable data structure.

I understand why IOUArray might perform better than IOArray, but I don't know why it scales better to multiple threads and cores. Does anyone know why this might be happening, or what I can do to find out what is going on?

Regards,
Andreas
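For concreteness, the IOUArray version described above amounts to a one-line change to DirectAddressTable.hs; the MArray operations used (newArray, readArray, writeArray, getAssocs) are overloaded, so the rest of the module can stay as-is. A sketch:

import Data.Array.IO   -- exports IOUArray alongside IOArray
import Data.Int (Int32)

-- The only line of the module that changes:
data DAT = DAT (IOUArray Int32 Char)  -- unboxed: writes store the Char itself, never a thunk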

On 23/08/2011 09:04 PM, Andreas Voellmy wrote:
> I compiled this with "ghc --make -rtsopts -threaded -fforce-recomp -O2 DirectTableTest.hs". Running "time ./DirectTableTest 1 +RTS -N1" takes about 1.4 seconds, and running "time ./DirectTableTest 2 +RTS -N2" takes about 2.0 seconds!

> I found that changing the array type used in the implementation of DirectAddressTable from IOArray to IOUArray fixes this problem. With this change, "time ./DirectTableTest 1 +RTS -N1" takes about 1.4 seconds, whereas "time ./DirectTableTest 2 +RTS -N2" takes about 1.0 seconds. Increasing to 4 cores gives a run time of 0.55 seconds.

> Finally, I tried one more variation. Instead of having the threads work on the same shared array, I had each thread work on its own array. This scales nicely (as you would expect), more or less like the second program, with either IOArray or IOUArray implementing the DirectAddressTable data structure.

> I understand why IOUArray might perform better than IOArray, but I don't know why it scales better to multiple threads and cores. Does anyone know why this might be happening or what I can do to find out what is going on?
I haven't deeply studied your code. However, I'm going to take a guess that this has to do with strictness. By using an IOArray, you're probably filling each cell with a reference to "drop 7 cyclicChars" or similar, which then only gets evaluated (in one thread) when you call "print". By using an IOUArray, you're definitely forcing each character to be computed right away, by the thread doing the writing. That's /probably/ what the difference is. As a guess. (Not sure how you can easily prove/disprove the theory though.)

You don't say which GHC version, but AFAIK recent releases of GHC have a separate heap per thread (or was it per capability?), which probably makes a difference if you start giving each thread its own array. That, and just plain ordinary cache coherence issues...
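One way to check that theory: force each value in the writing thread before it goes into the table, and see whether the IOArray timings then match the IOUArray ones. A minimal sketch, assuming the DirectAddressTable module from the first message (insertForcedDAT is a made-up helper name):

import Control.Exception (evaluate)
import Data.Int (Int32)
import DirectAddressTable

-- Evaluate c in the calling thread (for Char, WHNF is full evaluation),
-- then insert it, so the boxed array never holds a thunk.
insertForcedDAT :: DAT -> Int32 -> Char -> IO ()
insertForcedDAT a i c = evaluate c >> insertDAT a i c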

On Tue, Aug 23, 2011 at 16:04, Andreas Voellmy wrote:
> I found that changing the array type used in the implementation of DirectAddressTable from IOArray to IOUArray fixes this problem.
Since the main observable effect of this change is strictness, I'd immediately suspect you're storing unevaluated thunks in the array, which get forced when printed instead of in their appropriate threads.

--
brandon s allbery  allbery.b@gmail.com
wandering unix systems administrator (available)  (412) 475-9364 vm/sms

On Tue, Aug 23, 2011 at 10:04 PM, Andreas Voellmy wrote:
> data DAT = DAT (IOArray Int32 Char)
Try to make this a newtype instead. The data type adds a level of indirection.
> do let p j c = insertDAT a j c >> lookupDAT a j >>= \v -> v `pseq` return ()
You most likely want (insertDAT a j $! c) to make sure that the element is forced, to avoid thunks building up in the array.
> -- Parameters
> arraySize :: Int32
Int might work better than Int32. While they should behave the same on 32-bit machines, Int might have a few more rewrite rules that make it optimize better.

-- Johan
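Put together, the two suggestions might look like this (a sketch against the module from the first message; plain seq suffices for Char, which has no nested structure):

import Data.Array.IO

newtype DAT = DAT (IOArray Int Char)   -- newtype: no extra box around the array

insertDAT :: DAT -> Int -> Char -> IO ()
insertDAT (DAT a) i v = v `seq` writeArray a i v   -- force the element before storing it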

Thanks for the suggestions. I tried to add strictness in the following ways:

(1) changing "insertDAT a j c" to "insertDAT a j $! c"
(2) changing "insertDAT a j c" to "deepseq c (insertDAT a j c)"

I also used Int instead of Int32 throughout and changed the DAT data type to a newtype definition. These changes improved the performance slightly, but the multithreaded runs still perform significantly worse than the single-threaded runs, by about the same amount as before (i.e. 0.5 seconds more for the 2-core run than for the 1-core run).

I used GHC 7.0.3 for the performance measurements I gave in my message. I've also tried under 7.2.1, and I get basically the same behavior there.

--Andreas
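Spelled out, the two variants are as follows (a sketch assuming the revised, Int-keyed DirectAddressTable; the names insertStrict1 and insertStrict2 are made up for illustration, and for Char deepseq coincides with seq, so (1) and (2) should behave identically):

import Control.DeepSeq (deepseq)   -- from the deepseq package
import DirectAddressTable

insertStrict1, insertStrict2 :: DAT -> Int -> Char -> IO ()
insertStrict1 a j c = insertDAT a j $! c            -- (1) force c with ($!)
insertStrict2 a j c = c `deepseq` insertDAT a j c   -- (2) force c with deepseq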

One more observation... I tried a third variation, in which the test program still uses a single shared IOArray but each thread writes to different indices in the array. In this case I get good scaling, with performance similar to the use of IOUArray. In detail, I made the following two changes to give each thread a disjoint set of indices to write to:

bunchOfKeys threadNum =
  take numElems $ zip (cycle $ indices numThreads threadNum) $ drop threadNum cyclicChars

and

indices :: Int -> Int -> [Int]
indices numThreads threadNum =
  let numixs     = arraySize `div` numThreads
      startIx    = numixs * threadNum
      allIndices = [0 .. highestIndex]
  in take numixs $ drop startIx allIndices

--Andreas

I should have double-checked my work before I sent the last message; I accidentally benchmarked the wrong program. It turns out that the modifications I last described do not improve the scaling of the program to more cores when used with IOArray. And there was a bug: the line "startIx = numixs * threadNum" should have been "startIx = numixs * (threadNum - 1)".

--Andreas
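With that fix, the partitioning reads as follows (a sketch, assuming the Int-valued parameters from the revised program; the test driver numbers threads from 1, so without the "- 1" every thread's block is shifted one block up: block 0 is never written and the last thread's block runs past the end of the array):

arraySize :: Int
arraySize = 100

highestIndex :: Int
highestIndex = arraySize - 1

indices :: Int -> Int -> [Int]
indices numThreads threadNum =
  let numixs     = arraySize `div` numThreads
      startIx    = numixs * (threadNum - 1)   -- corrected: threadNum starts at 1
      allIndices = [0 .. highestIndex]
  in take numixs $ drop startIx allIndices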
participants (4)
- Andreas Voellmy
- Andrew Coppin
- Brandon Allbery
- Johan Tibell