
I've just managed to build GHC 6.11 (thanks, Simon). I did this for two reasons, one of which is that I want to try to improve the speed of the AI for the Chu Shogi program I am writing by making use of parallel processing. I have a 4-core Xeon running Fedora Linux 10 (AMD64).

I have a repeatable scenario for timings. With GHC 6.10.1 and no threaded runtime, the computer takes 51 or 52 seconds to move (experimental variation up to 1.2 seconds). With GHC 6.11, it takes the same time. If I switch on the threaded runtime, the time rises slightly (perhaps not significantly - I measure 53-54 seconds). With 2, 3 or 4 processors, I measured (one-off, so not reliable) 65, 55 and 58 seconds respectively.

Well, that doesn't really matter, provided I can get the time back with interest by exploiting parallelism. My first thought for an "easy" win is the move generator. At present I have this code:

-- | Generate all legal moves from state.
--
-- The best moves (according to some heuristic) should come first.
generate_moves :: (Non_interactive_state, Maybe Move) -> [(Non_interactive_state, Maybe Move)]
generate_moves (state, _) =
    let colour = case is_black_to_play state of
            True -> Black
            False -> White
        pieces' = all_pieces_of_colour (board state) colour
        unsorted = concatMap (generate_moves_for_piece state) pieces'
        sorted = sortBy best_move unsorted
        moves = mapMaybe new_move sorted
    in {-trace (concat (intersperse "\n" (map (print_move . fromJust . snd) moves)))-} moves
  where
    new_move mv = case update_from_move (state, Nothing) mv of
        Nothing -> Nothing
        Just (st', Nothing) -> Just (st', Just mv)

My idea was to run the call to generate_moves_for_piece in parallel over the pieces' list. I thought I could do that simply by replacing concatMap with parFlatMap, but discovered I need to supply an extra argument - a Strategy. Not knowing what this should be (where's the best place to read up on this?), I tried specifying rwhnf.
My timings then are:

-N1  83 seconds
-N2  57 seconds
-N3  58 seconds
-N4  60 seconds

so a complete failure. Where should I go from here?
--
Colin Adams
Preston Lancashire

On Wednesday, 18 March 2009 15:28, Colin Paul Adams wrote:
I've just managed to build ghc 6.11 (Thanks Simon).
I did this for two reasons, one of which is that I want to try to improve the speed of the AI for the Chu Shogi program I am writing by making use of parallel processing. I have a 4-core Xeon running Fedora Linux 10 (AMD64).
I have a repeatable scenario for timings. With ghc 6.10.1, no threaded runtime, the computer takes 51 or 52 seconds to move (experimental variation up to 1.2 seconds).
With ghc 6.11, it takes the same time.
If I switch on the threaded runtime, then it rises slightly (perhaps not significantly - I measure 53-54 seconds). With 2, 3 or 4 processors, I measured (one-off, so not reliable) 65, 55 and 58 seconds respectively.
Well, that doesn't really matter, provided I can get the time back with interest by exploiting parallelism. My first thought for an "easy" win is the move generator. At present I have this code:
-- | Generate all legal moves from state.
--
-- The best moves (according to some heuristic) should come first.
generate_moves :: (Non_interactive_state, Maybe Move) -> [(Non_interactive_state, Maybe Move)]
generate_moves (state, _) =
    let colour = case is_black_to_play state of
            True -> Black
            False -> White
        pieces' = all_pieces_of_colour (board state) colour
        unsorted = concatMap (generate_moves_for_piece state) pieces'
        sorted = sortBy best_move unsorted
        moves = mapMaybe new_move sorted
    in {-trace (concat (intersperse "\n" (map (print_move . fromJust . snd) moves)))-} moves
  where
    new_move mv = case update_from_move (state, Nothing) mv of
        Nothing -> Nothing
        Just (st', Nothing) -> Just (st', Just mv)
and my idea was to run the call to generate_moves_for_piece in parallel over the pieces' list. I thought I could do that simply by replacing concatMap with parFlatMap, but discovered I need to supply an extra argument - a Strategy. Not knowing what this should be (where's the best place to read up on this?) I tried specifying rwhnf.
Control.Parallel.Strategies - docs and source?

generate_moves_for_piece produces a list. rwhnf forces this list only far enough to see whether it's [] or (_:_) (rwhnf x = x `seq` ()); that doesn't get enough work done in each thread to compensate for the overhead. Try using rnf after writing an NFData instance for Move. If, e.g.,

data Move = Move { from :: Position, to :: Position }

then the instance would be

instance NFData Move where
    rnf (Move f t) = rnf f `seq` rnf t `seq` ()

That might require NFData instances for Position and its components, but writing those should be mechanical.
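To make that concrete, here is a self-contained sketch. The real machinery lives in the parallel package's Control.Parallel.Strategies; the handful of definitions needed are inlined here in their 2009-era form so the example stands alone, and Move/Position are placeholders for the program's actual types:

```haskell
import GHC.Conc (par)

-- 2009-era definitions from Control.Parallel.Strategies, inlined so the
-- sketch stands alone:
type Strategy a = a -> ()

rwhnf :: Strategy a                  -- weak head normal form only
rwhnf x = x `seq` ()

class NFData a where
    rnf :: Strategy a                -- full normal form

instance NFData Int where
    rnf = rwhnf

instance (NFData a, NFData b) => NFData (a, b) where
    rnf (a, b) = rnf a `seq` rnf b `seq` ()

instance NFData a => NFData [a] where
    rnf = foldr (\x r -> rnf x `seq` r) ()

-- parMap strat f xs sparks strat (f x) for each element, so each spark
-- does as much evaluation as the strategy demands: rwhnf would only
-- expose the outermost constructor, rnf forces everything.
parMap :: Strategy b -> (a -> b) -> [a] -> [b]
parMap _     _ []       = []
parMap strat f (x : xs) = let y  = f x
                              ys = parMap strat f xs
                          in strat y `par` (y : ys)

-- Placeholders for the program's real types:
type Position = (Int, Int)

data Move = Move { from :: Position, to :: Position }

instance NFData Move where
    rnf (Move f t) = rnf f `seq` rnf t `seq` ()
```

With this in scope, parMap rnf fully evaluates each Move in its spark, whereas parMap rwhnf would leave the Position fields as thunks for the consuming thread to force.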
My timings then are:
-N1  83 seconds
-N2  57 seconds
-N3  58 seconds
-N4  60 seconds
so a complete failure.
Where should I go from here?
HTH, Daniel

"Daniel" == Daniel Fischer writes:
    Daniel> generate_moves_for_piece produces a list. rwhnf forces
    Daniel> this list enough to see if it's [] or (_:_) (rwhnf x = x
    Daniel> `seq` ()), that doesn't get enough work done in each
    Daniel> thread to compensate the overhead. Try using rnf after
    Daniel> writing an NFData instance for Move.

    Daniel> If e.g.
    Daniel> data Move = Move {from :: Position, to :: Position}
    Daniel> , the instance would be
    Daniel> instance NFData Move where
    Daniel>     rnf (Move f t) = rnf f `seq` rnf t `seq` ()

It made no difference to the timings whatsoever. Anyway, at that point I decided to revert to the first rule of performance tuning, and profiled the program for time. The move generator was using less than 3% of the CPU time, so that was clearly a waste of effort on my part.

The one routine that stood out was this one (about 35% CPU time, with 0% attributed to children):

-- | Value of one rank of the board
rank_value :: (Int, Array Int Square) -> Int
rank_value (rank_coord, rank') =
    sum (map (cell_value rank_coord) (assocs rank'))

The array has 12 elements. So I tried changing the map to parMap rnf. This gave timings of:

-N1  93 seconds
-N2 105 seconds
-N3 206 seconds

at which point I decided I hadn't got a clue what was going on.
--
Colin Adams
Preston Lancashire
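One plausible reading of those worsening numbers is spark granularity: each cell_value call is tiny, so sparking twelve of them per rank buys far more scheduling overhead than useful work. A common remedy (sketched here with base's GHC.Conc primitives and hypothetical chunk sizes; the function names are mine, not from the program) is to spark at a coarser grain, one chunk of work per spark rather than one element:

```haskell
import GHC.Conc (par, pseq)
import Data.List (foldl')

-- Split a list into chunks of n; each chunk becomes one spark, so the
-- work per spark grows with n instead of staying at one tiny call.
chunk :: Int -> [a] -> [[a]]
chunk _ [] = []
chunk n xs = let (h, t) = splitAt n xs in h : chunk n t

-- Sum f over xs with one spark per chunk rather than per element.
parSumChunked :: Int -> (a -> Int) -> [a] -> Int
parSumChunked n f xs = foldr join 0 (map sumChunk (chunk n xs))
  where
    sumChunk    = foldl' (\acc x -> acc + f x) 0  -- strict, sparkable unit
    join s acc  = s `par` (acc `pseq` (s + acc))  -- spark chunk, force rest
```

Even then, twelve cheap array reads may be below any profitable grain; the chunking only pays off when each spark does work comparable to the cost of scheduling it.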

On 2009 Mar 18, at 16:34, Colin Paul Adams wrote:
The one routine that stood out was this one (about 35% CPU time, with 0% attributed to children):
-- | Value of one rank of the board
rank_value :: (Int, Array Int Square) -> Int
rank_value (rank_coord, rank') =
    sum (map (cell_value rank_coord) (assocs rank'))
The array has 12 elements.
How many times do you call it? Perhaps the real optimization you need is to memoize.
--
brandon s. allbery [solaris,freebsd,perl,pugs,haskell]    allbery@kf8nh.com
system administrator [openafs,heimdal,too many hats]      allbery@ece.cmu.edu
electrical and computer engineering, carnegie mellon university    KF8NH

"Brandon" == Brandon S Allbery KF8NH writes:
>> The array has 12 elements.

Brandon> How many times do you call it? Perhaps the real
Brandon> optimization you need is to memoize.

Very many times indeed. But it is a different array on most occasions (I am not updating in place). It represents one rank of the board on which the game is played. The move generator produces a new board for each move.
--
Colin Adams
Preston Lancashire

On 2009 Mar 18, at 16:59, Colin Paul Adams wrote:
"Brandon" == Brandon S Allbery KF8NH writes:

>> The array has 12 elements.
Brandon> How many times do you call it? Perhaps the real Brandon> optimization you need is to memoize.
Very many times indeed. But it is a different array on most occasions (I am not updating in place).
It represents one rank of the board on which the game is played. The move generator produces a new board for each move.
It might be helpful for you to show more of the program. One thing to keep in mind about parallelization is that the level at which it happens matters: if the individual calculations across the list are fast, all you gain by parallelizing them is MP synchronization/locking overhead. On the other hand, it's possible that if the caller can be rearranged so that rank_value is computed on another CPU while the caller is doing something else, you could gain quite a bit. Or maybe the caller is itself at the right level to parallelize.
--
brandon s. allbery [solaris,freebsd,perl,pugs,haskell]    allbery@kf8nh.com
system administrator [openafs,heimdal,too many hats]      allbery@ece.cmu.edu
electrical and computer engineering, carnegie mellon university    KF8NH
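As an illustration of that point about choosing the level (a sketch only: Board, evaluate and successors are stand-ins, not the program's real definitions, and minimax alternation is omitted for brevity), one might spark the search of each top-level branch of the game tree rather than anything inside the leaf evaluation:

```haskell
import GHC.Conc (par, pseq)

-- Stand-ins for the real program's types and functions:
type Board = [Int]
type Score = Int

evaluate :: Board -> Score        -- cheap leaf evaluation: never worth a spark
evaluate = sum

successors :: Board -> [Board]    -- hypothetical move generator
successors b = [map (+ d) b | d <- [1 .. 3]]

-- One spark per top-level branch: each spark searches a whole subtree,
-- so it carries enough work to pay for its own scheduling overhead.
-- (Maximising player only; a real engine would alternate min and max.)
bestScore :: Int -> Board -> Score
bestScore 0 b = evaluate b
bestScore d b = parMaximum [bestScore (d - 1) b' | b' <- successors b]
  where
    parMaximum []       = error "no moves"
    parMaximum [x]      = x
    parMaximum (x : xs) = let rest = parMaximum xs
                          in x `par` (rest `pseq` max x rest)
```

The design choice is the one described above: parallelism is introduced where each unit of work is a whole subtree search, not where it is a single cell lookup.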

"Brandon" == Brandon S Allbery KF8NH
writes:
Brandon> On 2009 Mar 18, at 16:59, Colin Paul Adams wrote:

(I am not updating in place). The move generator produces a new board for each move.
Well, this is sound design, but current memory managers may not be up to it. If you check the (board) game programming literature, you'll find that engine authors take great efforts to bypass automatic memory management (they do all updates on the "current" board in place, and write their own malloc/free for game tree nodes).

This becomes even more important when trying to exploit concurrency. In theory it's all very nice (you can evaluate independent branches of the game tree in parallel), but in practice you're doomed if your memory allocator/collector is still essentially single-threaded (that is, if it blocks all threads). That's language-independent (it's a property of the run-time system). Of course, in a more declarative language it should be easier for the compiler to analyse the source code and replace malloc/free with something better (no allocation by immediate re-use, or easy deallocation by some stack regime).

J.W.

--
View this message in context: http://www.nabble.com/Advice-wanted-on-parallel-processing-tp22580444p225917...
Sent from the Haskell - Glasgow-haskell-users mailing list archive at Nabble.com.
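For reference, the in-place style described here is expressible in Haskell without giving up purity at the interface. A hedged sketch using the array package's Data.Array.ST (the Int-encoded board and the withMove helper are placeholders of my own, not the program's code) of the classic make/unmake pattern:

```haskell
import Control.Monad.ST (ST, runST)
import Data.Array.ST (STUArray, newListArray, readArray, writeArray)

type Square  = Int                 -- placeholder encoding: 0 = empty, n = piece id
type Board s = STUArray s Int Square

-- Apply a move destructively, run a scoring action, then undo the move:
-- the make/unmake pattern from the game-engine literature, which avoids
-- allocating a fresh board for every generated move.
withMove :: Board s -> Int -> Int -> (Board s -> ST s a) -> ST s a
withMove b fromSq toSq score = do
    p        <- readArray b fromSq
    captured <- readArray b toSq
    writeArray b fromSq 0          -- make the move
    writeArray b toSq p
    r <- score b
    writeArray b toSq captured     -- unmake: restore both squares
    writeArray b fromSq p
    return r

-- Example: score one move on a tiny 4-square board; runST keeps the
-- mutation invisible from outside.
demo :: (Square, [Square])
demo = runST $ do
    b <- newListArray (0, 3) [7, 0, 0, 0]
    v <- withMove b 0 3 (\bd -> readArray bd 3)
    restored <- mapM (readArray b) [0 .. 3]
    return (v, restored)
```

Whether this helps under the threaded runtime still depends on the allocator/collector behaviour discussed above; the sketch only shows that per-move allocation is avoidable.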

j.waldmann writes:
>> (I am not updating in place). The move generator produces a
>> new board for each move.

>> Well, this is sound design, but current memory managers may not
>> be up to it. If you check the (board) game programming
>> literature, you'll find that engine authors take great efforts
>> to bypass automatic memory management (do all updates on the
>> "current" board in-place, and write their own malloc/free for
>> game tree nodes).

I know it. In fact I wrote a program for this game about 10 years ago, in C++, which did all updates in place. It wasn't very good (although, being the only one that implemented the rules correctly - as far as I could tell; they are very complicated - it was necessarily the best). Now I want to have another go in Haskell. Reading about DPH inspired me to give it a try, although I'm not attempting to use DPH yet. Probably I was too naive in thinking I was going to be able to exploit parallelism without pain.

>> This becomes even more important when trying to exploit
>> concurrency. In theory, that's all very nice (you can evaluate
>> independent branches of the game tree in parallel) but in
>> practice, you're doomed if your memory allocator/collector is
>> still essentially single-threaded (that is, blocks all
>> threads).

OK, that's interesting. GHC's collector is still stop-the-world at the moment, right? I read in the recent paper that it is intended to remedy that soon, in which case I can try again to see if that improves matters.
--
Colin Adams
Preston Lancashire
participants (4):
- Brandon S. Allbery KF8NH
- Colin Paul Adams
- Daniel Fischer
- j.waldmann