I spent four hours investigating this problem!  Thank you very much for the excellent brainfood, and for challenging Haskell's claim to be rawkin' at parallelism.  I think, though it took much experimentation, that I have confirmed that it is :-)

On Sun, Feb 1, 2009 at 9:26 PM, John D. Ramsdell <ramsdell0@gmail.com> wrote:
>
> I have a reduction system in which a rule takes a term and returns a
> set of terms.  The reduction system creates a tree that originates at
> a starting value called the root.  For most problems, the reduction
> system terminates, but a step count limit protects from
> non-termination.

That's typically a bad idea.  Instead, use laziness to protect from nontermination.  For example, in this case, we can output a collection of items lazily, and then take a finite amount of the output (or check whether the output is longer than some length), without having to evaluate all of it.
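
To make that concrete, here's a tiny sketch of consumer-bounded evaluation (these helpers are just for exposition, and aren't part of the literate program below):

    -- An infinite list of reduction steps, produced lazily.
    steps :: (a -> a) -> a -> [a]
    steps = iterate

    -- The consumer bounds the work: only 100 steps are ever
    -- computed here, with no step counter in sight.
    firstHundred :: [Int]
    firstHundred = take 100 (steps (subtract 1) 1000000)

    -- Checking "is the output longer than n?" is equally safe.
    longerThan :: Int -> [a] -> Bool
    longerThan n = not . null . drop n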

Here's my writeup of my solution, in literate Haskell.  It doesn't output the exact same structure as yours, but hopefully you can see how to tweak it to do so.

> {-# LANGUAGE RankNTypes #-}
>
> import qualified Data.MemoCombinators as Memo
> import qualified Data.Set as Set
> import Control.Parallel (par)
> import qualified Control.Parallel.Strategies as Par
> import Data.Monoid (Monoid(..))
> import Control.Monad.State
> import qualified Data.DList as DList
> import Debug.Trace
> import Control.Concurrent

First, I want to capture the idea of a generative set like the one you're using.  GenSet is like a set, with the constructor "genset x xs", which says "if x is in the set, then so are xs".  I'll represent it as a stateful computation: the state is the set of things we've seen so far, and the result is the list of new things added.  That's redundant information, but sets can't be consumed lazily, thus the list (the set will follow along lazily :-).

Remember that State s a is just the function (s -> (s,a)).  So we're taking the set of things we've seen so far, and returning the new elements added together with the set unioned with those elements.

> newtype GenSet a
>     = GenSet (State (Set.Set a) (DList.DList a))
>
> genset :: (Ord a) => a -> GenSet a -> GenSet a
> genset x (GenSet f) = GenSet $ do
>     seen <- gets (x `Set.member`)
>     if seen
>       then return mempty
>       else fmap (DList.cons x) $
>              modify (Set.insert x) >> f
>
> toList :: GenSet a -> [a]
> toList (GenSet f) = DList.toList $ evalState f Set.empty

GenSet is a monoid, where mappend is just union.

> instance (Ord a) => Monoid (GenSet a) where
>     mempty = GenSet (return mempty)
>     mappend (GenSet a) (GenSet b) =
>         GenSet (liftM2 mappend a b)
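
A quick sanity check that the deduplication works: the second 3 below is filtered out, because by the time we reach it, 3 has already been seen.

> demoGenSet :: [Int]
> demoGenSet = toList $
>     genset 1 (genset 2 (genset 3 mempty) `mappend` genset 3 mempty)
> -- demoGenSet == [1,2,3]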

Okay, so that's how we avoid exponential behavior when traversing the tree.  We can now just toss around GenSets like they're sets, and everything will be peachy.

Here's the heart of the algorithm: the reduce function.  To avoid recomputation of rules, we could just memoize the rule function.  But we'll do something a little more clever.  The function we'll memoize ("parf") first sparks a thread computing its *last* child.  Because the search is depth-first, it will typically be a while until we get to the last one, so we benefit from the spark (you don't want to spark a thread computing something you're about to compute anyway).

> reduce :: (Ord a) => Memo.Memo a -> (a -> [a]) -> a -> [a]
> reduce memo f x = toList (makeSet x)
>   where
>     makeSet x = genset x . mconcat . map makeSet . f' $ x
>     f' = memo parf
>     parf a = let ch = f a in
>              ch `seq` (f' (last ch) `par` ch)
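
The shape of parf is the usual force-then-spark idiom; as a standalone illustration (a made-up helper, not part of the program):

    -- Force x, spark y to be evaluated in parallel for later,
    -- and hand x back.
    sparkLater :: a -> b -> a
    sparkLater x y = x `seq` (y `par` x)

So parf a is essentially sparkLater ch (f' (last ch)), with ch = f a.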

The ch `seq` is there so that the evaluations of ch and last ch aren't competing with each other.

Your example had a few problems.  You said the rule was supposed to be expensive, but yours was cheap.  Also, [x-1, x-2, x-3] are all very near each other, so it's hard to go off and do unrelated work.  I made a fake expensive function before computing the neighbors, and tossed in some prime numbers to scatter the space more.

> rule :: Int -> [Int]
> rule n = expensive `seq`
>          [next 311 4, next 109 577, next 919 353]
>   where
>     next x y = (x * n + y) `mod` 5000
>     expensive = sum [1..50*n]
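
For example, in ghci (the fake work just gets forced and thrown away):

    *Main> rule 1
    [315,686,1272]
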
>
> main :: IO ()
> main = do
>     let r = reduce Memo.integral rule 1
>     print (length r)

The results are quite promising:

    % ghc --make -O2 rules2 -threaded
    % time ./rules2
    5000
    ./rules2  13.25s user 0.08s system 99% cpu 13.396 total
    % time ./rules2 +RTS -N2
    5000
    ./rules2 +RTS -N2  12.52s user 0.30s system 159% cpu 8.015 total

That's a 40% decrease in running time!  Woot!  I'd love to see what it does on a machine with more than 2 cores.

Enjoy!
Luke