
Haskell is conspicuously absent from the languages used to tackle Tim Bray's Wide Finder problem (http://www.tbray.org/ongoing/When/200x/ 2007/10/30/WF-Results?updated). So far we have Ocaml, Erlang, Python, Ruby, etc... Bryan quickly wrote a program on his blog (http://www.serpentine.com/ blog/2007/09/25/what-the-heck-is-a-wide-finder-anyway/) that would place Haskell right in the second position. JoCaml is the fastest so far (http://eigenclass.org/hiki.rb?fast- widefinder)... Can Haskell do better ? Care to take a shot ? Manu

From: haskell-cafe-bounces@haskell.org [mailto:haskell-cafe-bounces@haskell.org] On Behalf Of manu
Haskell is conspicuously absent from the languages used to tackle Tim
Bray's Wide Finder problem
(http://www.tbray.org/ongoing/When/200x/2007/10/30/WF-Results?updated).
So far we have Ocaml, Erlang, Python, Ruby, etc...
Tim Bray mentions that GHC won't build on Solaris, so presumably that problem would need to be solved before Haskell appears in his table. I see that there are Solaris binary packages: http://www.haskell.org/ghc/download_ghc_661.html#sparcsolaris so perhaps he just needs to be pointed to them? Alistair ***************************************************************** Confidentiality Note: The information contained in this message, and any attachments, may contain confidential and/or privileged material. It is intended solely for the person(s) or entity to which it is addressed. Any review, retransmission, dissemination, or taking of any action in reliance upon this information by persons or entities other than the intended recipient(s) is prohibited. If you received this in error, please contact the sender and delete the material from any computer. *****************************************************************

I hacked together a version that I'm pretty happy with today. Started off trying an algorithm with channels and forking, then realized that in Haskell thanks to referential transparency we can get parallelism almost for free, and redid it all in Control.Parallel (below). Unfortunately, I don't have a multicore processor so I can't put this through any special paces. However, its compactness and expressively match or beat the simple Ruby, etc. scripts while it gets (theoretically) most of the parallel benefits of the enormous and unwieldy Erlang and JOcaml ones. --S module Main where import qualified Data.ByteString.Lazy.Char8 as LB import Data.List (foldl', unfoldr, insertBy) import qualified Data.Map as M import System.Environment (getArgs) import Control.Parallel (par) import Control.Parallel.Strategies (parMap, rwhnf) count :: M.Map LB.ByteString Int -> LB.ByteString -> M.Map LB.ByteString Int count m line = if LB.pack "/ongoing/When" `LB.isPrefixOf` myLn then M.insertWith' (+) (LB.drop 14 myLn) 1 m else m where myLn = (LB.takeWhile (/=' ') . LB.dropWhile (/='/') . LB.dropWhile (/='\"')) line mapUnionPar :: (Ord k, Num a) => [M.Map k a] -> M.Map k a mapUnionPar m = head $ until (null . tail) mapUnionPar' m where a |:| b = par a . par b $ a : b mapUnionPar' (x:x':xs) = (M.unionWith (+) x x' |:| mapUnionPar' xs) mapUnionPar' x = x newPar :: FilePath -> IO (M.Map LB.ByteString Int) newPar = ((mapUnionPar . parMap rwhnf (foldl' count M.empty) . chunkify . LB.lines) `fmap`) . LB.readFile where chunkify = unfoldr (\x -> if null x then Nothing else Just (splitAt 512 x)) main = mapM_ ((print . fst . foldl' takeTop ([],[]) . M.toList =<<) . newPar) =<< getArgs where takeTop ac@(bs,low) a = if null low || (snd . head) low < snd a then (splitAt 10 . insertBy ((. snd) . flip compare . snd) a) bs else ac On Nov 7, 2007, at 9:06 AM, Bayley, Alistair wrote:
From: haskell-cafe-bounces@haskell.org [mailto:haskell-cafe-bounces@haskell.org] On Behalf Of manu
Haskell is conspicuously absent from the languages used to tackle Tim
Bray's Wide Finder problem
(http://www.tbray.org/ongoing/When/200x/2007/10/30/WF-Results? updated).
So far we have Ocaml, Erlang, Python, Ruby, etc...
Tim Bray mentions that GHC won't build on Solaris, so presumably that problem would need to be solved before Haskell appears in his table. I see that there are Solaris binary packages: http://www.haskell.org/ghc/download_ghc_661.html#sparcsolaris
so perhaps he just needs to be pointed to them?
Alistair ***************************************************************** Confidentiality Note: The information contained in this message, and any attachments, may contain confidential and/or privileged material. It is intended solely for the person(s) or entity to which it is addressed. Any review, retransmission, dissemination, or taking of any action in reliance upon this information by persons or entities other than the intended recipient(s) is prohibited. If you received this in error, please contact the sender and delete the material from any computer. ***************************************************************** _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

Sterling Clover wrote:
I hacked together a version that I'm pretty happy with today. Started off trying an algorithm with channels and forking, then realized that in Haskell thanks to referential transparency we can get parallelism almost for free, and redid it all in Control.Parallel (below). Unfortunately, I don't have a multicore processor so I can't put this through any special paces. However, its compactness and expressively match or beat the simple Ruby, etc. scripts while it gets (theoretically) most of the parallel benefits of the enormous and unwieldy Erlang and JOcaml ones.
--S
module Main where import qualified Data.ByteString.Lazy.Char8 as LB import Data.List (foldl', unfoldr, insertBy) import qualified Data.Map as M import System.Environment (getArgs) import Control.Parallel (par) import Control.Parallel.Strategies (parMap, rwhnf)
count :: M.Map LB.ByteString Int -> LB.ByteString -> M.Map LB.ByteString Int count m line = if LB.pack "/ongoing/When" `LB.isPrefixOf` myLn then M.insertWith' (+) (LB.drop 14 myLn) 1 m else m where myLn = (LB.takeWhile (/=' ') . LB.dropWhile (/='/') . LB.dropWhile (/='\"')) line
mapUnionPar :: (Ord k, Num a) => [M.Map k a] -> M.Map k a mapUnionPar m = head $ until (null . tail) mapUnionPar' m where a |:| b = par a . par b $ a : b mapUnionPar' (x:x':xs) = (M.unionWith (+) x x' |:| mapUnionPar' xs) mapUnionPar' x = x
newPar :: FilePath -> IO (M.Map LB.ByteString Int) newPar = ((mapUnionPar . parMap rwhnf (foldl' count M.empty) . chunkify . LB.lines) `fmap`) . LB.readFile where chunkify = unfoldr (\x -> if null x then Nothing else Just (splitAt 512 x))
main = mapM_ ((print . fst . foldl' takeTop ([],[]) . M.toList =<<) . newPar) =<< getArgs where takeTop ac@(bs,low) a = if null low || (snd . head) low < snd a then (splitAt 10 . insertBy ((. snd) . flip compare . snd) a) bs else ac
On Nov 7, 2007, at 9:06 AM, Bayley, Alistair wrote:
From: haskell-cafe-bounces@haskell.org [mailto:haskell-cafe-bounces@haskell.org] On Behalf Of manu
Haskell is conspicuously absent from the languages used to tackle Tim
Bray's Wide Finder problem
(http://www.tbray.org/ongoing/When/200x/2007/10/30/WF-Results?updated).
So far we have Ocaml, Erlang, Python, Ruby, etc...
Tim Bray mentions that GHC won't build on Solaris, so presumably that problem would need to be solved before Haskell appears in his table. I see that there are Solaris binary packages: http://www.haskell.org/ghc/download_ghc_661.html#sparcsolaris
so perhaps he just needs to be pointed to them?
Alistair ***************************************************************** Confidentiality Note: The information contained in this message, and any attachments, may contain confidential and/or privileged material. It is intended solely for the person(s) or entity to which it is addressed. Any review, retransmission, dissemination, or taking of any action in reliance upon this information by persons or entities other than the intended recipient(s) is prohibited. If you received this in error, please contact the sender and delete the material from any computer. ***************************************************************** _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe You didn't do a whole lot when I tried to run it. I know I am being mean, but that seems to be what Tim Bray is doing. He takes code and if it doesnt work, he isn't spending 3 weeks to figure it out.
So, I would just like to comment. I ran your code against an access.log file and it gave me this: [] ./a.out access.log

Um... you do realize that the code is only supposed to match against very specific lines in sample data sets that Bray provides, right? If your access log doesn't have lines exactly like those (and why would it?) then there's no reason to expect a result. --S On Nov 9, 2007, at 11:19 PM, Berlin Brown wrote:
Sterling Clover wrote:
I hacked together a version that I'm pretty happy with today. Started off trying an algorithm with channels and forking, then realized that in Haskell thanks to referential transparency we can get parallelism almost for free, and redid it all in Control.Parallel (below). Unfortunately, I don't have a multicore processor so I can't put this through any special paces. However, its compactness and expressively match or beat the simple Ruby, etc. scripts while it gets (theoretically) most of the parallel benefits of the enormous and unwieldy Erlang and JOcaml ones.
--S
module Main where import qualified Data.ByteString.Lazy.Char8 as LB import Data.List (foldl', unfoldr, insertBy) import qualified Data.Map as M import System.Environment (getArgs) import Control.Parallel (par) import Control.Parallel.Strategies (parMap, rwhnf)
count :: M.Map LB.ByteString Int -> LB.ByteString -> M.Map LB.ByteString Int count m line = if LB.pack "/ongoing/When" `LB.isPrefixOf` myLn then M.insertWith' (+) (LB.drop 14 myLn) 1 m else m where myLn = (LB.takeWhile (/=' ') . LB.dropWhile (/='/') . LB.dropWhile (/='\"')) line
mapUnionPar :: (Ord k, Num a) => [M.Map k a] -> M.Map k a mapUnionPar m = head $ until (null . tail) mapUnionPar' m where a |:| b = par a . par b $ a : b mapUnionPar' (x:x':xs) = (M.unionWith (+) x x' |:| mapUnionPar' xs) mapUnionPar' x = x
newPar :: FilePath -> IO (M.Map LB.ByteString Int) newPar = ((mapUnionPar . parMap rwhnf (foldl' count M.empty) . chunkify . LB.lines) `fmap`) . LB.readFile where chunkify = unfoldr (\x -> if null x then Nothing else Just (splitAt 512 x))
main = mapM_ ((print . fst . foldl' takeTop ([],[]) . M.toList =<<) . newPar) =<< getArgs where takeTop ac@(bs,low) a = if null low || (snd . head) low < snd a then (splitAt 10 . insertBy ((. snd) . flip compare . snd) a) bs else ac
On Nov 7, 2007, at 9:06 AM, Bayley, Alistair wrote:
From: haskell-cafe-bounces@haskell.org [mailto:haskell-cafe-bounces@haskell.org] On Behalf Of manu
Haskell is conspicuously absent from the languages used to tackle Tim
Bray's Wide Finder problem
(http://www.tbray.org/ongoing/When/200x/2007/10/30/WF-Results? updated).
So far we have Ocaml, Erlang, Python, Ruby, etc...
Tim Bray mentions that GHC won't build on Solaris, so presumably that problem would need to be solved before Haskell appears in his table. I see that there are Solaris binary packages: http://www.haskell.org/ghc/download_ghc_661.html#sparcsolaris
so perhaps he just needs to be pointed to them?
Alistair ***************************************************************** Confidentiality Note: The information contained in this message, and any attachments, may contain confidential and/or privileged material. It is intended solely for the person(s) or entity to which it is addressed. Any review, retransmission, dissemination, or taking of any action in reliance upon this information by persons or entities other than the intended recipient(s) is prohibited. If you received this in error, please contact the sender and delete the material from any computer. ***************************************************************** _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe You didn't do a whole lot when I tried to run it. I know I am being mean, but that seems to be what Tim Bray is doing. He takes code and if it doesnt work, he isn't spending 3 weeks to figure it out.
So, I would just like to comment. I ran your code against an access.log file and it gave me this:
[]
./a.out access.log

Sterling Clover wrote:
Um... you do realize that the code is only supposed to match against very specific lines in sample data sets that Bray provides, right? If your access log doesn't have lines exactly like those (and why would it?) then there's no reason to expect a result.
--S
On Nov 9, 2007, at 11:19 PM, Berlin Brown wrote:
Sterling Clover wrote:
I hacked together a version that I'm pretty happy with today. Started off trying an algorithm with channels and forking, then realized that in Haskell thanks to referential transparency we can get parallelism almost for free, and redid it all in Control.Parallel (below). Unfortunately, I don't have a multicore processor so I can't put this through any special paces. However, its compactness and expressively match or beat the simple Ruby, etc. scripts while it gets (theoretically) most of the parallel benefits of the enormous and unwieldy Erlang and JOcaml ones.
--S
module Main where import qualified Data.ByteString.Lazy.Char8 as LB import Data.List (foldl', unfoldr, insertBy) import qualified Data.Map as M import System.Environment (getArgs) import Control.Parallel (par) import Control.Parallel.Strategies (parMap, rwhnf)
count :: M.Map LB.ByteString Int -> LB.ByteString -> M.Map LB.ByteString Int count m line = if LB.pack "/ongoing/When" `LB.isPrefixOf` myLn then M.insertWith' (+) (LB.drop 14 myLn) 1 m else m where myLn = (LB.takeWhile (/=' ') . LB.dropWhile (/='/') . LB.dropWhile (/='\"')) line
mapUnionPar :: (Ord k, Num a) => [M.Map k a] -> M.Map k a mapUnionPar m = head $ until (null . tail) mapUnionPar' m where a |:| b = par a . par b $ a : b mapUnionPar' (x:x':xs) = (M.unionWith (+) x x' |:| mapUnionPar' xs) mapUnionPar' x = x
newPar :: FilePath -> IO (M.Map LB.ByteString Int) newPar = ((mapUnionPar . parMap rwhnf (foldl' count M.empty) . chunkify . LB.lines) `fmap`) . LB.readFile where chunkify = unfoldr (\x -> if null x then Nothing else Just (splitAt 512 x))
main = mapM_ ((print . fst . foldl' takeTop ([],[]) . M.toList =<<) . newPar) =<< getArgs where takeTop ac@(bs,low) a = if null low || (snd . head) low < snd a then (splitAt 10 . insertBy ((. snd) . flip compare . snd) a) bs else ac
On Nov 7, 2007, at 9:06 AM, Bayley, Alistair wrote:
From: haskell-cafe-bounces@haskell.org [mailto:haskell-cafe-bounces@haskell.org] On Behalf Of manu
Haskell is conspicuously absent from the languages used to tackle Tim
Bray's Wide Finder problem
(http://www.tbray.org/ongoing/When/200x/2007/10/30/WF-Results?updated).
So far we have Ocaml, Erlang, Python, Ruby, etc...
Tim Bray mentions that GHC won't build on Solaris, so presumably that problem would need to be solved before Haskell appears in his table. I see that there are Solaris binary packages: http://www.haskell.org/ghc/download_ghc_661.html#sparcsolaris
so perhaps he just needs to be pointed to them?
Alistair ***************************************************************** Confidentiality Note: The information contained in this message, and any attachments, may contain confidential and/or privileged material. It is intended solely for the person(s) or entity to which it is addressed. Any review, retransmission, dissemination, or taking of any action in reliance upon this information by persons or entities other than the intended recipient(s) is prohibited. If you received this in error, please contact the sender and delete the material from any computer. ***************************************************************** _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe You didn't do a whole lot when I tried to run it. I know I am being mean, but that seems to be what Tim Bray is doing. He takes code and if it doesnt work, he isn't spending 3 weeks to figure it out.
So, I would just like to comment. I ran your code against an access.log file and it gave me this:
[]
./a.out access.log
Which data set did you test it on?

http://www.tbray.org/tmp/o10k.ap is the basic data set. For heavier duty testing, folks seem to be appending it to itself 99 more times to yield a "o1000k.ap" dataset. I'd be curious for comments on my code or other suggestions to speed things up -- the strictness semantics of the mapUnionPar function seem pretty decent to me, but I'd like to find a way to give higher preference to evaluating later iterations of until as opposed to earlier ones (so as to improve memory performance) but can't think of any way to do that without explicit threads. Implementing memory mapped reads, as was suggested here recently in a different context, might be another big performance gain. On my decidedly not powerful machine (Mac PowerPC G5, 1.8GHz) I can't get much lower than 12.25s for the 1000k dataset (out of which, roughly 3s in GC), which is 192M, which is actually slower than his sample ruby implementation. :-(. I'm sure parallel processing will help quite a bit, however, as profiling indicates that most time is spent in the "count" function. Maps are a good choice for parallelism because they merge efficiently, but for the iterative aspect their performance leaves a lot to be desired. This seems evident in that even on a single processor, lower sizes of chunks, at least to a point, still improve overall performance, although this may possibly be equally an issue with space efficiency. I wonder if Haskell's lack of an efficient hashtable isn't hurting it here again too, but on the other hand for a real efficiency gain, switching to a custom-built trie that combined pattern matching and insertion into a single operation would probably be a significant win, and it would let us force unboxing ints too, for whatever that gains. --S On Nov 10, 2007, at 3:36 AM, Berlin Brown wrote:
Which data set did you test it on?

s.clover:
http://www.tbray.org/tmp/o10k.ap is the basic data set. For heavier duty testing, folks seem to be appending it to itself 99 more times to yield a "o1000k.ap" dataset. I'd be curious for comments on my code or other suggestions to speed things up -- the strictness semantics of the mapUnionPar function seem pretty decent to me, but I'd like to find a way to give higher preference to evaluating later iterations of until as opposed to earlier ones (so as to improve memory performance) but can't think of any way to do that without explicit threads. Implementing memory mapped reads, as was suggested here recently in a different context, might be another big performance gain.
On my decidedly not powerful machine (Mac PowerPC G5, 1.8GHz) I can't get much lower than 12.25s for the 1000k dataset (out of which, roughly 3s in GC), which is 192M, which is actually slower than his sample ruby implementation. :-(. I'm sure parallel processing will help quite a bit, however, as profiling indicates that most time is spent in the "count" function. Maps are a good choice for parallelism because they merge efficiently, but for the iterative aspect their performance leaves a lot to be desired. This seems evident in that even on a single processor, lower sizes of chunks, at least to a point, still improve overall performance, although this may possibly be equally an issue with space efficiency.
I wonder if Haskell's lack of an efficient hashtable isn't hurting it here again too, but on the other hand for a real efficiency gain, switching to a custom-built trie that combined pattern matching and insertion into a single operation would probably be a significant win, and it would let us force unboxing ints too, for whatever that gains.
Did you also try Bryan O'Sullivan's smp code, btw? http://www.serpentine.com/blog/2007/09/25/what-the-heck-is-a-wide-finder-any... -- Don

Sterling Clover wrote:
Maps are a good choice for parallelism because they merge efficiently, but for the iterative aspect their performance leaves a lot to be desired.
This is not consistent with my observations, I must say. What I've found to dominate the benchmark are straightforward string search and manipulation.
participants (6)
-
Bayley, Alistair
-
Berlin Brown
-
Bryan O'Sullivan
-
Don Stewart
-
manu
-
Sterling Clover