Google Summer of Code student application deadline approaches

Hi,

The Summer of Code student application deadline is April 3rd and we need more applications! If you have an idea that you would like to hack on this summer, hurry up and apply.

Here's an idea that could be implemented in a summer:

A Sawzall [1]-like library that processes large volumes of data using monoids to compute aggregate statistics. Foldable makes this possible for smaller data sets that fit in memory, but many interesting data sets are tens or hundreds of gigabytes in size. A simple API with a high-performance implementation would make Haskell a nice data analysis tool.

Here's a strawman interface for such a library:

    -- | Given a file of log records, compute aggregate statistics by converting each record
    -- to a monoid @m@ and combining the resulting monoids using 'mappend'.
    fold :: (Record r, Monoid m) => (r -> m) -> FilePath -> IO m

There are lots of interesting optimizations that could be done. Starting with an efficient single-threaded implementation using ByteString, you could add the ability to either process many files in parallel or split one file into many chunks and process each chunk in parallel. The Wide Finder 2 [2] challenge has a fast OCaml implementation of the latter strategy. One could take the library further by running the processing on multiple machines, as in the Google Sawzall implementation.

1. http://research.google.com/archive/sawzall.html
2. http://www.tbray.org/ongoing/When/200x/2008/05/01/Wide-Finder-2

Cheers,
Johan
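
To make the strawman interface above a bit more concrete, here is a minimal single-threaded sketch of how it might look. Everything beyond the type signature is an assumption made for illustration: the `Record` class method `parseRecord`, the one-record-per-line file format, the `Hit` record type and the names `foldFile` and `summarize` are all hypothetical. It reads the file as a lazy ByteString so the whole file never has to sit in memory at once, and it silently skips lines that fail to parse.

```haskell
import qualified Data.ByteString.Char8 as B
import qualified Data.ByteString.Lazy.Char8 as BL
import Data.List (foldl')
import Data.Monoid (Sum(..))

-- | Hypothetical class: how to parse one line of a log file into a record.
class Record r where
  parseRecord :: B.ByteString -> Maybe r

-- | Sketch of the strawman 'fold' (renamed here to 'foldFile'): map every
-- record to a monoid and combine the results with 'mappend'.
-- Assumes one record per line; unparseable lines are ignored.
foldFile :: (Record r, Monoid m) => (r -> m) -> FilePath -> IO m
foldFile f path = do
  contents <- BL.readFile path  -- lazy read: chunks are loaded on demand
  let step acc line = case parseRecord (BL.toStrict line) of
        Just r  -> acc `mappend` f r
        Nothing -> acc
  -- foldl' forces the accumulator to weak head normal form at each step;
  -- '$!' makes sure the file is fully consumed before we return.
  return $! foldl' step mempty (BL.lines contents)

-- | Example record (an assumption): a line of the form "<path> <bytes>".
data Hit = Hit { hitPath :: B.ByteString, hitBytes :: Int }

instance Record Hit where
  parseRecord line = case B.words line of
    [p, n] -> case B.readInt n of
      Just (b, rest) | B.null rest -> Just (Hit p b)
      _                            -> Nothing
    _ -> Nothing

-- | Count the hits and sum the bytes served, using a pair of 'Sum' monoids.
summarize :: FilePath -> IO (Int, Int)
summarize path = do
  (Sum n, Sum total) <- foldFile (\h -> (Sum 1, Sum (hitBytes h))) path
  return (n, total)
```

Because `mappend` is associative, the same `step` could in principle be run over separate files or chunks and the partial results combined afterwards, which is what would make the parallel variants possible. Note that `foldl'` only forces the outer constructor of the accumulator, so a real implementation would probably want a strict pair to avoid building up thunks.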
Johan Tibell