Hi Everyone,
I had a similar experience with a similar type of problem. The application was analyzing web pages that our web crawler had collected, or rather not the pages themselves but metadata about when each page was collected.
The basic query was:
SELECT
Domain, Date, COUNT(*)
FROM
Pages
GROUP BY
Domain, Date
The web page data was split across tens of thousands of compressed binary files. I used enumerator to load these files and select the appropriate columns. This step was performed in parallel using parMap and worked fine once I figured out where to add the appropriate !s (strictness annotations).
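Roughly, that first step had the shape sketched below. This is from memory, not the original enumerator code; the real files were compressed binary, so parseRows and the tab-separated layout here are made-up stand-ins. The important part is parMap rdeepseq, which sparks one task per file and forces each result all the way down, so the sparks do the actual work instead of handing back thunks.

import Control.Parallel.Strategies (parMap, rdeepseq)
import qualified Data.ByteString.Char8 as B

type Row = (B.ByteString, B.ByteString)   -- (Domain, Date)

-- Stand-in for decompressing one file and keeping only the two columns.
parseRows :: B.ByteString -> [Row]
parseRows = map split . B.lines
  where
    split l = let (d, rest) = B.break (== '\t') l
              in (d, B.drop 1 rest)       -- drop the tab itself

selectColumns :: [FilePath] -> IO [[Row]]
selectColumns paths = do
  contents <- mapM B.readFile paths
  -- one spark per file; rdeepseq forces every row list to normal form
  return (parMap rdeepseq parseRows contents)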
The second step was the GROUP BY. I built some tools on top of monad-par that provided the usual higher-level operators like map, groupBy, filter, and so on. The typical pattern I followed was the map-reduce style used in monad-par. I was hoping to share this work someday, although I have since abandoned it.
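The GROUP BY itself followed the usual map-reduce shape: count each chunk of rows independently, then merge the partial maps. A rough sketch using plain monad-par primitives rather than my library (Row is the same made-up type as above):

import Control.Monad.Par (runPar, spawn, get)
import qualified Data.Map.Strict as M
import qualified Data.ByteString.Char8 as B

type Row    = (B.ByteString, B.ByteString)   -- (Domain, Date)
type Counts = M.Map Row Int

-- the "map" side: count one chunk of rows
countChunk :: [Row] -> Counts
countChunk = M.fromListWith (+) . map (\r -> (r, 1))

-- the "reduce" side: one task per chunk, then merge the partial maps.
-- spawn fully forces each Counts map when it is put into its IVar (it
-- needs an NFData instance), which is exactly where the strictness
-- questions start.
groupByCount :: [[Row]] -> Counts
groupByCount chunks = runPar $ do
  ivars    <- mapM (spawn . return . countChunk) chunks
  partials <- mapM get ivars
  return (M.unionsWith (+) partials)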
It took me a couple of weeks to get the strictness mostly right. I say mostly because it still randomly blows up: if I feed in a single 40 KB file, maybe one time in ten it consumes all the memory on the machine within a few seconds. There is obviously a laziness bug in there somewhere, but after working on it for a few days and failing to come up with a solid repro case, I eventually rebuilt all the web page analysis tools in Scala, in large part because I did not see a way forward and needed to tie off that work and move on.
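For what it's worth, the class of bug I suspect is the textbook lazy-accumulator leak: a fold that quietly builds a chain of unevaluated (+) thunks until something finally demands the result and memory use explodes. A made-up sketch of the shape, not my actual code:

import Data.List (foldl')
import qualified Data.Map as LazyMap
import qualified Data.Map.Strict as StrictMap

-- lazy foldl plus lazy Map values: every insert and (+) is deferred,
-- so the accumulator is really a growing tower of thunks
leaky :: [(String, Int)] -> LazyMap.Map String Int
leaky = foldl (\m (k, v) -> LazyMap.insertWith (+) k v m) LazyMap.empty

-- a strict fold over a value-strict Map keeps the accumulator evaluated
-- as you go
safer :: [(String, Int)] -> StrictMap.Map String Int
safer = foldl' (\m (k, v) -> StrictMap.insertWith (+) k v m) StrictMap.empty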
My observations:
Combining laziness and parallelism made it very difficult to reason about what was going on. Test cases became non-deterministic, not in terms of output in the success case, but in whether they finished at all.
The tooling around laziness does not give enough information for debugging complex problems. Because of this, when people ask "Is Haskell good for parallel development?" I tell them the answer is complicated. Haskell has excellent primitives for parallel development, like STM, which I love, but it lacks a fully built-out PLINQ-like toolkit for flexible parallel data processing.
The other thing is that deepseq is very important. IMHO this needs to be a first-class language feature, with all major libraries shipping deepseq (NFData) instances. There seems to have been some movement on this front, but you can't do serious parallel development without it.
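To illustrate what I mean (PageMeta is a made-up type): seq only reaches the outermost constructor, so without an NFData instance a record can carry unevaluated fields straight through a par or a put, and the work lands on whichever thread happens to look at it later.

import Control.DeepSeq (NFData (..), deepseq)

data PageMeta = PageMeta
  { pmDomain :: String
  , pmDate   :: String
  }

-- every type you want to force in a parallel pipeline needs one of these
instance NFData PageMeta where
  rnf (PageMeta d t) = rnf d `seq` rnf t

shallow, deep :: PageMeta -> ()
shallow p = p `seq` ()        -- stops at the PageMeta constructor
deep    p = p `deepseq` ()    -- forces the Strings inside as well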
One idea that might help would be a vim plugin that showed the level of strictness of operations and data. I am going to take another crack at a PLINQ-like library with GHC 7.4.1 in the next couple of months, using the debug symbols that Peter has been working on.
Conclusion:
Haskell was the wrong platform for doing web page analysis anyhow, not because anything is wrong with the language, but simply because it does not have the tooling that the JVM does. I moved all my work onto Hadoop to take advantage of multi-machine parallelism and higher-level tools like Hive. There might be a future in building Haskell code that could be translated into a Hive query.
With better tools I think Haskell can become the go-to language for developing highly parallel software. We just need the tools to help developers better understand the laziness of their software. There also seems to be a documentation gap around developing data analysis and data transformation pipelines in Haskell.
Sorry for the length. I hope my experience is useful to someone.
Steve
On Tue, Jan 31, 2012 at 7:57 AM, Marc Weber
<marco-oweber@gmx.de> wrote:
> Excerpts from Felipe Almeida Lessa's message of Tue Jan 31 16:49:52 +0100 2012:
> > Just out of curiosity: did you use conduit 0.1 or 0.2?
> I updated to 0.2 today because I was looking for a monad instance for
> SequenceSink - but didn't find it cause I tried using it the wrong way
> (\state -> see last mail)
> I also tried json' vs json (strict and non strict versions) - didn't
> seem to make a big difference.
> Marc Weber