Re: [Haskell-cafe] strict version of Haskell - does it exist?

jsonLines :: C.Resource m => C.Conduit B.ByteString m Value
jsonLines = C.sequenceSink () $ do
    val <- CA.sinkParser json'
    CB.dropWhile isSpace_w8
    return $ C.Emit () [val]
Adding a \state -> (the way Felipe Lessa told me) makes it work, and it now runs in about 20 seconds, even though some conduit overhead is likely involved. Omitting my custom data type and operating on Aeson's Value with bytestrings reduces the running time to 16 seconds. PHP/C++ still wins: less than 12 seconds. Now I can imagine again that even a desktop multi-core system could beat a single-threaded C application. Thanks for your help. Maybe I can set up profiling again to understand why it still takes a little more time.
Marc Weber
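For reference, the fixed version looks roughly like this. It is only a sketch: it reuses the qualified names from the snippet above (C, CA, CB, B) without showing the imports, and assumes conduit 0.2's sequenceSink, which threads a state value through the sink.

jsonLines :: C.Resource m => C.Conduit B.ByteString m Value
jsonLines = C.sequenceSink () $ \state -> do
    -- parse one JSON value, skip the whitespace separating records,
    -- and emit the parsed value, threading the (unit) state through
    val <- CA.sinkParser json'
    CB.dropWhile isSpace_w8
    return $ C.Emit state [val]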

On Tue, Jan 31, 2012 at 1:36 PM, Marc Weber
Adding a \state -> (the way Felipe Lessa told me) makes it work, and it now runs in about 20 seconds, even though some conduit overhead is likely involved.
Just out of curiosity: did you use conduit 0.1 or 0.2? Cheers! =) -- Felipe.

Excerpts from Felipe Almeida Lessa's message of Tue Jan 31 16:49:52 +0100 2012:
Just out of curiosity: did you use conduit 0.1 or 0.2?
I updated to 0.2 today because I was looking for a Monad instance for SequenceSink, but didn't find it because I was trying to use it the wrong way (the missing \state ->; see my last mail).
I also tried json' vs json (the strict and non-strict versions); it didn't seem to make a big difference.
Marc Weber

Hi Everyone,
I had a similar experience with a similar type of problem. The
application was analyzing web pages that our web crawler had collected, or rather, not the pages themselves but metadata about when each page was collected.
The basic query was:
SELECT
    Domain, Date, COUNT(*)
FROM
    Pages
GROUP BY
    Domain, Date
The webpage data was split across tens of thousands of compressed binary files. I used enumerator to load these files and select the appropriate columns. This step was performed in parallel using parMap and worked fine once I figured out how to add the appropriate !s.
The second step was the group by. I built some tools on top of monad-par with the normal higher-level operators like map, groupBy, filter, etc. The typical pattern I followed was the map-reduce style used in monad-par. I was hoping to someday share this work, although I have since abandoned it.
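For comparison, the group-by step on its own can be written as one strict left fold over the selected columns. This is only a sketch, not the monad-par pipeline described above; the Domain and Date synonyms and countByDomainDate are invented for illustration, and Data.Map.Strict plus foldl' stand in for explicit strictness annotations.

{-# LANGUAGE BangPatterns #-}
import qualified Data.ByteString as B
import Data.List (foldl')
import qualified Data.Map.Strict as M

type Domain = B.ByteString
type Date   = B.ByteString

-- SELECT Domain, Date, COUNT(*) FROM Pages GROUP BY Domain, Date
-- as a single strict fold: foldl' forces the accumulator and the
-- strict map forces each count, so no thunks pile up.
countByDomainDate :: [(Domain, Date)] -> M.Map (Domain, Date) Int
countByDomainDate = foldl' step M.empty
  where
    step !acc key = M.insertWith (+) key 1 acc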
It took me a couple of weeks to get the strictness mostly right. I say mostly because it still randomly blows up: if I feed in a single 40 KB file, maybe one time in ten it consumes all the memory on the machine in a few seconds. There is obviously a laziness bug in there somewhere, but after working on it for a few days and failing to come up with a solid repro case, I eventually built all the web page analysis tools in Scala, in large part because I did not see a way forward and needed to tie off that work and move on.
My observations:
Combining laziness and parallelism made it very difficult to reason about what was going on. Test cases became non-deterministic, not in terms of their output in the success case but in whether they ran at all.
The tooling around laziness does not give enough information for debugging complex problems. Because of this, when people ask "Is Haskell good for parallel development?" I tell them the answer is complicated. Haskell has excellent primitives for parallel development, like STM, which I love, but it lacks a fully built-out PLINQ-like toolkit for flexible parallel data processing.
The other thing is that deepseq is very important. IMHO this needs to be a first-class language feature, with all major libraries shipping deepseq instances. There seems to have been some movement on this front, but you can't do serious parallel development without it.
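A tiny example of what deepseq buys here (the PageStat type and analyse function are made up for illustration; only the deepseq and parallel packages are assumed): with parMap and a plain rseq strategy each spark only reaches weak head normal form, so the real work can leak back onto one core, whereas rdeepseq, which requires an NFData instance, forces the whole value inside the spark.

import Control.DeepSeq (NFData (..))
import Control.Parallel.Strategies (parMap, rdeepseq)

data PageStat = PageStat { psDomain :: String, psCount :: Int }

instance NFData PageStat where
    rnf (PageStat d c) = rnf d `seq` rnf c

-- rdeepseq forces each PageStat completely inside its own spark; with
-- plain rseq the spark would stop at the constructor and the sum in
-- psCount would be evaluated later, sequentially, by the consumer.
analyse :: [(String, [Int])] -> [PageStat]
analyse = parMap rdeepseq (\(d, hits) -> PageStat d (sum hits))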
One thing that might help would be a vim plugin that showed the strictness of operations and data. I am going to take another crack at a PLINQ-like library with GHC 7.4.1 in the next couple of months, using the debug symbols that Peter has been working on.
Conclusion:
Haskell was the wrong platform for doing webpage analysis anyhow, not because anything is wrong with the language but simply because it does not have the tooling that the JVM does. I moved all my work into Hadoop to take advantage of multi-machine parallelism and higher-level tools like Hive. There might be a future in building Haskell code that could be translated into a Hive query.
With better tools I think Haskell can become the go-to language for developing highly parallel software. We just need tools that help developers better understand the laziness of their software. There also seems to be a documentation gap around developing data analysis and data transformation pipelines in Haskell.
Sorry for the length. I hope my experience is useful to someone.
Steve

On Tue, Jan 31, 2012 at 9:19 PM, Steve Severance
The other thing is that deepseq is very important. IMHO this needs to be a first-class language feature, with all major libraries shipping deepseq instances. There seems to have been some movement on this front, but you can't do serious parallel development without it.
I completely agree on the first part, but deepseq is not a panacea either.
It's a big hammer and overuse can sometimes cause wasteful O(n) no-op
traversals of already-forced data structures. I also definitely wouldn't go
so far as to say that you can't do serious parallel development without it!
The only real solution to problems like these is a thorough understanding
of Haskell's evaluation order, and how and why call-by-need is different
than call-by-value. This is both a pedagogical problem and genuinely hard
-- even Haskell experts like the guys at GHC HQ sometimes spend a lot of
time chasing down space leaks. Haskell makes a trade-off here; reasoning
about denotational semantics is much easier than in most other languages
because of purity, but non-strict evaluation makes reasoning about
operational semantics a little bit harder.
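The textbook illustration of that operational gap, for concreteness: the two definitions below denote the same number, but under call-by-need (and compiled without optimisation, where GHC's strictness analyser does not step in) the first builds a long chain of addition thunks and can exhaust the heap, while the second forces its accumulator at every step and runs in constant space.

import Data.List (foldl')

leakySum, strictSum :: [Int] -> Int
leakySum  = foldl  (+) 0   -- builds a chain of (+) thunks under call-by-need
strictSum = foldl' (+) 0   -- forces the accumulator at every step

main :: IO ()
main = print (strictSum [1 .. 10000000])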
In domains where you care a lot about operational semantics (like parallel
and concurrent programming, where it's absolutely critical), programmers
necessarily require a lot more experience and knowledge in order to be
effective in Haskell.
G
--
Gregory Collins

On Tue, Jan 31, 2012 at 1:22 PM, Gregory Collins
I completely agree on the first part, but deepseq is not a panacea either. It's a big hammer and overuse can sometimes cause wasteful O(n) no-op traversals of already-forced data structures. I also definitely wouldn't go so far as to say that you can't do serious parallel development without it!
I agree. The only time I ever use deepseq is in Criterion benchmarks, as it's a convenient way to make sure that the input data is evaluated before the benchmark starts. If you want a data structure to be fully evaluated, evaluate it as it's created, not after the fact.
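A small sketch of that advice (the function names are invented; only containers' Data.Map.Lazy/Strict and deepseq's force are assumed): forcing after the fact pays an extra O(n) traversal over a map full of (+) thunks, whereas a strict map evaluates each count at insert time, leaving nothing for deepseq to do.

import Control.DeepSeq (force)
import Data.List (foldl')
import qualified Data.Map.Lazy as ML
import qualified Data.Map.Strict as MS

-- After the fact: the lazy map accumulates (+) thunks in its values,
-- then force walks the whole structure once more to evaluate them.
countsAfterTheFact :: [(String, Int)] -> ML.Map String Int
countsAfterTheFact =
    force . foldl' (\m (k, v) -> ML.insertWith (+) k v m) ML.empty

-- As it's created: the strict map evaluates each value on insert,
-- so there is nothing left to deepseq when the fold finishes.
countsAsCreated :: [(String, Int)] -> MS.Map String Int
countsAsCreated = foldl' (\m (k, v) -> MS.insertWith (+) k v m) MS.empty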
The only real solution to problems like these is a thorough understanding of Haskell's evaluation order, and how and why call-by-need is different than call-by-value. This is both a pedagogical problem and genuinely hard -- even Haskell experts like the guys at GHC HQ sometimes spend a lot of time chasing down space leaks. Haskell makes a trade-off here; reasoning about denotational semantics is much easier than in most other languages because of purity, but non-strict evaluation makes reasoning about operational semantics a little bit harder.
+1. We can do a much better job of teaching how to reason about performance. A few rules of thumb get you a long way. I'm (slowly) working on improving the state of affairs here. -- Johan

http://www.vex.net/~trebla/haskell/lazy.xhtml It is half done.

On Tue, Jan 31, 2012 at 12:19 PM, Steve Severance
The webpage data was split across tens of thousands of compressed binary files. I used enumerator to load these files and select the appropriate columns. This step was performed in parallel using parMap and worked fine once I figured out how to add the appropriate !s.
Even though they are advertised as parallel programming tools, parMap and friends work in parallel over *sequential*-access data structures (i.e. linked lists), which limits how much speedup they can deliver. We want flat, strict, unpacked data structures to get good performance out of parallel algorithms. DPH, repa, and even vector show the way. -- Johan
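For instance (a sketch with invented names, using plain vector and the parallel package rather than DPH or repa): sparking per cons cell gives each spark a tiny amount of work behind a pointer, while splitting a flat unboxed vector into a few chunks (e.g. one per core) gives each spark a contiguous, cache-friendly slice to reduce.

import Control.Parallel.Strategies (parMap, rdeepseq, rseq)
import qualified Data.Vector.Unboxed as U

-- One spark per cons cell: tiny work items reached by pointer chasing.
sumSquaresList :: [Double] -> Double
sumSquaresList xs = sum (parMap rdeepseq (\x -> x * x) xs)

-- A handful of sparks, each reducing a contiguous unboxed slice.
sumSquaresVec :: Int -> U.Vector Double -> Double
sumSquaresVec chunks v =
    sum (parMap rseq (U.sum . U.map (\x -> x * x)) (slices v))
  where
    n = max 1 ((U.length v + chunks - 1) `div` max 1 chunks)
    slices u
        | U.null u  = []
        | otherwise = let (a, b) = U.splitAt n u in a : slices b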

Even though they are advertised as parallel programming tools, parMap and friends work in parallel over *sequential*-access data structures (i.e. linked lists), which limits how much speedup they can deliver. We want flat, strict, unpacked data structures to get good performance out of parallel algorithms. DPH, repa, and even vector show the way.
You would think that tree data structures would be good here as well. For example, monad-par includes a definition of an append-based "AList" (like the one Guy Steele argues for). But alas, that turns out to be much harder to get working well. For most algorithms, Vectors end up better. -Ryan
participants (7)
- Albert Y. C. Lai
- Felipe Almeida Lessa
- Gregory Collins
- Johan Tibell
- Marc Weber
- Ryan Newton
- Steve Severance