strict version of Haskell - does it exist?

A lot of work has gone into GHC and its libraries. However, for some use cases C is still preferred, for obvious speed reasons: optimizing a Haskell application can take much time. Is there any document describing why there is no ghc --strict flag making all code strict by default? Wouldn't this make it easier to apply Haskell to some additional fields, such as video processing? Wouldn't such a '--strict' flag turn Haskell/GHC into a better C/gcc compiler? Projects like this: https://github.com/thoughtpolice/strict-ghc-plugin show that the idea is not new.

E.g. some time ago I had to do some logfile analysis. I ended up doing it in PHP because optimizing the Haskell code took too much time.

Marc Weber

Generally strict Haskell means using strict data types - vectors, arrays,
bytestrings, intmaps where required.
However, you usually don't want all code and data strict, all the time,
since laziness/on-demand eval is critical for deferring non-essential work.
Summary: -fstrict wouldn't magically make your code good. Using the right
balance of strict and lazy code, via the right choice of strict and lazy
types, however, often does.
I'd be interested to know what choices in your log-file case led you into
problems -- using something excessively lazy (like lazy lists) or
something excessively strict (like strict bytestrings) would both be
suboptimal for log analysis. A hybrid type like a lazy bytestring would be
more appropriate.
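(For illustration, a minimal sketch of such a hybrid - not code from this thread: a lazy ByteString streams the file in chunks while a strict fold keeps the accumulator small; insertWith' is the strict insert from the containers API of the era.)

    import qualified Data.ByteString.Lazy.Char8 as L
    import qualified Data.Map as M
    import Data.List (foldl')

    -- Count occurrences of the first word of every log line.  The lazy
    -- ByteString reads the file chunk by chunk; foldl' plus insertWith'
    -- keep the accumulating Map free of thunks.
    countFirstFields :: FilePath -> IO (M.Map L.ByteString Int)
    countFirstFields fp = do
      contents <- L.readFile fp
      let step m ln = case L.words ln of
            (w:_) -> M.insertWith' (+) w 1 m
            []    -> m
      return $! foldl' step M.empty (L.lines contents)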

Excerpts from Don Stewart's message of Sun Jan 29 22:55:08 +0100 2012:
Summary: -fstrict wouldn't magically make your code good.
No, you're right - I don't expect that to happen. I agree that it is always the programmer's fault: using the wrong tools, or not knowing the tools well enough to get the job done.
The PHP code looks like this:

    foreach (glob('*.txt') as $file) {
        foreach (split("\n", file_get_contents($file)) as $line) {
            $parsed_line = json_decode($line);
            // get some unix timestamps, keep some hashes of seen clients
            // (cookie ids) and such
            // check how many minutes before a checkout the customer visited
            // the site - and whether he did so for a couple of days
        }
    }
    // print result

The files are about 300 MB in size. However, the Haskell version's memory usage grew and grew and grew - I had to kill it or limit the number of files. The PHP code runs in a couple of seconds (parsing JSON and loading files); the Haskell app took much longer. That PHP is fast is no surprise: I expect json_decode and split to be implemented in C.

So yes - I used lazy lists. However, 8 GB of RAM should have been enough to keep things in RAM. So maybe the JSON parsing library also kept too many unevaluated things in memory. So I could either start writing my own JSON parsing library (being more strict) or profile the application many times - but I don't want to.

Ignoring the JSON parsing, I also gave conduits a try - only counting lines. I know it's experimental - but from its description I concluded it would prevent me, a stupid Haskell programmer, from taking too much memory even using bad Haskell code. However, it looked like splitting the lines and only counting them (recognizing UTF-8 chars) took a lot longer than also doing the JSON parsing in PHP. Then the conduit implementation looked funny: hGetLine is used to feed a line each time ... (luckily - because the utf8 libraries don't provide nice ways to parse incomplete chunks of UTF-8 bytes, such as returning Either IncompleteMissingByte UTF8Chunk). Probably the split(,"\n") in PHP doesn't parse UTF-8 chars - and luckily it doesn't seem to matter, because \n only uses one byte.

I know that I'm not a Haskell expert. I still got the impression that getting nice performance would be no small challenge and would require much more time than I spent on the PHP version. That's why I'm wondering why there is no -fstrict option for such simple use cases, because Haskell has so many optimizations other languages dream about. I mean, lots of "non-lazy" languages still get their jobs done. So it always depends on the use case. Isn't it easy to add a compiler flag to GHC adding those ! strictness annotations everywhere possible? Then simple use cases like this would not be a huge challenge. Maybe you're right: I should just prepare some dummy files and ask the community to help. However, optimizing the JSON parser and my code just seemed to be too much effort.

Marc Weber
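(A guess at the kind of pitfall involved - my sketch, not Marc's code: with a lazy accumulator, every line adds a suspended computation, so the whole input stays live in memory.)

    import Data.List (foldl')

    -- foldl (lazy) suspends every (+1): nothing is reduced until the
    -- final result is demanded, so memory grows with the input.
    countLinesLazy :: [String] -> Int
    countLinesLazy = foldl (\n _ -> n + 1) 0

    -- foldl' reduces the accumulator at each step: constant space.
    countLinesStrict :: [String] -> Int
    countLinesStrict = foldl' (\n _ -> n + 1) 0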

On Sun, 2012-01-29 at 23:47 +0100, Marc Weber wrote:
So maybe also the JSON parsing library kept too many unevaluated things in memory. So I could start either writing my own JSON parsing library (being more strict)
Just FYI, aeson gained strict parser variants json' and value' [1] some time ago, so you shouldn't have to write your own stricter JSON parsing library...

[1]: http://hackage.haskell.org/packages/archive/aeson/0.6.0.0/doc/html/Data-Aeso...

On Mon, Jan 30, 2012 at 6:21 AM, Herbert Valerio Riedel
On Sun, 2012-01-29 at 23:47 +0100, Marc Weber wrote:
So maybe also the JSON parsing library kept too many unevaluated things in memory. So I could start either writing my own JSON parsing library (being more strict)
Just FYI, aeson gained strict parser variants json' and value' some time ago, so you shouldn't have to write your own stricter JSON parsing library...
Also, besides using those variants, you may also use the attoparsec-conduit library [1]. If you have

    processJson :: Value -> IO X

then you'd need just something like

    import Control.Monad (forM_)
    import Data.Aeson (Value, json')
    import Data.Attoparsec.Char8 (isSpace_w8)
    import qualified Data.ByteString as B
    import qualified Data.Conduit as C
    import qualified Data.Conduit.Attoparsec as CA
    import qualified Data.Conduit.Binary as CB
    import qualified Data.Conduit.List as CL

    main = do
      ...
      ret <- forM_ fileList $ \fp -> do
        C.runResourceT $
          CB.sourceFile fp
          C.$= jsonLines
          C.$= CL.mapM processJson
          C.$$ CL.consume
      print ret

    -- Parse one JSON value, then skip the whitespace (newline) after it.
    jsonLines :: C.Resource m => C.Conduit B.ByteString m Value
    jsonLines = C.sequenceSink () $ do
        val <- CA.sinkParser json'
        CB.dropWhile isSpace_w8
        return $ C.Emit () [val]

This code is extremely resource-friendly, since (a) you can't leak file descriptors and (b) you just have to make sure that the processJson function isn't too lazy. It should be quite fast as well. Cheers! =)

[1] http://hackage.haskell.org/package/attoparsec-conduit

-- Felipe.

On Mon, Jan 30, 2012 at 10:13 AM, Marc Weber
A lot of work has been gone into GHC and its libraries. However for some use cases C is still preferred, for obvious speed reasons - because optimizing an Haskell application can take much time.
As much as in any other high-level language, I guess. Don't compare apples to oranges and complain that oranges aren't crunchy enough ;)
Is there any document describing why there is no ghc --strict flag making all code strict by default?
Yes -- it's called the Haskell Report :) GHC does a lot of optimization already. If making something strict won't change how a program behaves, GHC will often make it strict automatically, using a process called strictness analysis. The reason why there is no --strict flag is that strictness isn't just something you turn on and off willy-nilly: it changes how the whole language works. Structures such as infinite lists and Don Stewart's lazy bytestrings *depend* on laziness for their performance.
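(A tiny example of my own: this terminates only because of laziness.)

    -- 'take' demands just the first five cells, so the infinite list
    -- is never fully evaluated.  A strict-by-default compiler would
    -- have to finish building 'evens' first, and never would.
    evens :: [Int]
    evens = filter even [1 ..]

    main :: IO ()
    main = print (take 5 evens)   -- [2,4,6,8,10]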
Wouldn't this make it easier to apply Haskell to some additional fields such as video processing etc?
Wouldn't such a '--strict' flag turn Haskell/GHC into a better C/gcc compiler?
See above.
Projects like this: https://github.com/thoughtpolice/strict-ghc-plugin show that the idea is not new.
Not sure what that does, but I'll have a look at it.
Eg some time ago I had to do some logfile analysis. I ended doing it in PHP because optimizing the Haskell code took too much time.
That's probably because you're using linked lists for strings. For intensive text processing, it's better to use the text package instead [1].

Chris

[1] http://hackage.haskell.org/package/text
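(A minimal sketch of that suggestion - the file name is hypothetical:)

    import qualified Data.Text as T
    import qualified Data.Text.IO as TIO

    -- text stores packed arrays rather than one cons cell (plus a boxed
    -- Char) per character, so bulk operations touch far less memory.
    main :: IO ()
    main = do
      contents <- TIO.readFile "access.log"
      print (length (T.lines contents))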

The strict-ghc-plugin (under my maintenance) is just a continuation of
one of the original demos Max had for plugin support in the compiler.
The idea is fairly simple: 'let' and 'case' are the forms for creating
lazy/strict bindings in Core. It just systematically replaces all
occurrences of 'let' in Core with 'case'. So 'let b = e1 in e2'
becomes 'case e1 of { b -> e2 }', making 'b' strict. It also replaces
all applications of the form 'App e1 e2' (which is also lazy, as e2 is
evaluated on demand) with an equivalent binding like 'case e2 of { x
-> App e1 x }'. Pretty simple, and results in a totally strict
program.
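(A source-level illustration of why this changes meaning, not just performance - my example, not Austin's. In Core, unlike in source Haskell, a case always evaluates its scrutinee, so the plugin's rewrite behaves like the seq below:)

    -- Ordinary lazy 'let': the dead binding is never demanded.
    lazyOk :: Int
    lazyOk = let b = error "boom" in 42               -- evaluates to 42

    -- What the let-to-case rewrite amounts to: the binding is forced
    -- before the body is returned.
    strictBoom :: Int
    strictBoom = let b = error "boom" in b `seq` 42   -- throws "boom"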
The idea is just a proof of concept; in particular, I am not claiming
(and likely Max isn't either, although I cannot speak for him) that you
sometimes want everything strict. You don't; at some point, you're not
even using Haskell anymore, I suppose (remember: non-strict semantics).
I can't think of any instance in which I would need or want to use this
plugin, honestly. But maybe someone else would - I did refactor it so
that you can strictify individual functions, as opposed to full-blown
modules, via annotations. So you could selectively strictify things if
you found it beneficial on certain identifiers. But then there's the
question of what effect that has on the rest of GHC's optimizers, which
I can't answer: the strictifier modifies the pipeline to be the *first*
pass, and the remaining ones run afterwards. Compilers are built on
heuristics and built for 'average' code. Sometimes these heuristics
interact in odd ways, especially with code that may deviate from 'the
norm.' Once you're fighting the optimizer, it can become a very
difficult battle to win. Careful analysis and selective optimization is
probably going to take you farther than hitting it with a giant hammer.
Having lazy and strict data structures and knowing when/where to use
them is crucial for good performance, and both have their merits (same
with every other thing under the sun, like by-ref/by-val semantics in
`data` types, which you can control with UNPACK etc.) I think we could
most certainly use better tools for analyzing low-level performance
details and the tradeoff between strictness and laziness (especially in
large codebases), but I don't think systematically making everything
strict is going to be the right idea in the vast majority of
situations.
-- Regards, Austin

Marc Weber
A lot of work has gone into GHC and its libraries. However, for some use cases C is still preferred, for obvious speed reasons: optimizing a Haskell application can take much time.
Is there any document describing why there is no ghc --strict flag making all code strict by default? Wouldn't this make it easier to apply Haskell to some additional fields, such as video processing?
Wouldn't such a '--strict' flag turn Haskell/GHC into a better C/gcc compiler?
Projects like this: https://github.com/thoughtpolice/strict-ghc-plugin show that the idea is not new.
E.g. some time ago I had to do some logfile analysis. I ended up doing it in PHP because optimizing the Haskell code took too much time.
First of all, /learning/ to optimize Haskell can be difficult. The optimizing itself is actually fairly easy in my experience, once you understand how the language works. Usually the nonstrictness is no bottleneck. However, you have to know that you are in a nonstrict language. In fact, I find myself having difficulties writing efficient code in a strict language.

Now to answer your question: a strict-by-default Haskell comes with the implication that you can throw away most of the libraries, including the base library. So yes, a strict-by-default Haskell is very well possible, but the question is whether you actually want that. I wouldn't, because a lot of my code relies on the standard semantics. I would also expect problems with the way Haskell performs I/O, because it would mean that forever (putStrLn "Hello world") would cause a heap overflow, if Haskell were strict. Note that we don't have control structures. We have combinators, and their nonstrictness is essential.

The flag you are proposing would turn Haskell into a language that is different enough that you couldn't do many useful things with it. If you want to save the time to learn how to write efficient Haskell programs, you may want to have a look into the Disciple language. You will find that it has a different type system, which captures side effects explicitly to make a pure strict language even possible.

Greets, Ertugrul

-- nightmare = unsafePerformIO (getWrongWife >>= sex) http://ertes.de/
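(A sketch of the forever point - myForever is a hypothetical name, and this is essentially, though not verbatim, how the real combinator is written:)

    -- The recursive 'rest' is a lazy binding: it is only demanded after
    -- the first action has run.  Evaluated strictly, building 'rest'
    -- already requires building 'rest', so the program would exhaust
    -- the heap before printing anything.
    myForever :: Monad m => m a -> m b
    myForever a = let rest = a >> rest in rest

    main :: IO ()
    main = myForever (putStrLn "Hello world")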

On Sun, Jan 29, 2012 at 11:25:09PM +0100, Ertugrul Söylemez wrote:
First of all, /learning/ to optimize Haskell can be difficult. The optimizing itself is actually fairly easy in my experience, once you understand how the language works.
Given the fact that you have obviously mastered the learning part of this, do you have any recommendations on what to read or on how to proceed in order to learn how to optimize Haskell code? I can imagine it's not only about understanding the language itself, but also about understanding how your compiler and its switches work, being able to find the hot spots in your code, being able to measure the effects of your changes, developing a good sense for the tradeoffs, etc. So far, I have only stumbled upon chapter 25 of Real World Haskell. Anything else you might recommend?

regards

Alex

Alexander Bernauer
On Sun, Jan 29, 2012 at 11:25:09PM +0100, Ertugrul Söylemez wrote:
First of all, /learning/ to optimize Haskell can be difficult. The optimizing itself is actually fairly easy in my experience, once you understand how the language works.
Given the fact that you have obviously mastered the learning part of this, do you have any recommendations on what to read or on how to proceed in order to learn how to optimize Haskell code?
I can imagine it's not only about understanding the language itself, but also about understanding how your compiler and its switches work, being able to find the hot spots in your code, being able to measure the effects of your changes, developing a good sense for the tradeoffs, etc.
So far, I have only stumbled upon chapter 25 of Real World Haskell. Anything else you might recommend?
That's the tricky part about this. Unfortunately the monad tutorial fallacy [1] applies here. At some point it clicks and suddenly you see all the data dependencies. If you were schizophrenic, you would probably see little arrows all over your code. What helped me a bit was to read the Wikibooks page about Haskell's denotational semantics [2], but other than that I pretty much learned it by myself.

In any case it's helpful to use 'seq' explicitly instead of bang patterns. Personally I never use bang patterns, but always use seq, strict patterns or guards with strict functions. Example:

    add :: Integer -> Integer -> Integer
    add x 0 = x
    add x y = (add $! succ x) (pred y)

Here the pattern forces the second argument, so there is no need to force it yourself. However, nothing forces the first argument, so it needs to be forced manually, in this case by using ($!).

[1]: http://byorgey.wordpress.com/2009/01/12/abstraction-intuition-and-the-monad-tutorial-fallacy/
[2]: http://en.wikibooks.org/wiki/Haskell/Denotational_semantics

Greets, Ertugrul

-- nightmare = unsafePerformIO (getWrongWife >>= sex) http://ertes.de/
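(For comparison, the same strictness spelled with bang patterns - a variant of my own, not Ertugrul's:)

    {-# LANGUAGE BangPatterns #-}

    -- Each bang forces the accumulator on entry, so no chain of succ
    -- thunks can build up in x; equivalent to the ($!) version above.
    addBang :: Integer -> Integer -> Integer
    addBang !x 0 = x
    addBang !x y = addBang (succ x) (pred y)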

Hi, on Monday, 30.01.2012 at 10:52 +0100, Alexander Bernauer wrote:
On Sun, Jan 29, 2012 at 11:25:09PM +0100, Ertugrul Söylemez wrote:
First of all, /learning/ to optimize Haskell can be difficult. The optimizing itself is actually fairly easy in my experience, once you understand how the language works.
Given the fact that you have obviously mastered the learning part of this, do you have any recommendations on what to read or on how to proceed in order to learn how to optimize Haskell code?
I can imagine it's not only about understanding the language itself, but also about understanding how your compiler and its switches work, being able to find the hot spots in your code, being able to measure the effects of your changes, developing a good sense for the tradeoffs, etc.
So far, I have only stumbled upon chapter 25 of Real World Haskell. Anything else you might recommend?
Although I would not claim that I have mastered this, I did recently give a talk that touched some of these issues, and it also exhibits a case where you want something more fine-grained than just strictness or laziness. From your name I think it is safe to point you to a German document: http://www.joachim-breitner.de/blog/archives/539-Guest-lecture-on-Haskell-pe...

Greetings, Joachim

-- Joachim "nomeata" Breitner mail@joachim-breitner.de | nomeata@debian.org | GPG: 0x4743206C xmpp: nomeata@joachim-breitner.de | http://www.joachim-breitner.de/

On 29 Jan 2012, at 22:25, Ertugrul Söylemez wrote:
A strict-by-default Haskell comes with the implication that you can throw away most of the libraries, including the base library. So yes, a strict-by-default Haskell is very well possible, but the question is whether you actually want that. I wouldn't, because a lot of my code relies on the standard semantics.
At work, we have a strict version of Haskell, and indeed we do not use the standard libraries, but have built up our own versions of the ones we use. However, our compiler is smart enough to transform and optimise the source code *as if* it were non-strict: it is only at runtime that things are evaluated strictly. This means that, in general, you can rely on the standard semantics to a surprisingly large extent. For instance,

    maybe (error "foo") lines (Just "hello\nworld")

will succeed without calling error, just like in Haskell. Even if the final argument is supplied only at runtime, not statically, it will still do the right thing.

However, the downside of being strict at runtime is frequently poorer performance. Stack overflows are common if you use explicit recursion: it is better to use higher-order functions (map, fold, until) that are implemented at a lower level in C (i.e. not directly using explicit recursion themselves). This is a good thing of course - thinking of data structures in the aggregate, rather than piecemeal. However, bulk operations do transform the entire data structure, not merely the fragments that are needed for the onward computation, so it can often be a net performance loss. The standard lazy computational paradigm of generate-and-test is therefore hideously expensive, on occasion.

Regards, Malcolm
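(To make the last point concrete - my sketch, not Malcolm's: the classic generate-and-test shape is cheap lazily but materialises every intermediate list under strict evaluation.)

    -- Lazily this is a single fused pass that stops after the fifth
    -- hit; evaluated strictly, the enumeration, the mapped list and
    -- the filtered list are each built in full first.
    candidates :: [Int]
    candidates = take 5 (filter (> 1000) (map (^ 2) [1 .. 1000000]))

    main :: IO ()
    main = print candidates   -- [1024,1089,1156,1225,1296]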

On Mon, Jan 30, 2012 at 6:24 AM, Malcolm Wallace
On 29 Jan 2012, at 22:25, Ertugrul Söylemez wrote:
A strict-by-default Haskell comes with the implication that you can throw away most of the libraries, including the base library. So yes, a strict-by-default Haskell is very well possible, but the question is whether you actually want that. I wouldn't, because a lot of my code relies on the standard semantics.
At work, we have a strict version of Haskell, and indeed we do not use the standard libraries, but have built up our own versions of the ones we use. However, our compiler is smart enough to transform and optimise the source code *as if* it were non-strict: it is only at runtime that things are evaluated strictly. This means that, in general, you can rely on the standard semantics to a surprisingly large extent.
I wanted to emphasize Malcolm's point here. Optimizing using the original Haskell semantics turned out to be pretty important back when I was working on Eager Haskell. For example, a lot of Haskell functions are written in the following style:

    f a b c
      | guard1 d  = g h i
      | guard2 e  = h
      | guard3 f  = i
      | otherwise = j
      where
        d = ...expensive...
        e = ...expensive...
        f = ...expensive...
        g = ...expensive...
        h = ...expensive...
        i = ...expensive...
        j = ...expensive...

In an ordinary procedural language, where function calls in g, h, i, and j might have side effects, we can't sink bindings down to the point of use. Even in the absence of side effects, we have to account for the fact that some of these computations are used in some -- but not all -- right-hand sides, and that often we need to do some inlining to discover that a value isn't going to be used. It turns out Haskell code relies on this sort of thing all over the place, and simply coding around it leads to deeply-nested let bindings that walk off the right-hand edge of the screen. It's not difficult to rewrite most of the prelude functions in this style, but it's no longer pretty, and it's recognizably not idiomatic Haskell.

However, bulk operations do transform the entire data structure, not merely the fragments that are needed for the onward computation, so it can often be a net performance loss. The standard lazy computational paradigm of generate-and-test is therefore hideously expensive, on occasion.
This was a huge issue in Eager Haskell. By far our worst performance was on stream-like programs that generated infinite lists of results, and then sliced away the useless tail. With completely strict evaluation, this of course doesn't work at all, but it can be tricky to bound the required sizes of inputs even if you know how much of the output you want (imagine a call to takeWhile or filter on an infinite list). -Jan-Willem Maessen

Replying to all replies at once:
Malcolm Wallace: At work, we have a strict version of Haskell
:-) which proves that it is worth thinking about.

Ertugrul: If you want to save the time to learn how to write efficient Haskell programs, you may want to have a look into the Disciple language.
Yes - I will. It's been on my TODO list for at least 12 months :( Not sure whether there are parser combinator libraries yet.

@ Herbert Valerio Riedel (suggesting aeson): I gave it yet another try - see below. It still fails.

@ Felipe Almeida Lessa (suggesting conduits and attoparsec): I mentioned that I already tried it. Counting lines only was a lot slower than counting lines and parsing JSON using PHP.

@ Chris Wong: The reason why there is no --strict flag is that strictness isn't just something you turn on and off willy-nilly.
You're right. But those features are not required for all tasks :) E.g. have a look at Data.Map. Would a strict implementation look that different?
I came up with this now, trying strict bytestrings and aeson. Note the LB.fromChunks usage below to turn a strict into a lazy bytestring.

Result: the PHP script doing the same runs in 33 secs (using the AFindCheckoutsAndContacts branch). The Haskell program below - I stopped it after 8 min. (At least it no longer causes my system to swap. You're right: I could start profiling. I could learn about how to optimize it.) But why? The conclusion is that if I ever use yesod - and if I ever want to analyse logs - I should call PHP from yesod as an external process!? :-( Even if I had a core i7 (8 threads in parallel) I still would not be sure whether Haskell would be the right choice. I agree that I assume all data fits into memory, so piecewise evaluation doesn't matter too much.

Thanks for all your help - it proves once again that the Haskell community is the most helpful one I know. Yet I think that, me being the programmer, Haskell is the wrong choice for this task.

Thanks
Marc Weber

My new attempt - now I should start profiling. (Looks like I haven't built all libs with profiling support.)

    import Data.Aeson.Types
    import Data.Aeson
    import Data.List
    import Control.Applicative
    import Debug.Trace
    import qualified Data.Map as M
    import Action
    import Data.Aeson.Parser as AP
    import qualified Data.ByteString.Lazy as LB
    import qualified Data.ByteString as BS
    import qualified Data.ByteString.Char8 as BSC8

    data Action = ACountLine | AFindCheckoutsAndContacts

    -- lines look like this:
    -- {"id":"4ee535f01550c","r":"","ua":"Mozilla\/5.0 (compatible; bingbot\/2.0; +http:\/\/www.bing.com\/bingbot.htm)","t":1323644400,"k":"XXX","v":"YY"}
    data Item = Item
      { id :: BS.ByteString
      , ua :: BS.ByteString
      , t  :: Int
      , k  :: BS.ByteString
      , v  :: BS.ByteString
      }

    instance FromJSON Item where
      parseJSON (Object v) = Item
        <$> v .: "id"
        <*> v .: "ua"
        <*> v .: "t"
        <*> v .: "k"
        <*> v .: "v"
      parseJSON _ = empty

    run :: Action -> [FilePath] -> IO ()
    run AFindCheckoutsAndContacts files = do
      -- find all ids querying the server more than 100 times.
      -- do so by building a map counting queries
      (lines :: [BS.ByteString])
          <- fmap (concat . map BSC8.lines) $ mapM BS.readFile files
      let processLine :: M.Map BS.ByteString Int -> BS.ByteString -> M.Map BS.ByteString Int
          processLine st line = case decode' (LB.fromChunks [line]) of
            Nothing                 -> st -- traceShow ("bad line " ++ BSC8.unpack line) st
            Just (Item id ua t k v) -> M.insertWith (+) k 1 st
      let count_by_id = foldl' processLine M.empty lines
      mapM_ print $ M.toList $ M.filter (> 100) count_by_id

On Mon, Jan 30, 2012 at 2:12 PM, Marc Weber
@ Felipe Almeida Lessa (suggesting conduits and attoparsec): I mentioned that I already tried it. Counting lines only was a lot slower than counting lines and parsing JSON using PHP.
Then please take a deeper look into my code. What you said that you've tried is something else. Cheers, -- Felipe.

Using insertWith' gets the time down to 30-40 secs (thus only being 3-4 times slower than PHP). PHP is still at 13 secs, does not require installing libraries, does not require compilation, and is trivial to write. A trivial C++ application takes 11-12 secs, and even with some googling it was trivial to write.

Excerpts from Felipe Almeida Lessa's message of Mon Jan 30 17:36:46 +0100 2012:
Then please take a deeper look into my code. What you said that you've tried is something else.
I didn't say that I tried your code. I gave the enumerator package a try counting lines, which I expected to behave similarly to conduits, because both serve a similar purpose. Then I hit the "sourceFile" returns chunked lines issue (reported it, got fixed) - ....
Anyway: my log files are a JSON dictionary on each line:

    { id : "foo", ... }
    { id : "bar", ... }

Now how do I use the conduit package to split a "chunked" file into lines? Or should I create a new parser "many json >> newline"? Except that I think my processJson for this test should look like this, because I want to count how often the clients queried the server. Probably I should also be using CL.fold as shown in the test cases of conduit. If you tell me how you'd cope with the "one json dict on each line" issue, I'll try to benchmark this solution as well.

    -- probably existing library functions can be used here ..
    processJson :: M.Map T.Text Int -> Value -> M.Map T.Text Int
    processJson m value = case value of
      Ae.Object hash_map -> case HMS.lookup (T.pack "id") hash_map of
        Just id_o -> case id_o of
          Ae.String id -> M.insertWith' (+) id 1 m
          _            -> m
        _ -> m
      _ -> m

Marc Weber

On Tue, Jan 31, 2012 at 6:05 AM, Marc Weber
I didn't say that I tried your code. I gave the enumerator package a try counting lines, which I expected to behave similarly to conduits, because both serve a similar purpose. Then I hit the "sourceFile" returns chunked lines issue (reported it, got fixed) - ....
Anyway: my log files are a JSON dictionary on each line:

    { id : "foo", ... }
    { id : "bar", ... }

Now how do I use the conduit package to split a "chunked" file into lines? Or should I create a new parser "many json >> newline"?
Currently there are two solutions. The first one is what I wrote earlier on this thread:

    jsonLines :: C.Resource m => C.Conduit B.ByteString m Value
    jsonLines = C.sequenceSink () $ do
        val <- CA.sinkParser json'
        CB.dropWhile isSpace_w8
        return $ C.Emit () [val]

This conduit will run the json' parser (from aeson) and then drop any whitespace after that. Note that it will correctly parse all of your files, but will also parse some files that don't conform to your specification. I assume that's fine.

The other solution is going to be released with conduit 0.2, probably today. There's a lines conduit that splits the file into lines, so you could write jsonLines above as:

    mapJson :: C.Resource m => C.Conduit B.ByteString m Value
    mapJson = C.sequenceSink () $ do
        val <- CA.sinkParser json'
        return $ C.Emit () [val]

which doesn't need to care about newlines, and then change main to

    main = do
      ...
      ret <- forM_ fileList $ \fp -> do
        C.runResourceT $
          CB.sourceFile fp
          C.$= CB.lines      -- new line is here
          C.$= mapJson
          C.$= CL.mapM processJson
          C.$$ CL.consume
      print ret

I don't know which solution would be faster. Either way, both solutions will probably be faster with the new conduit 0.2.
Except that I think my processJson for this test should look like this, because I want to count how often the clients queried the server. Probably I should also be using CL.fold as shown in the test cases of conduit. If you tell me how you'd cope with the "one json dict on each line" issue, I'll try to benchmark this solution as well.
This issue was already dealt with in my previous e-mail =).
    -- probably existing library functions can be used here ..
    processJson :: M.Map T.Text Int -> Value -> M.Map T.Text Int
    processJson m value = case value of
      Ae.Object hash_map -> case HMS.lookup (T.pack "id") hash_map of
        Just id_o -> case id_o of
          Ae.String id -> M.insertWith' (+) id 1 m
          _            -> m
        _ -> m
      _ -> m
Looks like the perfect job for CL.fold. Just change those three last lines in main from

    ...
    C.$= CL.mapM processJson
    C.$$ CL.consume

into

    ...
    C.$$ CL.fold processJson M.empty

and you should be ready to go. Cheers!

-- Felipe.

Marc Weber wrote:
Replying to all replies at once:
Malcolm Wallace: At work, we have a strict version of Haskell
:-) which proves that it is worth thinking about.
But doesn't necessarily prove that it's a good idea.
Just (Item id ua t k v) -> M.insertWith (+) k 1 st
Does replacing this by insertWith' help? Roman
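(Sketching the difference Roman points at, under the containers API of the time - insertWith suspends the combining function, insertWith' forces it:)

    import qualified Data.Map as M
    import Data.List (foldl')

    -- Lazy in the combined value: after n occurrences of a key the map
    -- holds a chain (+1) ((+1) (... 1)) of n unevaluated additions.
    countLazy :: [String] -> M.Map String Int
    countLazy = foldl' (\m k -> M.insertWith (+) k 1 m) M.empty

    -- Strict in the combined value: each count is forced as it is
    -- inserted, so the map only ever holds evaluated Ints.
    countStrict :: [String] -> M.Map String Int
    countStrict = foldl' (\m k -> M.insertWith' (+) k 1 m) M.empty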

participants (12)
- Alexander Bernauer
- Austin Seipp
- Chris Wong
- Don Stewart
- Ertugrul Söylemez
- Felipe Almeida Lessa
- Herbert Valerio Riedel
- Jan-Willem Maessen
- Joachim Breitner
- Malcolm Wallace
- Marc Weber
- Roman Leshchinskiy