
dagit:
On Thu, Sep 18, 2008 at 12:31 PM, Creighton Hogg wrote:
On Thu, Sep 18, 2008 at 1:55 PM, Don Stewart wrote:
wchogg:
On Thu, Sep 18, 2008 at 1:29 PM, Don Stewart wrote:
<snip>
This makes me cry.
import System.Environment
import qualified Data.ByteString.Lazy.Char8 as B

main = do
    [f] <- getArgs
    s   <- B.readFile f
    print (B.count '\n' s)
Compile it.
$ ghc -O2 --make A.hs
$ time ./A /usr/share/dict/words
52848
./A /usr/share/dict/words  0.00s user 0.00s system 93% cpu 0.007 total
Against standard tools:
$ time wc -l /usr/share/dict/words
52848 /usr/share/dict/words
wc -l /usr/share/dict/words  0.01s user 0.00s system 88% cpu 0.008 total
So both you & Bryan do essentially the same thing, and of course both versions are far better than mine. So the purpose of using the lazy version of ByteString was so that the file is only incrementally loaded by readFile as count processes it?
Yep, that's right.
The streaming nature is implicit in the lazy bytestring. It's kind of the dual of explicit chunkwise control -- chunk processing reified into the data structure.
To ask an overly general question: if lazy bytestrings make a nice vehicle for incremental processing, are there reasons to _not_ reach for them as my default when processing large files?
Yes. The main one is when you "accidentally" force the whole file (or at least large parts of it) into memory at the same time. Profiling and careful programming are the usual workarounds, but in a large application the "careful programming" part can become prohibitively expensive, because of the sometimes subtle ways strictness composes with laziness. This is a symptom of a more general issue: with lazy evaluation it is non-obvious how your program is evaluated at run-time, which makes laziness a double-edged sword at times. I'm not saying get rid of lazy evaluation, but it occasionally causes efficiency problems, and makes those problems harder to diagnose.
The rule seems to be: Write correct code first, fix the problems (usually just inefficiencies) later.
Using lazy bytestrings makes it easier to write concise code that is more easily inspected for correctness. Perhaps such code is even easier to test, but I'm skeptical of that. Thus, I think most people here would agree that reaching first for lazy bytestrings is preferable to other techniques. Plus, one of the most common fixes for inefficient Haskell programs is to make them lazy in the right places and strict in key places, and using lazy bytestrings usually gets you part of the way to that refactoring. The sketch below shows the kind of local strictness fix I mean.
Work on the "dual" of lazy bytestrings -- chunked enumerators -- may lead to more options in this area. The question of how well left-fold enumerators compose remains open (afaik), but we'll see. A rough sketch of the style is below.

-- Don