
Hey Haskell,

So for a fairly inane reason, I ended up taking a couple of minutes and writing a program that would spit out, to the console, the number of lines in a file. Off the top of my head, I came up with this, which worked fine with files that had 100k lines:

import Control.Monad (liftM)
import Control.Monad.State (StateT, execStateT, modify)
import Control.Monad.Trans (lift)
import System.Environment (getArgs)
import System.IO

main = do
    path <- liftM head $ getArgs
    h <- openFile path ReadMode
    n <- execStateT (countLines h) 0
    print n

untilM :: Monad m => (a -> m Bool) -> (a -> m ()) -> a -> m ()
untilM cond action val = do
    truthy <- cond val
    if truthy
        then return ()
        else action val >> untilM cond action val

countLines :: Handle -> StateT Int IO ()
countLines = untilM (\h -> lift $ hIsEOF h)
                    (\h -> do lift $ hGetLine h
                              modify (+1))

If this makes anyone cringe or cry "you're doing it wrong", I'd actually like to hear it. I never really share my projects, so I don't know how idiosyncratic my style is.

On Thu, Sep 18, 2008 at 1:02 PM, Creighton Hogg wrote:
Hey Haskell, So for a fairly inane reason, I ended up taking a couple of minutes and writing a program that would spit out, to the console, the number of lines in a file. Off the top of my head, I came up with this which
Yay, golf! I've loved golf ever since my Perl days. How about this:

main = print . length . lines =<< readFile . head =<< getArgs

Salt with ByteString for extra flavor (and speed).

Kurt

On Thu, Sep 18, 2008 at 10:02 AM, Creighton Hogg wrote:
If this makes anyone cringe or cry "you're doing it wrong", I'd actually like to hear it.
Yes, that made me cry :-) Your code seems very convoluted, and quite successfully hides what it's really trying to do. Here's a version that is rather more concise, and which will be far faster besides:

import qualified Data.ByteString.Lazy.Char8 as B
import System.Environment

main = mapM_ ((print =<<) . fmap (B.count '\n') . B.readFile) =<< getArgs

wchogg:
Hey Haskell, So for a fairly inane reason, I ended up taking a couple of minutes and writing a program that would spit out, to the console, the number of lines in a file. Off the top of my head, I came up with this which worked fine with files that had 100k lines:
main = do
    path <- liftM head $ getArgs
    h <- openFile path ReadMode
    n <- execStateT (countLines h) 0
    print n

untilM :: Monad m => (a -> m Bool) -> (a -> m ()) -> a -> m ()
untilM cond action val = do
    truthy <- cond val
    if truthy
        then return ()
        else action val >> untilM cond action val

countLines :: Handle -> StateT Int IO ()
countLines = untilM (\h -> lift $ hIsEOF h)
                    (\h -> do lift $ hGetLine h
                              modify (+1))
If this makes anyone cringe or cry "you're doing it wrong", I'd actually like to hear it. I never really share my projects, so I don't know how idiosyncratic my style is.
This makes me cry.

import System.Environment
import qualified Data.ByteString.Lazy.Char8 as B

main = do
    [f] <- getArgs
    s <- B.readFile f
    print (B.count '\n' s)

Compile it.

$ ghc -O2 --make A.hs

$ time ./A /usr/share/dict/words
52848
./A /usr/share/dict/words  0.00s user 0.00s system 93% cpu 0.007 total

Against standard tools:

$ time wc -l /usr/share/dict/words
52848 /usr/share/dict/words
wc -l /usr/share/dict/words  0.01s user 0.00s system 88% cpu 0.008 total

-- Don

On Thu, Sep 18, 2008 at 1:29 PM, Don Stewart wrote:
wchogg:
Hey Haskell, So for a fairly inane reason, I ended up taking a couple of minutes and writing a program that would spit out, to the console, the number of lines in a file. Off the top of my head, I came up with this which worked fine with files that had 100k lines:
main = do
    path <- liftM head $ getArgs
    h <- openFile path ReadMode
    n <- execStateT (countLines h) 0
    print n

untilM :: Monad m => (a -> m Bool) -> (a -> m ()) -> a -> m ()
untilM cond action val = do
    truthy <- cond val
    if truthy
        then return ()
        else action val >> untilM cond action val

countLines :: Handle -> StateT Int IO ()
countLines = untilM (\h -> lift $ hIsEOF h)
                    (\h -> do lift $ hGetLine h
                              modify (+1))
If this makes anyone cringe or cry "you're doing it wrong", I'd actually like to hear it. I never really share my projects, so I don't know how idiosyncratic my style is.
This makes me cry.
import System.Environment
import qualified Data.ByteString.Lazy.Char8 as B

main = do
    [f] <- getArgs
    s <- B.readFile f
    print (B.count '\n' s)
Compile it.
$ ghc -O2 --make A.hs
$ time ./A /usr/share/dict/words
52848
./A /usr/share/dict/words  0.00s user 0.00s system 93% cpu 0.007 total
Against standard tools:
$ time wc -l /usr/share/dict/words
52848 /usr/share/dict/words
wc -l /usr/share/dict/words  0.01s user 0.00s system 88% cpu 0.008 total
So both you & Bryan do essentially the same thing, and of course both versions are far better than mine. So the purpose of using the lazy version of ByteString is that the file is only loaded incrementally by readFile as count processes it?

wchogg:
On Thu, Sep 18, 2008 at 1:29 PM, Don Stewart wrote:
wchogg:
Hey Haskell, So for a fairly inane reason, I ended up taking a couple of minutes and writing a program that would spit out, to the console, the number of lines in a file. Off the top of my head, I came up with this which worked fine with files that had 100k lines:
main = do
    path <- liftM head $ getArgs
    h <- openFile path ReadMode
    n <- execStateT (countLines h) 0
    print n

untilM :: Monad m => (a -> m Bool) -> (a -> m ()) -> a -> m ()
untilM cond action val = do
    truthy <- cond val
    if truthy
        then return ()
        else action val >> untilM cond action val

countLines :: Handle -> StateT Int IO ()
countLines = untilM (\h -> lift $ hIsEOF h)
                    (\h -> do lift $ hGetLine h
                              modify (+1))
If this makes anyone cringe or cry "you're doing it wrong", I'd actually like to hear it. I never really share my projects, so I don't know how idiosyncratic my style is.
This makes me cry.
import System.Environment
import qualified Data.ByteString.Lazy.Char8 as B

main = do
    [f] <- getArgs
    s <- B.readFile f
    print (B.count '\n' s)
Compile it.
$ ghc -O2 --make A.hs
$ time ./A /usr/share/dict/words
52848
./A /usr/share/dict/words  0.00s user 0.00s system 93% cpu 0.007 total
Against standard tools:
$ time wc -l /usr/share/dict/words
52848 /usr/share/dict/words
wc -l /usr/share/dict/words  0.01s user 0.00s system 88% cpu 0.008 total
So both you & Bryan do essentially the same thing, and of course both versions are far better than mine. So the purpose of using the lazy version of ByteString is that the file is only loaded incrementally by readFile as count processes it?
Yep, that's right. The streaming nature is implicit in the lazy bytestring. It's kind of the dual of explicit chunkwise control -- chunk processing reified into the data structure.

-- Don
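The chunked structure Don describes is visible in the type itself. Give or take an UNPACK pragma, this is how Data.ByteString.Lazy.Internal defines it:

data ByteString = Empty
                | Chunk !S.ByteString ByteString  -- one strict chunk, then the lazy rest

A consumer that pattern-matches on Chunk touches one strict buffer at a time while the tail is produced on demand -- the chunk processing "reified into the data structure", as Don puts it.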

On Thu, Sep 18, 2008 at 1:55 PM, Don Stewart wrote:
wchogg:
On Thu, Sep 18, 2008 at 1:29 PM, Don Stewart wrote:
<snip>
This makes me cry.
import System.Environment
import qualified Data.ByteString.Lazy.Char8 as B

main = do
    [f] <- getArgs
    s <- B.readFile f
    print (B.count '\n' s)
Compile it.
$ ghc -O2 --make A.hs
$ time ./A /usr/share/dict/words
52848
./A /usr/share/dict/words  0.00s user 0.00s system 93% cpu 0.007 total
Against standard tools:
$ time wc -l /usr/share/dict/words
52848 /usr/share/dict/words
wc -l /usr/share/dict/words  0.01s user 0.00s system 88% cpu 0.008 total
So both you & Bryan do essentially the same thing, and of course both versions are far better than mine. So the purpose of using the lazy version of ByteString is that the file is only loaded incrementally by readFile as count processes it?
Yep, that's right
The streaming nature is implicit in the lazy bytestring. It's kind of the dual of explicit chunkwise control -- chunk processing reified into the data structure.
To ask an overly general question: if lazy bytestring makes a nice provider for incremental processing, are there reasons to _not_ reach for that as my default when processing large files?

wchogg:
To ask an overly general question: if lazy bytestring makes a nice provider for incremental processing, are there reasons to _not_ reach for that as my default when processing large files?
At the moment, it would always be my first choice. Consider:

http://shootout.alioth.debian.org/gp4/benchmark.php?test=sumcol&lang=all

-- Don

On Thu, Sep 18, 2008 at 12:31 PM, Creighton Hogg wrote:
On Thu, Sep 18, 2008 at 1:55 PM, Don Stewart wrote:
wchogg:
On Thu, Sep 18, 2008 at 1:29 PM, Don Stewart wrote:
<snip>
This makes me cry.
import System.Environment
import qualified Data.ByteString.Lazy.Char8 as B

main = do
    [f] <- getArgs
    s <- B.readFile f
    print (B.count '\n' s)
Compile it.
$ ghc -O2 --make A.hs
$ time ./A /usr/share/dict/words
52848
./A /usr/share/dict/words  0.00s user 0.00s system 93% cpu 0.007 total
Against standard tools:
$ time wc -l /usr/share/dict/words
52848 /usr/share/dict/words
wc -l /usr/share/dict/words  0.01s user 0.00s system 88% cpu 0.008 total
So both you & Bryan do essentially the same thing, and of course both versions are far better than mine. So the purpose of using the lazy version of ByteString is that the file is only loaded incrementally by readFile as count processes it?
Yep, that's right
The streaming nature is implicit in the lazy bytestring. It's kind of the dual of explicit chunkwise control -- chunk processing reified into the data structure.
To ask an overly general question: if lazy bytestring makes a nice provider for incremental processing, are there reasons to _not_ reach for that as my default when processing large files?
Yes. The main time is when you "accidentally" force the whole file (or at least large parts of it) into memory at the same time. Profiling and careful programming seem to be the workarounds, but in a large application the "careful programming" part can become prohibitively expensive. This is due to the sometimes subtle nature of how strictness composes with laziness. It is the result of a more general issue: thanks to lazy evaluation, it is non-obvious how your program is evaluated at run-time, which makes lazy evaluation act as a double-edged sword at times. I'm not saying get rid of lazy eval, but occasionally it presents problems for efficiency and for diagnosing efficiency problems.

The rule seems to be: Write correct code first, fix the problems (usually just inefficiencies) later.

Using lazy bytestrings makes it easier to write concise code that is more easily inspected for correctness. Perhaps it is even easier to test such code, but I'm skeptical of that. Thus, I think most people here would agree that reaching first for lazy ByteString is preferred over other techniques. Plus, one of the most common fixes to inefficient Haskell programs is to make them lazy in the right places and strict in key places, and using lazy ByteString will usually get you part of the way to that refactoring.

Jason
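A concrete sketch of the "accidental forcing" Jason describes (the file name is invented): holding a second reference to the lazy string across one traversal pins every chunk for the next one.

import qualified Data.ByteString.Lazy.Char8 as B

-- Hypothetical pitfall: two passes over the same lazy ByteString.
-- `s` is still live during the first pass because the second pass
-- needs it, so none of the chunks can be garbage collected and the
-- whole file ends up resident despite the lazy type.
main :: IO ()
main = do
    s <- B.readFile "big.log"   -- "big.log" is an invented example file
    print (B.count '\n' s)      -- first pass would stream, but...
    print (B.length s)          -- ...this retained second pass forces full residency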

dagit:
On Thu, Sep 18, 2008 at 12:31 PM, Creighton Hogg wrote:
On Thu, Sep 18, 2008 at 1:55 PM, Don Stewart wrote:
wchogg:
On Thu, Sep 18, 2008 at 1:29 PM, Don Stewart wrote:
<snip>
This makes me cry.
import System.Environment
import qualified Data.ByteString.Lazy.Char8 as B

main = do
    [f] <- getArgs
    s <- B.readFile f
    print (B.count '\n' s)
Compile it.
$ ghc -O2 --make A.hs
$ time ./A /usr/share/dict/words
52848
./A /usr/share/dict/words  0.00s user 0.00s system 93% cpu 0.007 total
Against standard tools:
$ time wc -l /usr/share/dict/words
52848 /usr/share/dict/words
wc -l /usr/share/dict/words  0.01s user 0.00s system 88% cpu 0.008 total
So both you & Bryan do essentially the same thing, and of course both versions are far better than mine. So the purpose of using the lazy version of ByteString is that the file is only loaded incrementally by readFile as count processes it?
Yep, that's right
The streaming nature is implicit in the lazy bytestring. It's kind of the dual of explicit chunkwise control -- chunk processing reified into the data structure.
To ask an overly general question: if lazy bytestring makes a nice provider for incremental processing, are there reasons to _not_ reach for that as my default when processing large files?
Yes. The main time is when you "accidentally" force the whole file (or at least large parts of it) into memory at the same time. Profiling and careful programming seem to be the workarounds, but in a large application the "careful programming" part can become prohibitively expensive. This is due to the sometimes subtle nature of how strictness composes with laziness. It is the result of a more general issue: thanks to lazy evaluation, it is non-obvious how your program is evaluated at run-time, which makes lazy evaluation act as a double-edged sword at times. I'm not saying get rid of lazy eval, but occasionally it presents problems for efficiency and for diagnosing efficiency problems.
The rule seems to be: Write correct code first, fix the problems (usually just inefficiencies) later.
Using lazy bytestrings makes it easier to write concise code that is more easily inspected for correctness. Perhaps it is even easier to test such code, but I'm skeptical of that. Thus, I think most people here would agree that reaching first for lazy ByteString is preferred over other techniques. Plus, one of the most common fixes to inefficient Haskell programs is to make them lazy in the right places and strict in key places, and using lazy ByteString will usually get you part of the way to that refactoring.
Work on the "dual" of lazy bytestrings -- chunked enumerators -- may lead to more options in this area. The question of compositionality of left-fold enumerators remains (afaik), but we'll see. -- Don
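To make the idea concrete, here is a minimal sketch of a left-fold enumerator (my own simplification with invented names, not the full iteratee design): the enumerator owns the read loop and feeds strict chunks to a caller-supplied step function.

import qualified Data.ByteString as S
import System.Environment (getArgs)
import System.IO

-- A minimal left-fold enumerator: enumFile controls the traversal;
-- the caller supplies only the step function and the seed.
enumFile :: FilePath -> (a -> S.ByteString -> a) -> a -> IO a
enumFile path step z = do
    h <- openFile path ReadMode
    let go acc = do
            chunk <- S.hGet h 65536          -- 64k chunks, an arbitrary size
            if S.null chunk
                then hClose h >> return acc
                else let acc' = step acc chunk
                     in acc' `seq` go acc'   -- keep the accumulator strict
    go z

-- Example: count newline bytes (10 == '\n') in constant space.
main :: IO ()
main = do
    [f] <- getArgs
    print =<< enumFile f (\n c -> n + S.count 10 c) (0 :: Int)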

"Creighton Hogg"
To ask an overly general question: if lazy bytestring makes a nice provider for incremental processing, are there reasons to _not_ reach for that as my default when processing large files?
I think it is a nice default. I'd reach for strict bytestrings if I know the file will be processed in a strict manner (not a single-pass stream-through) and I just have to have the last few percent of speedup. I'll use [String] only for small examples, where the extra imports cost more than the performance loss.

-k

--
If I haven't seen further, it is by standing in the footprints of giants
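For comparison, the strict-bytestring variant of the counting program is a one-import swap (a sketch; it slurps the whole file into a single buffer before counting):

import qualified Data.ByteString.Char8 as S
import System.Environment (getArgs)

-- Strict version: readFile pulls the entire file into one contiguous
-- buffer up front, then count scans it. Simple and fast when the file
-- comfortably fits in memory.
main :: IO ()
main = do
    [f] <- getArgs
    s <- S.readFile f
    print (S.count '\n' s)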

On Thu, Sep 18, 2008 at 1:55 PM, Don Stewart wrote:
wchogg:
On Thu, Sep 18, 2008 at 1:29 PM, Don Stewart wrote:
wchogg:
Hey Haskell, So for a fairly inane reason, I ended up taking a couple of minutes and writing a program that would spit out, to the console, the number of lines in a file. Off the top of my head, I came up with this which worked fine with files that had 100k lines:
main = do
    path <- liftM head $ getArgs
    h <- openFile path ReadMode
    n <- execStateT (countLines h) 0
    print n

untilM :: Monad m => (a -> m Bool) -> (a -> m ()) -> a -> m ()
untilM cond action val = do
    truthy <- cond val
    if truthy
        then return ()
        else action val >> untilM cond action val

countLines :: Handle -> StateT Int IO ()
countLines = untilM (\h -> lift $ hIsEOF h)
                    (\h -> do lift $ hGetLine h
                              modify (+1))
If this makes anyone cringe or cry "you're doing it wrong", I'd actually like to hear it. I never really share my projects, so I don't know how idiosyncratic my style is.
This makes me cry.
import System.Environment
import qualified Data.ByteString.Lazy.Char8 as B

main = do
    [f] <- getArgs
    s <- B.readFile f
    print (B.count '\n' s)
Compile it.
$ ghc -O2 --make A.hs
$ time ./A /usr/share/dict/words
52848
./A /usr/share/dict/words  0.00s user 0.00s system 93% cpu 0.007 total
Against standard tools:
$ time wc -l /usr/share/dict/words
52848 /usr/share/dict/words
wc -l /usr/share/dict/words  0.01s user 0.00s system 88% cpu 0.008 total
So both you & Bryan do essentially the same thing, and of course both versions are far better than mine. So the purpose of using the lazy version of ByteString is that the file is only loaded incrementally by readFile as count processes it?
Yep, that's right
The streaming nature is implicit in the lazy bytestring. It's kind of the dual of explicit chunkwise control -- chunk processing reified into the data structure.
Hi Don,

I have a bit more of a followup, actually. You make use of the built-in ByteString consumer count, which itself is built upon the foldlChunks function that is only exported from Data.ByteString.Lazy.Internal. If I want to make my own efficient ByteString consumer, is that what I need to use in order to preserve the inherent laziness of the data structure?

Also, I feel a little at a loss for how to make a good ByteString producer for efficiently _writing_ large swaths of data via writeFile. Would it be possible to whip up a small example?

Oh, and lastly, I apologize to both you & Bryan for making you cry. I hope you can forgive my cruelty.

Thanks,
Creighton

wchogg:
Hi Don, I have a bit more of a followup, actually. You make use of the built-in ByteString consumer count, which itself is built upon the foldlChunks function that is only exported from Data.ByteString.Lazy.Internal. If I want to make my own efficient ByteString consumer, is that what I need to use in order to preserve the inherent laziness of the data structure?
You can get foldlChunks from Data.ByteString.Lazy.Internal, or write your own chunk folder.
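Both options might look something like this sketch (counting newlines once more, purely for illustration; 10 is the byte value of '\n'):

{-# LANGUAGE BangPatterns #-}
import qualified Data.ByteString as S
import qualified Data.ByteString.Lazy as L
import Data.ByteString.Lazy.Internal (ByteString(..), foldlChunks)

-- Option 1: reuse foldlChunks, which walks the strict chunks in order.
countNewlines :: L.ByteString -> Int
countNewlines = foldlChunks (\n c -> n + S.count 10 c) 0

-- Option 2: a hand-rolled chunk folder over the Empty/Chunk spine.
countNewlines' :: L.ByteString -> Int
countNewlines' = go 0
  where
    go !n Empty        = n
    go !n (Chunk c cs) = go (n + S.count 10 c) cs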
Also, I feel a little at a loss for how to make a good bytestring producer for efficiently _writing_ large swaths of data via writeFile. Would it be possible to whip up a small example?
Using unfoldr? Or Data.Binary?
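One way to flesh out the unfoldr suggestion (a sketch; the file name and sizes are invented): because unfoldr produces the lazy string incrementally and writeFile consumes it the same way, the whole pipeline runs in constant memory.

import qualified Data.ByteString.Lazy.Char8 as B

-- Produce ~8 MB of 79-character lines lazily; nothing is materialized
-- up front, and B.writeFile streams the chunks straight to disk.
payload :: B.ByteString
payload = B.unfoldr step 0
  where
    total = 8 * 1024 * 1024 :: Int
    step n
        | n >= total       = Nothing
        | n `mod` 80 == 79 = Just ('\n', n + 1)
        | otherwise        = Just ('x', n + 1)

main :: IO ()
main = B.writeFile "big.txt" payload   -- "big.txt" is an invented name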
Oh, and lastly, I apologize to both you & Bryan for making you cry. I hope you can forgive my cruelty.
:)

Don Stewart
If I want to make my own efficient bytestring consumer, is that what I need to use in order to preserve the inherent laziness of the datastructure?
You can get foldlChunks from Data.ByteString.Lazy.Internal, or write your own chunk folder.
IME you can also get by nicely using the standard list-alikes: uncons, head, tail, take, drop...

-k

--
If I haven't seen further, it is by standing in the footprints of giants
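For instance, a line counter written against nothing but the list-alike API might look like this sketch; the strict accumulator matters, or thunks pile up:

{-# LANGUAGE BangPatterns #-}
import qualified Data.ByteString.Lazy.Char8 as B

-- uncons peels off one Char at a time, yet only one strict chunk of
-- the underlying lazy ByteString needs to be in memory at once.
countLines :: B.ByteString -> Int
countLines = go 0
  where
    go !n s = case B.uncons s of
        Nothing        -> n
        Just ('\n', t) -> go (n + 1) t
        Just (_,    t) -> go n t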

On Friday, 19 September 2008 at 03:14, Robert Greayer wrote:
--- On Thu, 9/18/08, Creighton Hogg wrote:
If this makes anyone cringe or cry "you're doing it wrong", I'd actually like to hear it.
Just to make everyone cry:
main = getArgs >>= \(x:_) -> system ("wc -l " ++ x)
Ouch!

On Fri, 2008-09-19 at 06:38 +0200, Daniel Fischer wrote:
On Friday, 19 September 2008 at 03:14, Robert Greayer wrote:
--- On Thu, 9/18/08, Creighton Hogg wrote:
If this makes anyone cringe or cry "you're doing it wrong", I'd actually like to hear it.
Just to make everyone cry:
main = getArgs >>= \(x:_) -> system ("wc -l " ++ x)
Ouch!
Indeed.

main = getArgs >>= system . ("wc -l "++) . head

Without any fancy byte strings:
main = do
name:_ <- getArgs
file <- readFile name
print $ length $ lines file
On Thu, Sep 18, 2008 at 6:02 PM, Creighton Hogg wrote:
Hey Haskell, So for a fairly inane reason, I ended up taking a couple of minutes and writing a program that would spit out, to the console, the number of lines in a file. Off the top of my head, I came up with this which worked fine with files that had 100k lines:
main = do
    path <- liftM head $ getArgs
    h <- openFile path ReadMode
    n <- execStateT (countLines h) 0
    print n

untilM :: Monad m => (a -> m Bool) -> (a -> m ()) -> a -> m ()
untilM cond action val = do
    truthy <- cond val
    if truthy
        then return ()
        else action val >> untilM cond action val

countLines :: Handle -> StateT Int IO ()
countLines = untilM (\h -> lift $ hIsEOF h)
                    (\h -> do lift $ hGetLine h
                              modify (+1))
If this makes anyone cringe or cry "you're doing it wrong", I'd actually like to hear it. I never really share my projects, so I don't know how idiosyncratic my style is.
participants (10)
- Bryan O'Sullivan
- Creighton Hogg
- Daniel Fischer
- Derek Elkins
- Don Stewart
- Jason Dagit
- Ketil Malde
- Kurt Hutchinson
- Lennart Augustsson
- Robert Greayer