
Lennart Augustsson wrote
    main = do
        name:_ <- getArgs
        file <- readFile name
        print $ length $ lines file
Given the stance against top-level mutable variables, I had not expected to see this Lazy IO code. After all, what could be more against the spirit of Haskell than a `pure' function with observable side effects? With Lazy IO, one indeed has to choose between correctness and performance. The appearance of such code is especially strange after the evidence of deadlocks with Lazy IO, presented on this list less than a month ago, to say nothing of unpredictable resource usage and the reliance on finalizers to close files (forgetting that GHC does not guarantee that finalizers will be run at all).

Is there an alternative?

    -- Counting the lines in a file
    import IterateeM

    count_nl = liftI $ IE_cont (step 0)
      where
        step acc (Chunk str) = liftI $ IE_cont (step $! acc + count str)
        step acc stream      = liftI $ IE_done acc stream
        count []         = 0
        count ('\n':str) = succ $! count str
        count (_:str)    = count str

    main = do
        name:_ <- getArgs
        IE_done counter _ <- unIM $ enum_file name >. enum_eof ==<< count_nl
        print counter

The function count_nl could have been in the library, but I'm a minimalist. It is written in a declarative rather than imperative style, and one easily sees what it does. The above code as well as the IterateeM library is Haskell98. It does not use any unsafe Haskell functions whatsoever.

    time wc -l /usr/share/dict/words
    235882 /usr/share/dict/words
    real    0m0.024s
    user    0m0.022s
    sys     0m0.000s

    time ~/Docs/papers/DEFUN08/Wc /usr/share/dict/words
    235882
    real    0m0.141s
    user    0m0.126s
    sys     0m0.008s

To compare with lazy IO, the code using readFile gives

    time ~/Docs/papers/DEFUN08/Wc /usr/share/dict/words
    235882
    real    0m0.297s
    user    0m0.262s
    sys     0m0.023s

So, choosing correctness does not mean losing in performance; in fact, one may even gain.

Can enumerators compose? Well, we have already seen the example above (enum_file name >. enum_eof), where the operation (>.)

    e1 >. e2 = (==<<) e2 . e1

is a flipped composition, if monadic bind were considered a flipped application.

Here is a more interesting example: count words in all the files whose names are given on the command line. There may be many files given, thousands of them.

    -- Count the stream. Again, could have been in the library
    stream_count :: Monad m => IterateeGM el m Int
    stream_count = liftI $ IE_cont (step 0)
      where
        step acc (Chunk [])  = liftI $ IE_cont (step acc)
        step acc (Chunk [_]) = liftI $ IE_cont (step $! succ acc)
        step acc (Chunk ls)  = liftI $ IE_cont (step $! acc + length ls)
        step acc stream      = liftI $ IE_done acc stream

    main = do
        names <- getArgs
        let enumerators = foldr (\name -> (enum_file name >.)) enum_eof names
        IE_done (IE_done counter _) _ <- unIM $ enumerators ==<< enum_words stream_count
        print counter

We notice that the composition of enumerators corresponds to the `concatenation' of their sources. Declaratively, the meaning of the above code is:

    -- all the given files are concatenated
    -- the resulting stream of characters is converted to a stream of words
    -- the stream of words is counted.

Operationally, the code does not open more than one file at a time. More importantly, the code *never* reads more than 4096 characters at a time. A block of the file is read, split into words, counted, and only then another chunk is read. After one file is done, it is closed, and another file is processed. One can see that only one file is being opened at a time by enabling traces. The processing is fully incremental.

    /usr/local/share/doc/ghc6> find . -name \*.html -print | time xargs ~/Docs/papers/DEFUN08/Wc
    3043421
    16.99 real 15.83 user 0.71 sys

BTW, the program has counted words in 1169 files.

It is interesting to compare the above main function with the corresponding lazy IO:

    main'' = do
        names <- getArgs
        files <- mapM readFile names
        print $ length $ words (concat files)

The number of lines is comparable. The execution is not. If we try to run the lazy IO code, we get:

    /usr/local/share/doc/ghc6> find . -name \*.html -print | time xargs ~/Docs/papers/DEFUN08/Wc
    Wc: ./libraries/Win32/Graphics-Win32-GDI-Path.html: openFile: resource exhausted (Too many open files)
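
(For readers who do not have the IterateeM source at hand, the following is a tiny, self-contained model of the idea, not Oleg's library: an iteratee is a stream consumer that is either done or waiting for the next chunk, and an enumerator feeds chunks from some source into it. All names here, Stream, Iteratee, countNL, enumList and run, are illustrative only.)

    -- A minimal, standalone model of the iteratee idea (not IterateeM itself).
    data Stream el = Chunk [el] | EOF

    data Iteratee el a = Done a | Cont (Stream el -> Iteratee el a)

    -- Counting newlines, in the style of count_nl above.
    countNL :: Iteratee Char Int
    countNL = Cont (step 0)
      where
        step acc (Chunk str) = Cont (step $! acc + length (filter (== '\n') str))
        step acc EOF         = Done acc

    -- An enumerator over an in-memory list, feeding fixed-size chunks.
    enumList :: Int -> [el] -> Iteratee el a -> Iteratee el a
    enumList _ [] it          = it
    enumList n xs (Cont k)    = enumList n rest (k (Chunk chunk))
      where (chunk, rest) = splitAt n xs
    enumList _ _  it@(Done _) = it

    -- Send EOF and extract the result, if the iteratee finishes.
    run :: Iteratee el a -> Maybe a
    run (Done a) = Just a
    run (Cont k) = case k EOF of
        Done a -> Just a
        Cont _ -> Nothing

    -- Composing enumerators corresponds to concatenating their sources:
    -- run (enumList 4 "ab\ncd\n" (enumList 4 "x\ny\n" countNL)) == Just 4

The real library is monadic (IterateeGM), so that enumerators such as enum_file can perform IO and read 4096-character blocks, but the control flow is the same: the enumerator drives, and the iteratee carries its own state between chunks.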

oleg:
Given the stance against top-level mutable variables, I had not expected to see this Lazy IO code. After all, what could be more against the spirit of Haskell than a `pure' function with observable side effects? With Lazy IO, one indeed has to choose between correctness and performance. The appearance of such code is especially strange after the evidence of deadlocks with Lazy IO, presented on this list less than a month ago, to say nothing of unpredictable resource usage and the reliance on finalizers to close files (forgetting that GHC does not guarantee that finalizers will be run at all).
Is there an alternative?
Hi Oleg! I'm glad you joined the thread at this point. Some background: our best solutions for this problem using lazy IO are based on chunk-wise lazy data structures, typically lazy bytestrings. Often we'll write programs like:

    import qualified Data.ByteString.Lazy.Char8 as B
    import System.Environment

    main = do
        [f] <- getArgs
        s <- B.readFile f
        print (B.count '\n' s)

Which are nicely efficient:

    $ ghc -O2 A.hs --make
    $ du -hs data
    100M    data
    $ time ./A data
    11078540
    ./A data  0.17s user 0.04s system 100% cpu 0.210 total

And we know from elsewhere the performance is highly competitive: http://shootout.alioth.debian.org/gp4/benchmark.php?test=sumcol&lang=all

Now, enumerators are very promising, and there's a lot of interest at the moment (e.g. just this week, Johan Tibell gave an inspiring talk at Galois about this approach to IO, http://www.galois.com/blog/2008/09/12/left-fold-enumerators-a-safe-expressiv... ), and we spent the day sketching out an enumerator bytestring design. But there are some open questions. Perhaps you have some answers?

* Can we write a Data.ByteString.Enumerator that has matching or better performance than its "dual", the existing chunk-wise lazy stream type?

* Is there a translation from

      data ByteString = Empty | Chunk {-# UNPACK #-} !S.ByteString ByteString

  and functions on this type,

      foldlChunks :: (a -> S.ByteString -> a) -> a -> ByteString -> a
      foldlChunks f z = go z
        where go !a Empty        = a
              go !a (Chunk c cs) = go (f a c) cs

  to an enumerator implementation?

* Can we compose enumerators as we can stream functions?

* Can we do fusion on enumerators? Does that make composition easier? (Indeed, is there an encoding of enumerators analogous to stream fusion control?)

Any thoughts?

-- Don
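
(One way to approach the second question: a chunk-wise left fold over a file is already a left-fold enumerator, with the fold function playing the role of a non-monadic iteratee. Below is a minimal sketch under two assumptions, a fixed 4096-byte chunk size and a strict accumulator; enumFileChunks and Fold are illustrative names, not an existing Data.ByteString.Enumerator API, and it says nothing about the fusion and composition questions.)

    import qualified Data.ByteString.Char8 as S
    import System.Environment (getArgs)
    import System.IO

    -- The consumer: a strict left fold over strict chunks, as in foldlChunks.
    type Fold a = a -> S.ByteString -> a

    -- The enumerator: opens the file, pushes 4096-byte chunks through the
    -- fold, and closes the handle before returning.
    enumFileChunks :: FilePath -> Fold a -> a -> IO a
    enumFileChunks path f z0 = withFile path ReadMode (go z0)
      where
        go z h = z `seq` do
            c <- S.hGet h 4096
            if S.null c then return z else go (f z c) h

    -- Counting newlines chunk by chunk, never holding the whole file.
    main :: IO ()
    main = do
        [f] <- getArgs
        n <- enumFileChunks f (\acc c -> acc + S.count '\n' c) 0 :: IO Int
        print n

The fold never sees more than one strict chunk at a time, so space usage stays constant regardless of file size.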

oleg@okmij.org writes:
It is interesting to compare the above main function with the corresponding lazy IO:
Minor point I know, but aren't you really comparing it with the corresponding *strict* IO?
    main'' = do
        names <- getArgs
        files <- mapM readFile names
                 ^^^^
        print $ length $ words (concat files)
This works nicely if you replace the middle line with a lazy version, e.g.:

    files <- mapM (unsafeInterleaveIO . B.readFile) names

-k
-- If I haven't seen further, it is by standing in the footprints of giants
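
(For comparison, Ketil's variant spelled out as a complete program. A sketch only: each file's handle still stays open until its contents have been fully consumed or a finalizer runs, which is exactly the resource behaviour under discussion.)

    import qualified Data.ByteString.Lazy.Char8 as B
    import System.Environment (getArgs)
    import System.IO.Unsafe (unsafeInterleaveIO)

    -- Lazily interleave the reads: no file is opened until its contents
    -- are actually demanded by the word count.
    main :: IO ()
    main = do
        names <- getArgs
        files <- mapM (unsafeInterleaveIO . B.readFile) names
        print (length (B.words (B.concat files)))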

oleg-30 wrote:
I had not expected to see this Lazy IO code. After all, what could be more against the spirit of Haskell than a `pure' function with observable side effects?
What could even be more against the spirit of Haskell than the original theme of this thread, i.e. code that makes us cry? Lennart's piece fudges purity, agreed, but it reads nicely as idiomatic Haskell, swift on the eyes if not on the machine. Consider if readFile's semantics were modified, i.e. not lazy, at least not always. In the ideal world, a smart enough compiler would just do the right thing, i.e. the IO String returned would be strict, or better yet, it would automatically chunkify the read to obtain constant space usage. "Lazy IO" is indeed a nasty can of worms, not unrelated to the issue of monadic IO as a gigantic sin bin. We could avoid it entirely, or we could sort out and algebraize the different interactions into a happier marriage of the pair.
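
(A minimal sketch of the non-lazy readFile Kim-Ee alludes to, assuming that forcing the whole contents before the handle is closed is acceptable; readFileStrict is an illustrative name, not a standard function.)

    import System.IO

    -- Force the entire contents while the handle is still open, so no
    -- lazy IO escapes and the handle can be closed immediately.
    readFileStrict :: FilePath -> IO String
    readFileStrict path = withFile path ReadMode $ \h -> do
        s <- hGetContents h
        length s `seq` return s

The "chunkify automatically for constant space" alternative is essentially what the lazy-bytestring and iteratee approaches elsewhere in this thread provide.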

I agree that lazy IO is a can with some worms in it. But it's not that strange. The readFile operation is in the IO monad, so it has an effect on the world. This effect is not finished when readFile returns, and from the world point of view it's not entirely deterministic.
On Fri, Sep 19, 2008 at 7:51 AM,
Lennart Augustsson wrote
    main = do
        name:_ <- getArgs
        file <- readFile name
        print $ length $ lines file
Given the stance against top-level mutable variables, I had not expected to see this Lazy IO code. After all, what could be more against the spirit of Haskell than a `pure' function with observable side effects? With Lazy IO, one indeed has to choose between correctness and performance. The appearance of such code is especially strange after the evidence of deadlocks with Lazy IO, presented on this list less than a month ago, to say nothing of unpredictable resource usage and the reliance on finalizers to close files (forgetting that GHC does not guarantee that finalizers will be run at all).
Is there an alternative?
    -- Counting the lines in a file
    import IterateeM

    count_nl = liftI $ IE_cont (step 0)
      where
        step acc (Chunk str) = liftI $ IE_cont (step $! acc + count str)
        step acc stream      = liftI $ IE_done acc stream
        count []         = 0
        count ('\n':str) = succ $! count str
        count (_:str)    = count str

    main = do
        name:_ <- getArgs
        IE_done counter _ <- unIM $ enum_file name >. enum_eof ==<< count_nl
        print counter
The function count_nl could have been in the library, but I'm a minimalist. It is written in a declarative rather than imperative style, and one easily sees what it does. The above code as well as the IterateeM library is Haskell98. It does not use any unsafe Haskell functions whatsoever.
    time wc -l /usr/share/dict/words
    235882 /usr/share/dict/words
    real    0m0.024s
    user    0m0.022s
    sys     0m0.000s

    time ~/Docs/papers/DEFUN08/Wc /usr/share/dict/words
    235882
    real    0m0.141s
    user    0m0.126s
    sys     0m0.008s

To compare with lazy IO, the code using readFile gives

    time ~/Docs/papers/DEFUN08/Wc /usr/share/dict/words
    235882
    real    0m0.297s
    user    0m0.262s
    sys     0m0.023s
So, choosing correctness does not mean losing in performance; in fact, one may even gain.
Can enumerators compose? Well, we have already seen the example above (enum_file name >. enum_eof), where the operation (>.)

    e1 >. e2 = (==<<) e2 . e1

is a flipped composition, if monadic bind were considered a flipped application.
Here is a more interesting example: count words in all the files whose names are given on the command line. There may be many files given, thousands of them.
    -- Count the stream. Again, could have been in the library
    stream_count :: Monad m => IterateeGM el m Int
    stream_count = liftI $ IE_cont (step 0)
      where
        step acc (Chunk [])  = liftI $ IE_cont (step acc)
        step acc (Chunk [_]) = liftI $ IE_cont (step $! succ acc)
        step acc (Chunk ls)  = liftI $ IE_cont (step $! acc + length ls)
        step acc stream      = liftI $ IE_done acc stream

    main = do
        names <- getArgs
        let enumerators = foldr (\name -> (enum_file name >.)) enum_eof names
        IE_done (IE_done counter _) _ <- unIM $ enumerators ==<< enum_words stream_count
        print counter
We notice that the composition of enumerators corresponds to the `concatenation' of their sources. Declaratively, the meaning of the above code is:

    -- all the given files are concatenated
    -- the resulting stream of characters is converted to a stream of words
    -- the stream of words is counted.
Operationally, the code does not open more than one file at a time. More importantly, the code *never* reads more than 4096 characters at a time. A block of the file is read, split into words, counted, and only then another chunk is read. After one file is done, it is closed, and another file is processed. One can see that only one file is being opened at a time by enabling traces. The processing is fully incremental.
    /usr/local/share/doc/ghc6> find . -name \*.html -print | time xargs ~/Docs/papers/DEFUN08/Wc
    3043421
    16.99 real 15.83 user 0.71 sys
BTW, the program has counted words in 1169 files.
It is interesting to compare the above main function with the corresponding lazy IO:
    main'' = do
        names <- getArgs
        files <- mapM readFile names
        print $ length $ words (concat files)
The number of lines is comparable. The execution is not. If we try to run the lazy IO code, we get:
    /usr/local/share/doc/ghc6> find . -name \*.html -print | time xargs ~/Docs/papers/DEFUN08/Wc
    Wc: ./libraries/Win32/Graphics-Win32-GDI-Path.html: openFile: resource exhausted (Too many open files)

On Fri, 2008-09-19 at 16:30 +0100, Lennart Augustsson wrote:
I agree that lazy IO is a can with some worms in it. But it's not that strange. The readFile operation is in the IO monad, so it has an effect on the world. This effect is not finished when readFile returns, and from the world point of view it's not entirely deterministic.
On operating systems that fail to maintain file system consistency. Blaming an effect of an *operating system* misfeature on a *language* feature is somewhat perverse. jcc

On Fri, Sep 19, 2008 at 2:51 AM,
Lennart Augustsson wrote
    main = do
        name:_ <- getArgs
        file <- readFile name
        print $ length $ lines file
Given the stance against top-level mutable variables, I had not expected to see this Lazy IO code. After all, what could be more against the spirit of Haskell than a `pure' function with observable side effects? With Lazy IO, one indeed has to choose between correctness and performance. The appearance of such code is especially strange after the evidence of deadlocks with Lazy IO, presented on this list less than a month ago, to say nothing of unpredictable resource usage and the reliance on finalizers to close files (forgetting that GHC does not guarantee that finalizers will be run at all).
Is there an alternative?
    -- Counting the lines in a file
    import IterateeM

    count_nl = liftI $ IE_cont (step 0)
      where
        step acc (Chunk str) = liftI $ IE_cont (step $! acc + count str)
        step acc stream      = liftI $ IE_done acc stream
        count []         = 0
        count ('\n':str) = succ $! count str
        count (_:str)    = count str

    main = do
        name:_ <- getArgs
        IE_done counter _ <- unIM $ enum_file name >. enum_eof ==<< count_nl
        print counter
The function count_nl could have been in the library, but I'm a minimalist. It is written in a declarative rather than imperative style, and one easily sees what it does. The above code as well as the IterateeM library is Haskell98. It does not use any unsafe Haskell functions whatsoever.
Is the IterateeM library available on-line anywhere? I'm familiar enough with your earlier work on enumerators that I can guess most of what that code is doing, but I'd like a better idea of what ==<< does.
--
Dave Menendez

2008/09/18
Operationally, the code does not open more than one file at a time. More importantly, the code *never* reads more than 4096 characters at a time. A block of the file is read, split into words, counted, and only then another chunk is read. After one file is done, it is closed, and another file is processed. One can see that only one file is being opened at a time by enabling traces. The processing is fully incremental.
It opens and closes each file in turn; but would it be unwise to open and close each file as we'd read a chunk from it? This would allow arbitrary interleaving.

-- Jason Dusek

I hate to say it, but you know you can tweak the OS to allow excessive file handle usage. I once wrote a Haskell script to empty a very, very large S3 bucket. On Linux, I had to put it in a shell while loop to keep it going, due to file handle exhaustion. On my Mac it ran without incident.

    :; ulimit
    unlimited

Turns out the `ulimit` on my Mac is pretty high.

-- Jason Dusek

|...tweak the OS...|
http://www.kegel.com/c10k.html#limits.filehandles

Oh, curses. I didn't run it with the right option.

    :; ulimit -a
    core file size        (blocks, -c) 0
    data seg size         (kbytes, -d) 6144
    file size             (blocks, -f) unlimited
    max locked memory     (kbytes, -l) unlimited
    max memory size       (kbytes, -m) unlimited
    open files                    (-n) 256
    pipe size          (512 bytes, -p) 1
    stack size            (kbytes, -s) 8192
    cpu time             (seconds, -t) unlimited
    max user processes            (-u) 266
    virtual memory        (kbytes, -v) unlimited

So now I'm not sure why it worked on my Mac.

-- Jason Dusek

It depends on the underlying file control used by GHC. If it's the FILE stream pointer, some implementations suffer from a 255-file limit. If it's a standard file descriptor (open instead of fopen), then it's limited by ulimit.
On Sun, Apr 5, 2009 at 10:35 AM, Jason Dusek
Oh, curses. I didn't run it with the right option.
    :; ulimit -a
    core file size        (blocks, -c) 0
    data seg size         (kbytes, -d) 6144
    file size             (blocks, -f) unlimited
    max locked memory     (kbytes, -l) unlimited
    max memory size       (kbytes, -m) unlimited
    open files                    (-n) 256
    pipe size          (512 bytes, -p) 1
    stack size            (kbytes, -s) 8192
    cpu time             (seconds, -t) unlimited
    max user processes            (-u) 266
    virtual memory        (kbytes, -v) unlimited
So now I'm not sure why it worked on my Mac.
-- Jason Dusek
-- We can't solve problems by using the same kind of thinking we used when we created them. - A. Einstein
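
(The limit Rick and Jason are discussing can also be inspected from inside the Haskell process, assuming the System.Posix.Resource module from the unix package is available; a sketch:)

    import System.Posix.Resource

    -- Print the soft and hard limits on open file descriptors for this
    -- process, i.e. what `ulimit -n` reports.
    main :: IO ()
    main = do
        limits <- getResourceLimit ResourceOpenFiles
        putStrLn ("open files: soft = " ++ pretty (softLimit limits)
                  ++ ", hard = " ++ pretty (hardLimit limits))
      where
        pretty (ResourceLimit n)     = show n
        pretty ResourceLimitInfinity = "unlimited"
        pretty ResourceLimitUnknown  = "unknown"

As with ulimit, a non-privileged process can raise its soft limit only up to the hard limit, so this mitigates rather than removes the "Too many open files" failure.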

It opens and closes each file in turn; but would it be unwise to open and close each file as we'd read a chunk from it? This would allow arbitrary interleaving.
If I understand you correctly, you are proposing processing several files in parallel, so as to interleave IO. If the `files' in question are communication pipes, or if KAIO (kernel asynchronous IO) is available, it is indeed a good strategy. The last example in the file http://okmij.org/ftp/Haskell/Iteratee/IterateeM.hs (called test_driver_mux) demonstrates how to interleave IO with Iteratees. Iteratees of course do not care how the source data have been obtained, with or without interleaving.

Is the choice of whether or not to open/close with each chunk read something that we can reasonably hide from the I/O API's user? There is at least one way in which it is semantically distinct -- that old trick of opening a tempfile and then unlinking it to hide it. It may be the sort of thing that you do on demand, too -- we have a file handle pool and as we run out of handles we switch to opening/closing. For a single really long read, opening/closing every 4k is just churn; if you're doing thousands of long reads at once, though, it can't be helped.

-- Jason Dusek
participants (9)
- David Menendez
- Don Stewart
- Jason Dusek
- Jonathan Cast
- Ketil Malde
- Kim-Ee Yeoh
- Lennart Augustsson
- oleg@okmij.org
- Rick R