[Haskell-cafe] ANNOUNCE: enumerator, an alternative iteratee package

Most of you have probably read Oleg's essays on using left-fold enumerators for incremental IO. In short, by encapsulating monadic left-folds in an "Iteratee" type, incremental pure processing is possible without using lazy IO. Sources to read:

Oleg: Streams and Iteratees < http://okmij.org/ftp/Streams.html >
Magnus Therning: Trying to work out iteratees < http://therning.org/magnus/archives/735 >
cdsmith: Iteratees Step By Step (Part 1) < http://cdsmith.wordpress.com/2010/05/23/iteratees-step-by-step-part-1/ >
John Millikin (me): Understanding Iteratees < http://ianen.org/articles/understanding-iteratees/ >

Currently, the primary package for left-fold enumerators is John Lato's "iteratee". It is based on Oleg's original code, extended to support various forms of containers, platform-specific IO, and codecs for the WAV and TIFF formats. While I appreciate Mr. Lato's development of the package, I find it far too large, and its documentation too sparse, to use effectively. To correct this, I've written the "enumerator" package. It is also derived from Oleg's IterateeM.hs, but with a simplified API and a significantly reduced dependency list.

Hackage entry: http://hackage.haskell.org/package/enumerator
Haddock docs: http://ianen.org/haskell/enumerator/api-docs/
Source code (literate PDF): http://ianen.org/haskell/enumerator/enumerator.pdf

darcs get http://ianen.org/haskell/enumerator/

Additionally, I've included examples of using enumerators to implement simplified versions of the "cat" and "wc" utilities. These should serve as a useful starting point for anybody who wants to use enumerators in their own code:

http://patch-tag.com/r/jmillikin/enumerator/snapshot/current/content/pretty/...
http://patch-tag.com/r/jmillikin/enumerator/snapshot/current/content/pretty/...

There are already a few libraries using the existing "iteratee" package (snap, attoparsec-iteratee, hexpat-iteratee); I am very interested in advice from the authors of these libraries. In particular, are any of the removed features (ListLike, WrappedByteString, seeking) something your libraries depend on? Are there any useful combinators you'd like to see included?

On 19 August 2010 14:31, John Millikin
Currently, the primary package for left-fold enumerators is John Lato's "iteratee". It is based on Oleg's original code, extended to support various forms of containers, platform-specific IO, and codecs for the WAV and TIFF formats.
While I appreciate Mr. Lato's development of the package, I find it far too large, and its documentation too sparse, to use effectively. To correct this, I've written the "enumerator" package. It is also derived from Oleg's IterateeM.hs, but with a simplified API and a significantly reduced dependency list.
For those of us not that familiar with the iteratee library, can you provide an example of how your enumerator library is simpler? Do you mean that you have a smaller exposed API, or that it has more in-built combinators, etc.? -- Ivan Lazar Miljenovic Ivan.Miljenovic@gmail.com IvanMiljenovic.wordpress.com

On Wed, Aug 18, 2010 at 21:35, Ivan Lazar Miljenovic
For those of us not that familiar with the iteratee library, can you provide an example of how your enumerator library is simpler? Do you mean that you have a smaller exposed API, or that it has more in-built combinators, etc.?
It has 2 exposed modules (vs 14), does not include TIFF or WAV parsers, uses [] instead of ListLike, and discards StreamChunk in favor of just "data Stream a = Chunks [a] | EOF". It has three dependencies (base, bytestring, transformers) rather than roughly 20. Currently mine probably has fewer built-in combinators, because it has fewer of everything, but if it's missing anything somebody uses, I will be happy to add it. Most of the functions available in Oleg's original code (and thus in "iteratee") appear to exist for instructive purposes rather than practical use.
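For concreteness, the core types look roughly like this (a sketch assembled from the names used in this thread; the 'Error' constructor and the record accessor are my guesses at the details):

--------------------------------------------------------------------------------
data Stream a = Chunks [a] | EOF

data Step e a m b
    = Continue (Stream a -> Iteratee e a m b)  -- needs more input
    | Yield b (Stream a)                       -- finished, with leftover input
    | Error e                                  -- failed

newtype Iteratee e a m b = Iteratee { runIteratee :: m (Step e a m b) }
--------------------------------------------------------------------------------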

John,
This package looks very promising. I used iteratee for the yaml package, but
I had many of the concerns that you have mentioned below. Version 0.2 of
persistent is going to have some form of an enumerator interface for getting
the results of a query, and I eventually decided that iteratee was
introducing too much complexity to be a good candidate. However, I was able
to port the package[1] over to enumerator in about half an hour; I
especially benefited from your example applications.
The only concern that I had was the possible inefficiency of representing
all chunks as a list. In the case of persistent, the enumerator will
*always* generate a one-element list, and the most common operation is
selectList, which returns all results as a list. If I used your consume
function, I believe there would be a *lot* of list traversals. Instead,
selectList[2] uses ([a] -> [a]) for building up the result internally. I
haven't really thought the issue through fully, so I can't recommend anything
better. Perhaps more importantly, the simplification introduced by just
dealing with lists is well received.
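For readers unfamiliar with the ([a] -> [a]) trick: it's the standard difference-list encoding, where appending becomes function composition. A minimal sketch (the names here are illustrative, not taken from persistent):

--------------------------------------------------------------------------------
type DList a = [a] -> [a]

-- appending a chunk is O(1): just compose another function
append :: DList a -> [a] -> DList a
append acc xs = acc . (xs ++)

-- the real list is materialized once, at the very end
toList :: DList a -> [a]
toList acc = acc []
--------------------------------------------------------------------------------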
Keep up the good work, I look forward to seeing more about enumerator.
Michael
[1] http://github.com/snoyberg/persistent/tree/enumerator
[2]
http://github.com/snoyberg/persistent/blob/enumerator/Database/Persist/Base....
On Thu, Aug 19, 2010 at 7:31 AM, John Millikin
Most of you have probably read Oleg's essays on using left-fold enumerators for incremental IO. In short, by encapsulating monadic left-folds in an "Iteratee" type, incremental pure processing is possible without using lazy IO. Sources to read:
Oleg: Streams and Iteratees < http://okmij.org/ftp/Streams.html > Magnus Therning: Trying to work out iteratees < http://therning.org/magnus/archives/735 > cdsmith: Iteratees Step By Step (Part 1) < http://cdsmith.wordpress.com/2010/05/23/iteratees-step-by-step-part-1/ >
John Millikin (me): Understanding Iteratees < http://ianen.org/articles/understanding-iteratees/ >
Currently, the primary package for left-fold enumerators is John Lato's "iteratee". It is based on Oleg's original code, extended to support various forms of containers, platform-specific IO, and codecs for the WAV and TIFF formats.
While I appreciate Mr. Lato's development of the package, I find it far too large, and its documentation too sparse, to use effectively. To correct this, I've written the "enumerator" package. It is also derived from Oleg's IterateeM.hs, but with a simplified API and a significantly reduced dependency list.
Hackage entry: http://hackage.haskell.org/package/enumerator
Haddock docs: http://ianen.org/haskell/enumerator/api-docs/
Source code (literate PDF): http://ianen.org/haskell/enumerator/enumerator.pdf
darcs get http://ianen.org/haskell/enumerator/
Additionally, I've included examples of using enumerators to implement simplified versions of the "cat" and "wc" utilities. These should serve as a useful starting point for anybody who wants to use enumerators in their own code:
http://patch-tag.com/r/jmillikin/enumerator/snapshot/current/content/pretty/...
http://patch-tag.com/r/jmillikin/enumerator/snapshot/current/content/pretty/...
There are already a few libraries using the existing "iteratee" package (snap, attoparsec-iteratee, hexpat-iteratee); I am very interested in advice from the authors of these libraries. In particular, are any of the removed features (ListLike, WrappedByteString, seeking) something your libraries depend on? Are there any useful combinators you'd like to see included?

I'm glad to hear you found it useful!
For implementing selectList, have you considered using the "consume"
iteratee? I suspect it would simplify your code to something like:
selectList a b c d = do
    res <- run $ select a b c d ==<< consume
    case res of
        Left e -> error e
        Right x -> return x
You might also want to look at the iteratee combinators (returnI,
yield, continue), which would reduce some boilerplate, eg:
"Iteratee $ return $ Continue k" becomes "continue k"
"Iteratee $ return $ Yield x EOF" becomes "yield x EOF"
On Wed, Aug 18, 2010 at 22:24, Michael Snoyman
John, This package looks very promising. I used iteratee for the yaml package, but I had many of the concerns that you have mentioned below. Version 0.2 of persistent is going to have some form of an enumerator interface for getting the results of a query, and I eventually decided that iteratee was introducing too much complexity to be a good candidate. However, I was able to port the package[1] over to enumerator in about half an hour; I especially benefited from your example applications.
The only concern that I had was the possible inefficiency of representing all chunks as a list. In the case of persistent, the enumerator will *always* generate a one-element list, and the most common operation is selectList, which returns all results as a list. If I used your consume function, I believe there would be a *lot* of list traversals. Instead, selectList[2] uses ([a] -> [a]) for building up the result internally. I haven't really thought the issue through fully, so I can't recommend anything better. Perhaps more importantly, the simplification introduced by just dealing with lists is well received. Keep up the good work, I look forward to seeing more about enumerator. Michael [1] http://github.com/snoyberg/persistent/tree/enumerator [2] http://github.com/snoyberg/persistent/blob/enumerator/Database/Persist/Base....

I'd mentioned in my previous email that I was concerned that "consume" would
traverse the entire list for each new record pulled from the database. I'll
try to start using the combinators instead; to get started with, it was
easier to just use the raw datatypes so I could more easily see what was happening.
Would this version of consume perhaps be better:
consume :: Monad m => Iteratee e a m [a]
consume = liftI $ step id where
    step acc chunk = case chunk of
        Chunks [] -> Continue $ returnI . step acc
        Chunks xs -> Continue $ returnI . (step $ acc . (xs ++))
        EOF -> Yield (acc []) EOF
Michael
On Thu, Aug 19, 2010 at 8:40 AM, John Millikin
I'm glad to hear you found it useful!
For implementing selectList, have you considered using the "consume" iteratee? I suspect it would simplify your code to something like:
selectList a b c d = do
    res <- run $ select a b c d ==<< consume
    case res of
        Left e -> error e
        Right x -> return x
You might also want to look at the iteratee combinators (returnI, yield, continue), which would reduce some boilerplate, eg:
"Iteratee $ return $ Continue k" becomes "continue k" "Iteratee $ return $ Yield x EOF" becomes "yield x EOF"
On Wed, Aug 18, 2010 at 22:24, Michael Snoyman
wrote: John, This package looks very promising. I used iteratee for the yaml package, but I had many of the concerns that you have mentioned below. Version 0.2 of persistent is going to have some form of an enumerator interface for getting the results of a query, and I eventually decided that iteratee was introducing too much complexity to be a good candidate. However, I was able to port the package[1] over to enumerator in about half an hour; I especially benefited from your example applications.
The only concern that I had was the possible inefficiency of representing all chunks as a list. In the case of persistent, the enumerator will *always* generate a one-element list, and the most common operation is selectList, which returns all results as a list. If I used your consume function, I believe there would be a *lot* of list traversals. Instead, selectList[2] uses ([a] -> [a]) for building up the result internally. I haven't really thought the issue through fully, so I can't recommend anything better. Perhaps more importantly, the simplification introduced by just dealing with lists is well received. Keep up the good work, I look forward to seeing more about enumerator. Michael [1] http://github.com/snoyberg/persistent/tree/enumerator [2] http://github.com/snoyberg/persistent/blob/enumerator/Database/Persist/Base....

Ah, I missed that sentence. You're correct, a version of consume based
on CPS would probably perform better than Oleg's implementation.
I'll cut a new release later today, or tomorrow, which will include
your implementation.
On Wed, Aug 18, 2010 at 22:48, Michael Snoyman
I'd mentioned in my previous email that I was concerned that "consume" would traverse the entire list for each new record pulled from the database. I'll try to start using the combinators instead; to get started with, it was easier to just use the raw datatypes so I could more easily see what was happening. Would this version of consume perhaps be better:

consume :: Monad m => Iteratee e a m [a]
consume = liftI $ step id where
    step acc chunk = case chunk of
        Chunks [] -> Continue $ returnI . step acc
        Chunks xs -> Continue $ returnI . (step $ acc . (xs ++))
        EOF -> Yield (acc []) EOF

Michael

Hi John,
Thanks for creating a competitor to the iteratee library. I think iteratees
are an important abstraction, but there are some things about the iteratee
library that I'm not fond of, despite John Lato doing a great job. I think
having a bit of healthy competition to explore the design space is
excellent.
I have questions for you below.
On Wed, Aug 18, 2010 at 9:31 PM, John Millikin
Most of you have probably read Oleg's essays on using left-fold enumerators for incremental IO. In short, by encapsulating monadic left-folds in an "Iteratee" type, incremental pure processing is possible without using lazy IO. Sources to read:
[snip]
While I appreciate Mr. Lato's development of the package, I find it far too large, and its documentation too sparse, to use effectively. To correct this, I've written the "enumerator" package. It is also derived from Oleg's IterateeM.hs, but with a simplified API and a significantly reduced dependency list.
I don't mind the dependency list, but I was mildly concerned that iteratee appears to work only on unix and that the API is a bit rough.
Hackage entry: http://hackage.haskell.org/package/enumerator
Haddock docs: http://ianen.org/haskell/enumerator/api-docs/
Source code (literate PDF): http://ianen.org/haskell/enumerator/enumerator.pdf
darcs get http://ianen.org/haskell/enumerator/
Additionally, I've included examples of using enumerators to implement simplified versions of the "cat" and "wc" utilities. These should serve as a useful starting point for anybody who wants to use enumerators in their own code:
http://patch-tag.com/r/jmillikin/enumerator/snapshot/current/content/pretty/...
http://patch-tag.com/r/jmillikin/enumerator/snapshot/current/content/pretty/...
The main reason I would use iteratees is for performance reasons. To help me, as a potential consumer of your library, could you please provide benchmarks for comparing the performance of enumerator with say, a) iteratee, b) lazy/strict bytestring, and c) Prelude functions? I'm interested in both max memory consumption and run-times. Using criterion and/or progression to get the run-times would be icing on an already delicious cake!
There are already a few libraries using the existing "iteratee" package (snap, attoparsec-iteratee, hexpat-iteratee); I am very interested in advice from the authors of these libraries. In particular, are any of the removed features (ListLike, WrappedByteString, seeking) something your libraries depend on? Are there any useful combinators you'd like to see included?
The only reason iteratee provides WrappedByteString is because the type class used to abstract over the stream type requires something with kind * -> *, and ByteString has kind *. The extra wrapping just adds an ignored phantom type to bytestrings. So if you don't require specific kinds, I don't think you'd need to provide a WrappedByteString.

ListLike is possibly nice, but in the type-indexed iteratee implementation that I started (but could not finish, due to some issues with the type indexing) I didn't use it. ListLike doesn't support type-threaded lists at all. On a side note, in my type-threaded iteratee library, I initially elided StreamChunk but later added something similar back in, because I found it useful. I can't recall off the top of my head what the reasoning was, but I could dig deeper if it interests you. I was also following a fairly faithful re-implementation of John Lato's implementation, just with type indexing. I should probably post my partial library regardless. Perhaps others can find ways around the bits I was stuck on.

I can see seeking as being important as your library moves into new domains of use, particularly when reading large binary streams where the data is sparse.

Thanks and congrats!

Jason

On Wed, Aug 18, 2010 at 23:33, Jason Dagit
The main reason I would use iteratees is for performance reasons. To help me, as a potential consumer of your library, could you please provide benchmarks for comparing the performance of enumerator with say, a) iteratee, b) lazy/strict bytestring, and c) Prelude functions? I'm interested in both max memory consumption and run-times. Using criterion and/or progression to get the run-times would be icing on an already delicious cake!
Oleg has some benchmarks of his implementation at < http://okmij.org/ftp/Haskell/Iteratee/Lazy-vs-correct.txt >, which clock iteratees at about twice as fast as lazy IO. He also compares them to a native "wc", but his comparison is flawed, because he's comparing a String iteratee vs byte-based wc.

I'll benchmark my "wc" and "cat" against common alternative implementations. My expectation is that they will be much slower than buffers, slightly slower than strict bytestrings, and faster than lazy bytestrings.

One of the large advantages iteratees have over lazy IO is that space use is very predictable. While exact numbers depend on the enumerator and iteratee, they are typically small and constant. For example, enumFile uses a 4096-byte buffer which is copied to a ByteString[1], so "cat" will use only about 10 KiB for a file copy. enumHandle lets this value be tuned, depending on whether you'd like smaller space use or fewer buffer reads.

[1] I don't know why this is done -- the reuse-buffer-then-copy idiom is present in Oleg's code, but I suspect just using B.hGet would be more efficient. I'll do some benchmarks to confirm.
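For reference, the two read strategies differ like this in plain IO, independent of the iteratee machinery (a sketch; the function names and the 4096 size are mine):

--------------------------------------------------------------------------------
import qualified Data.ByteString as B
import Data.Word (Word8)
import Foreign.Ptr (Ptr, castPtr)
import System.IO (Handle, hGetBuf)

-- the idiom in Oleg's code: read into one reused buffer, then copy
readCopying :: Handle -> Ptr Word8 -> IO B.ByteString
readCopying h buf = do
    n <- hGetBuf h buf 4096
    B.packCStringLen (castPtr buf, n)

-- the suggested alternative: let hGet allocate each ByteString directly
readDirect :: Handle -> IO B.ByteString
readDirect h = B.hGet h 4096
--------------------------------------------------------------------------------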
ListLike is possibly nice, but in the type indexed iteratee implementation that I started (but could not finish due to some issues with the type indexing) I didn't use it. ListLike doesn't support type threaded lists at all. On a side note, in my type threaded iteratee library, I initially elided StreamChunk but later added something similar in because I found it useful. I can't recall of the top of my head what the reasoning was, but I could dig deeper if it interests you. I was also following a fairly faithful re-implementation of John Lato's implementation, just with type indexing. I should probably post my partial library regardless. Perhaps others can find ways around the bits I was stuck on.
If you can recall the reasoning behind using ListLike or StreamChunk, it would be useful. Their advantages over simply using lists are not obvious to me.
I can see seeking as being important as your library moves into new domains of use. Particularly when reading large binary streams when the data is sparse.
Though I don't have any personal experience writing Haskell parsers for sparsely-populated files, I suspect that folds are poorly adapted to seeking. It will probably be more efficient to implement your own enumerator or enumeratee, which contains logic for skipping uninteresting portions of the file.

John Millikin wrote:
On Wed, Aug 18, 2010 at 23:33, Jason Dagit
wrote: The main reason I would use iteratees is for performance reasons. To help me, as a potential consumer of your library, could you please provide benchmarks for comparing the performance of enumerator with say, a) iteratee, b) lazy/strict bytestring, and c) Prelude functions? I'm interested in both max memory consumption and run-times. Using criterion and/or progression to get the run-times would be icing on an already delicious cake!
Oleg has some benchmarks of his implementation at < http://okmij.org/ftp/Haskell/Iteratee/Lazy-vs-correct.txt >, which clock iteratees at about twice as fast as lazy IO. He also compares them to a native "wc", but his comparison is flawed, because he's comparing a String iteratee vs byte-based wc.
I was under the impression Jason was asking about the performance of the iteratee package vs the enumerator package. I'd certainly be interested in seeing that. Right now I'm using attoparsec-iteratee, but if I could implement an attoparsec-enumerator which has the same/better performance, then I might switch over.

So far I've been very pleased with John Lato's work, quality-wise. Reducing dependencies is nice, but my main concern is the lack of documentation. I know the ideas behind iteratee and have read numerous tutorials on various people's simplified versions. However, because the iteratee package uses somewhat different terminology and types, it's not always clear exactly how to translate my knowledge into being able to use the library effectively. The enumerator package seems to have fixed this :)

-- Live well, ~wren

On Thu, Aug 19, 2010 at 14:29, wren ng thornton
I was under the impression Jason was asking about the performance of the iteratee package vs the enumerator package. I'd certainly be interested in seeing that. Right now I'm using attoparsec-iteratee, but if I could implement an attoparsec-enumerator which has the same/better performance, then I might switch over.
Oh, sorry -- both packages have the same performance. At least, if there is a difference, it's less than the margin of error on my benchmark (counting lines in the ubuntu 10.04 ISO, with cleared filesystem caches).

Here's my Iteratee benchmark. I think this is the proper way to implement "wc -l", but if you see any errors in it which could cause poor performance, please let me know.

--------------------------------------------------------------------------------
import qualified Data.ByteString.Char8 as B

iterLines :: Monad m => IterateeG WrappedByteString Word8 m Integer
iterLines = IterateeG (step 0) where
    step acc s@(EOF _) = return $ Done acc s
    step acc (Chunk wrapped) = return $ Cont (IterateeG (step acc')) Nothing
        where acc' = acc + countChar '\n' (unWrap wrapped)

countChar :: Char -> B.ByteString -> Integer
countChar c = B.foldl (\acc c' -> if c' == c then acc + 1 else acc) 0
--------------------------------------------------------------------------------

And here's typical times for various implementations -- numbers are real / user / sys, as reported by "time". They're mostly as expected, except (to my surprise) lazy bytestrings are as fast as strict bytestrings:

wc -l
====================
5.451 / 0.030 / 0.190
5.426 / 0.060 / 0.150
5.466 / 0.130 / 0.200

enumerator
====================
8.235 / 5.270 / 1.010
8.278 / 5.270 / 0.880
8.264 / 5.370 / 0.860

iteratee
====================
8.239 / 5.270 / 0.980
8.255 / 5.320 / 0.790
8.265 / 5.140 / 0.900

strict bytestrings
====================
5.425 / 2.030 / 0.360
5.402 / 2.180 / 0.330
5.446 / 2.240 / 0.400

lazy bytestrings
====================
5.467 / 1.910 / 0.260
5.428 / 1.990 / 0.280
5.433 / 2.140 / 0.190
So far I've been very pleased with John Lato's work, quality-wise. Reducing dependencies is nice, but my main concern is the lack of documentation. I know the ideas behind iteratee and have read numerous tutorials on various people's simplified versions. However, because the iteratee package uses somewhat different terminology and types, it's not always clear exactly how to translate my knowledge into being able to use the library effectively. The enumerator package seems to have fixed this :)
Glad to hear it. My goal is not to supplant "iteratee", but to supplement it -- if enumerator becomes the simple/learning version, and most major packages use "iteratee", that's fine.

Hurr durr -- I was wondering why the left-fold numbers looked so slow, then after looking at memory profiles I realized I forgot to add strictness to the enumerator/iteratee benchmarks. Here are the corrected numbers (with benchmarks attached):

enumerator
====================
5.414 / 2.090 / 0.360
5.415 / 1.950 / 0.320
5.429 / 2.010 / 0.360

iteratee
====================
5.447 / 1.620 / 0.470
5.389 / 2.030 / 0.300
5.457 / 1.960 / 0.390
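For anyone who hits the same problem: the fix is to force the accumulator before continuing. A sketch of a strict line counter against the enumerator types used in this thread (not the exact benchmark code):

--------------------------------------------------------------------------------
import qualified Data.ByteString.Char8 as B

iterLines :: Monad m => Iteratee e B.ByteString m Integer
iterLines = continue (step 0) where
    step acc (Chunks xs) =
        let acc' = acc + sum (map (toInteger . B.count '\n') xs)
        in acc' `seq` continue (step acc')  -- force the count before looping
    step acc EOF = yield acc EOF
--------------------------------------------------------------------------------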

On 20 August 2010 06:29, wren ng thornton
John Millikin wrote:
On Wed, Aug 18, 2010 at 23:33, Jason Dagit
wrote: The main reason I would use iteratees is for performance reasons. To help me, as a potential consumer of your library, could you please provide benchmarks for comparing the performance of enumerator with say, a) iteratee, b) lazy/strict bytestring, and c) Prelude functions? I'm interested in both max memory consumption and run-times. Using criterion and/or progression to get the run-times would be icing on an already delicious cake!
Oleg has some benchmarks of his implementation at < http://okmij.org/ftp/Haskell/Iteratee/Lazy-vs-correct.txt >, which clock iteratees at about twice as fast as lazy IO. He also compares them to a native "wc", but his comparison is flawed, because he's comparing a String iteratee vs byte-based wc.
I was under the impression Jason was asking about the performance of the iteratee package vs the enumerator package. I'd certainly be interested in seeing that. Right now I'm using attoparsec-iteratee, but if I could implement an attoparsec-enumerator which has the same/better performance, then I might switch over.
So far I've been very pleased with John Lato's work, quality-wise. Reducing dependencies is nice, but my main concern is the lack of documentation. I know the ideas behind iteratee and have read numerous tutorials on various people's simplified versions. However, because the iteratee package uses somewhat different terminology and types, it's not always clear exactly how to translate my knowledge into being able to use the library effectively. The enumerator package seems to have fixed this :)
To be fair, John Lato's in-development branch of iteratee also fixes the naming problem (ie. it is closer to Oleg's original naming for Iteratees, Enumerators and Enumeratees). I've been developing applications using iteratee for the past few weeks.

Considering documentation, I don't think there is a lack of published characters on the topic. Oleg's series of emails introducing Iteratee and John Lato's article in the Monad.Reader were useful, and John Millikin's documentation for enumerator is a welcome addition. However, there is a deeper issue: Iteratees are semantically complex, and that complexity is not really addressed by the existing documentation, which mostly covers the various APIs, the design motivation (an extension of the left-fold enumerator), and evangelism (comparisons to lazy IO). I found it difficult to grok the reasons for the types, and what the operational control flow is (eg. how and why does EOF get propagated, how is a seek request communicated, etc.).

In general there seems to be a lot of interest in Iteratees recently as a way of dealing with resource management in IO. It's great to have a few different implementations to compare, but once performance is benchmarked and semantics are denoted, it would be nice to converge on a single implementation and build a platform of libraries on it (for compression etc.), as was done for lazy ByteString.

cheers, Conrad.

John Millikin wrote:
If you can recall the reasoning behind using ListLike or StreamChunk, it would be useful. Their advantages over simply using lists is not obvious to me.
Well, one benefit is for efficient use of ByteStrings. Because the StreamChunk/ListLike class allows for non-parametric containers, you can use the fact that ByteString is a container of Word8 without unpacking it into a [Word8] first. Unpacking ByteStrings is horribly inefficient and is almost never necessary. As for why ListLike and StreamChunk are separate classes instead of just having "class StreamChunk full elem | full -> elem", ...that I don't know. Personally, I dislike ListLike because it has far too many methods. -- Live well, ~wren

On 19/08/2010 18:21, John Millikin wrote:
On Wed, Aug 18, 2010 at 23:33, Jason Dagit
wrote: The main reason I would use iteratees is for performance reasons. To help me, as a potential consumer of your library, could you please provide benchmarks for comparing the performance of enumerator with say, a) iteratee, b) lazy/strict bytestring, and c) Prelude functions? I'm interested in both max memory consumption and run-times. Using criterion and/or progression to get the run-times would be icing on an already delicious cake!
Oleg has some benchmarks of his implementation at < http://okmij.org/ftp/Haskell/Iteratee/Lazy-vs-correct.txt >, which clock iteratees at about twice as fast as lazy IO. He also compares them to a native "wc", but his comparison is flawed, because he's comparing a String iteratee vs byte-based wc.
Handle IO is also doing Unicode encoding/decoding, which iteratees bypass. Have you thought about how to incorporate encoding/decoding? Cheers, Simon

On Fri, Aug 20, 2010 at 04:01, Simon Marlow
Handle IO is also doing Unicode encoding/decoding, which iteratees bypass. Have you thought about how to incorporate encoding/decoding?
Yes; there will be a module Data.Enumerator.Text which contains locale-based IO, enumeratee-based encoding/decoding, and so forth. Since "iteratee" doesn't have any text-based IO, I figured it wasn't necessary for a first release; getting feedback on the basic soundness of the package was more important.

Currently, I'm planning on the following type signatures for D.E.Text. 'enumHandle' will use Text's hGetLine, since there doesn't seem to be any text-based equivalent to ByteString's 'hGet'.

--------------------------------------------------------------------------------
enumHandle :: Handle -> Enumerator SomeException Text IO b
enumFile :: FilePath -> Enumerator SomeException Text IO b

data Codec = Codec
    { codecName :: Text
    , codecEncode :: Text -> Either SomeException ByteString
    , codecDecode :: ByteString -> Either SomeException (Text, ByteString)
    }

encode :: Codec -> Enumeratee SomeException Text ByteString m b
decode :: Codec -> Enumeratee SomeException ByteString Text m b

utf8 :: Codec
utf16le :: Codec
utf16be :: Codec
utf32le :: Codec
utf32be :: Codec
ascii :: Codec
iso8859_1 :: Codec
--------------------------------------------------------------------------------
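As an illustration of the Codec record above, here is roughly what an ISO-8859-1 codec could look like (a sketch only, not the release code; ISO-8859-1 is chosen because decoding is total and never spans chunks, which keeps it short):

--------------------------------------------------------------------------------
import Control.Exception (ErrorCall(..), toException)
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as B8
import qualified Data.Text as T

iso8859_1 :: Codec
iso8859_1 = Codec
    { codecName = T.pack "ISO-8859-1"
    , codecEncode = \text -> if T.all (< '\256') text
        then Right (B8.pack (T.unpack text))
        else Left (toException (ErrorCall "iso8859_1: character out of range"))
    -- every byte is a valid character and no sequences span chunks,
    -- so decoding never fails and never produces leftover bytes
    , codecDecode = \bytes -> Right (T.pack (B8.unpack bytes), B.empty)
    }
--------------------------------------------------------------------------------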

On Fri, Aug 20, 2010 at 12:51 PM, John Millikin
Currently, I'm planning on the following type signatures for D.E.Text. 'enumHandle' will use Text's hGetLine, since there doesn't seem to be any text-based equivalent to ByteString's 'hGet'.
CC'ing text's maintainer. Using 'hGetLine' will cause baaad surprises when you process a 10 GiB file with no '\n' in sight. Cheers! =) -- Felipe.

On Fri, Aug 20, 2010 at 08:59, Felipe Lessa
On Fri, Aug 20, 2010 at 12:51 PM, John Millikin
wrote: Currently, I'm planning on the following type signatures for D.E.Text. 'enumHandle' will use Text's hGetLine, since there doesn't seem to be any text-based equivalent to ByteString's 'hGet'.
CC'ing text's maintainer. Using 'hGetLine' will cause baaad surprises when you process a 10 GiB file with no '\n' in sight.
This thought occurred to me, but really, how often are you going to have a 10 GiB **text** file with no newlines? Remember, this is for text (log files, INI-style configs, plain .txt), not binary (HTML, XML, JSON). Off the top of my head, I can't think of any case where you'd expect to see 10 GiB in a single line. In the worst case, you can just use "decode" to process bytes coming from the ByteString-based enumHandle, which should give nicely chunked text.

On Fri, Aug 20, 2010 at 1:12 PM, John Millikin
This thought occurred to me, but really, how often are you going to have a 10 GiB **text** file with no newlines? Remember, this is for text (log files, INI-style configs, plain .txt), not binary (HTML, XML, JSON). Off the top of my head, I can't think of any case where you'd expect to see 10 GiB in a single line.
In the worst case, you can just use "decode" to process bytes coming from the ByteString-based enumHandle, which should give nicely chunked text.
I was thinking about an attacker, not a use case. Think of a web server accepting queries using iteratees internally. This may open door to at least DoS attacks. And then, we use iteratees because we don't like the unpredictability of lazy IO. Why should iteratees be unpredictable when dealing with Text? Besides the memory consumption problem, there may be performance problems if the lines are too short. Cheers! =) -- Felipe.

On Fri, Aug 20, 2010 at 09:30, Felipe Lessa
I was thinking about an attacker, not a use case. Think of a web server accepting queries using iteratees internally. This may open door to at least DoS attacks.
Web servers parse/generate HTTP, which is byte-based. They should be using the bytes-based handle enumerator.
And then, we use iteratees because we don't like the unpredictability of lazy IO. Why should iteratees be unpredictable when dealing with Text? Besides the memory consumption problem, there may be performance problems if the lines are too short.
If you don't want unpredictable performance, use bytes-based IO and decode it with "decode utf8" or something similar. Text-based IO merely exists to solve the most common case, which is a small file in local encoding with relatively short (< 200 char) lines. If you need to handle more complicated cases, such as:

* Files in fixed or self-described encodings (JSON, XML)
* Files with unknown encodings (HTML, RSS)
* Files with content in multiple encodings (EMail)
* Files containing potentially malicious input (such as public server log files)

then you need to read them as bytes and decide yourself which decoding is necessary.

On 20/08/10 17:30, Felipe Lessa wrote:
On Fri, Aug 20, 2010 at 1:12 PM, John Millikin
wrote: This thought occurred to me, but really, how often are you going to have a 10 GiB **text** file with no newlines? Remember, this is for text (log files, INI-style configs, plain .txt), not binary (HTML, XML, JSON). Off the top of my head, I can't think of any case where you'd expect to see 10 GiB in a single line.
In the worst case, you can just use "decode" to process bytes coming from the ByteString-based enumHandle, which should give nicely chunked text.
I was thinking about an attacker, not a use case. Think of a web server accepting queries using iteratees internally. This may open door to at least DoS attacks.
You don't need to send that much data, the current implementation of Enumerator uses hGet, which blocks, so just send the server a few bytes and it'll be sitting there waiting for input until it times out (if ever). Open a few hundred of those connections and you're likely to cause the server to run out of FDs. Of course this is already coded up in tools like slowloris[1] :-) /M [1] http://ha.ckers.org/slowloris/ -- Magnus Therning (OpenPGP: 0xAB4DFBA4) magnus@therning.org Jabber: magnus@therning.org http://therning.org/magnus identi.ca|twitter: magthe

On Fri, Aug 20, 2010 at 12:52, Magnus Therning
You don't need to send that much data, the current implementation of Enumerator uses hGet, which blocks, so just send the server a few bytes and it'll be sitting there waiting for input until it times out (if ever). Open a few hundred of those connections and you're likely to cause the server to run out of FDs. Of course this is already coded up in tools like slowloris[1] :-)
Correct me if I'm wrong, but I'm pretty sure changing the implementation to something non-blocking like hGetNonBlocking will not fix this. Hooking up an iteratee to an enumerator which doesn't block will cause it to loop forever, which is arguably worse than simply blocking. The best way I can think of to defeat a handle-exhaustion attack is to enforce a timeout on HTTP header parsing, using something like System.Timeout. This protects against slowloris, since requiring the entire header to be parsed within some fixed small period of time prevents the socket from being held open via slowly-trickled headers.
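Concretely, the guard could look something like this (a sketch only: 'parseHeaders' and 'Headers' are hypothetical stand-ins for a real header iteratee, and I'm assuming enumHandle takes a buffer size):

--------------------------------------------------------------------------------
import Control.Exception (SomeException)
import System.IO (Handle)
import System.Timeout (timeout)

-- 'Headers' and 'parseHeaders' are hypothetical; the point is only
-- that the whole run is bounded by a wall-clock timeout
readHeaders :: Handle -> IO (Maybe (Either SomeException Headers))
readHeaders h = timeout tenSeconds (run (enumHandle 4096 h ==<< parseHeaders))
    where tenSeconds = 10 * 1000000  -- timeout takes microseconds
--------------------------------------------------------------------------------

A Nothing result means the client was too slow delivering headers, and the connection can be dropped.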

On 20/08/10 22:32, John Millikin wrote:
On Fri, Aug 20, 2010 at 12:52, Magnus Therning
wrote: You don't need to send that much data, the current implementation of Enumerator uses hGet, which blocks, so just send the server a few bytes and it'll be sitting there waiting for input until it times out (if ever). Open a few hundred of those connections and you're likely to cause the server to run out of FDs. Of course this is already coded up in tools like slowloris[1] :-)
Correct me if I'm wrong, but I'm pretty sure changing the implementation to something non-blocking like hGetNonBlocking will not fix this. Hooking up an iteratee to an enumerator which doesn't block will cause it to loop forever, which is arguably worse than simply blocking.
The best way I can think of to defeat a handle-exhaustion attack is to enforce a timeout on HTTP header parsing, using something like System.Timeout. This protects against slowloris, since requiring the entire header to be parsed within some fixed small period of time prevents the socket from being held open via slowly-trickled headers.
Indeed. In many protocols it would force the attacker to send well-formed requests though. I think this is true for many text-based protocols like HTTP. The looping can be handled effectively through hWaitForInput. There are also other reasons for doing non-blocking IO, not least that it makes developing and manual testing a lot nicer. /M -- Magnus Therning (OpenPGP: 0xAB4DFBA4) magnus@therning.org Jabber: magnus@therning.org http://therning.org/magnus identi.ca|twitter: magthe

On Fri, Aug 20, 2010 at 14:58, Magnus Therning
Indeed.
In many protocols it would force the attacker to send well-formed requests though. I think this is true for many text-based protocols like HTTP.
The looping can be handled effectively through hWaitForInput.
There are also other reasons for doing non-blocking IO, not least that it makes developing and manual testing a lot nicer.
I think I'm failing to understand something. Using a non-blocking read doesn't change how the iteratees react to well- or mal-formed requests. All it does is change the failure condition from "blocked indefinitely" to "looping indefinitely". Replacing the hGet with a combination of hWaitForInput / hGetNonBlocking would cause a third failure condition, "looping indefinitely with periodic blocks". This doesn't seem to be an improvement over simply blocking. Do you have any example code which works well using a non-blocking enumerator, but fails with a blocking one?

On 20/08/10 23:12, John Millikin wrote:
On Fri, Aug 20, 2010 at 14:58, Magnus Therning
wrote: Indeed.
In many protocols it would force the attacker to send well-formed requests though. I think this is true for many text-based protocols like HTTP.
The looping can be handled effectively through hWaitForInput.
There are also other reasons for doing non-blocking IO, not least that it makes developing and manual testing a lot nicer.
I think I'm failing to understand something.
Using a non-blocking read doesn't change how the iteratees react to well- or mal-formed requests. All it does is change the failure condition from "blocked indefinitely" to "looping indefinitely".
It changes the timing. The iteratee will receive the data sooner (when it's available rather than when the buffer is full). This means it can fail *sooner*, in wall-clock time.
Replacing the hGet with a combination of hWaitForInput / hGetNonBlocking would cause a third failure condition, "looping indefinitely with periodic blocks". This doesn't seem to be an improvement over simply blocking.
It is an improvement when data is trickling in. In other cases it's no improvement (besides that it'd be possible have time-outs on a "lower level").
Do you have any example code which works well using a non-blocking enumerator, but fails with a blocking one?
It's not about failing vs non-failing, it's about time of failure. An example would be failing after reading a few bytes (the verb of an HTTP request) vs failing after either reading 4k (which is the buffer size in iteratee, IIRC) or when the client hangs up. /M -- Magnus Therning (OpenPGP: 0xAB4DFBA4) magnus@therning.org Jabber: magnus@therning.org http://therning.org/magnus identi.ca|twitter: magthe

On Sat, Aug 21, 2010 at 5:40 AM, Magnus Therning
It changes the timing. The iteratee will receive the data sooner (when it's available rather than when the buffer is full). This means it can fail *sooner*, in wall-clock time.
I still fail to see how this works. So I went to see the sources.

In [1] we can see how hGet and hGetNonBlocking are defined. The only difference is that the former uses hGetBuf, and the latter uses hGetBufNonBlocking.

[1] http://hackage.haskell.org/packages/archive/bytestring/0.9.1.7/doc/html/src/...

hGetBuf's main loop is bufRead [2], while hGetBufNonBlocking's main loop is bufReadNonBlocking [3]. Both are very similar. The main differences are RawIO.read vs RawIO.readNonBlocking [4], and Buffered.fillReadBuffer vs Buffered.fillReadBuffer0 [5].

Reading RawIO's documentation [4], we see that RawIO.read blocks only if there is no data available. So it doesn't wait for the buffer to be fully filled, it just "returns the available data". Unfortunately, BufferedIO's documentation [5] doesn't specify whether Buffered.fillReadBuffer should return the available data without blocking. However, it does specify that it should be "blocking if there are no bytes available".

[2] http://hackage.haskell.org/packages/archive/base/4.2.0.1/doc/html/src/GHC-IO...
[3] http://hackage.haskell.org/packages/archive/base/4.2.0.1/doc/html/src/GHC-IO...
[4] http://hackage.haskell.org/packages/archive/base/4.2.0.1/doc/html/src/GHC-IO...
[5] http://hackage.haskell.org/packages/archive/base/4.2.0.1/doc/html/src/GHC-IO...

So, assuming that the semantics of BufferedIO are the same as RawIO's, *both* are non-blocking whenever data is already available. Neither of them waits until the buffer is full. The difference lies in whether they block if there is no data available. However, when there isn't data, the enumerator *always* wants to block. So using non-blocking IO doesn't gain anything, it only complicates the code.

Am I misreading the docs/source somewhere? =)

Cheers! -- Felipe.

I think the docs are wrong, or perhaps we're misunderstanding them. Magnus is correct. Attached is a test program which listens on two ports, 42000 (blocking IO) and 42001 (non-blocking). You can use netcat, telnet, etc, to send it data. The behavior is as Magnus describes: bytes from hGetNonBlocking are available immediately, while hGet waits for a full buffer (or EOF) before returning. This behavior obviously makes hGet unsuitable for enumHandle; my apologies for not understanding the problem sooner.

John Millikin
I think the docs are wrong, or perhaps we're misunderstanding them. Magnus is correct.
Attached is a test program which listens on two ports, 42000 (blocking IO) and 42001 (non-blocking). You can use netcat, telnet, etc, to send it data. The behavior is as Magnus describes: bytes from hGetNonBlocking are available immediately, while hGet waits for a full buffer (or EOF) before returning.
"hSetBuffering handle NoBuffering"?
The implementation as it is is fine IMO.
G
--
Gregory Collins

On Sat, Aug 21, 2010 at 11:35, Gregory Collins
John Millikin
writes: I think the docs are wrong, or perhaps we're misunderstanding them. Magnus is correct.
Attached is a test program which listens on two ports, 42000 (blocking IO) and 42001 (non-blocking). You can use netcat, telnet, etc, to send it data. The behavior is as Magnus describes: bytes from hGetNonBlocking are available immediately, while hGet waits for a full buffer (or EOF) before returning.
"hSetBuffering handle NoBuffering"?
The implementation as it is is fine IMO.
Disabling buffering doesn't change the behavior -- hGet h 20 still doesn't return until the handle has at least 20 bytes of input available.

On Sat, Aug 21, 2010 at 10:58 AM, John Millikin
I think the docs are wrong, or perhaps we're misunderstanding them. Magnus is correct.
Attached is a test program which listens on two ports, 42000 (blocking IO) and 42001 (non-blocking). You can use netcat, telnet, etc, to send it data. The behavior is as Magnus describes: bytes from hGetNonBlocking are available immediately, while hGet waits for a full buffer (or EOF) before returning.
This behavior obviously makes hGet unsuitable for enumHandle; my apologies for not understanding the problem sooner.
You should note that in ghc >= 6.12, hWaitForInput tries to decode the next character of input based on the Handle's encoding. As a result, it will block if the next multibyte sequence is incomplete, and it will throw an error if a multibyte sequence gets split between two chunks. I worked around this problem in Haskeline by temporarily setting stdin to BinaryMode; you may want to do something similar.

Also, this issue caused a bug in bytestring with ghc-6.12: http://hackage.haskell.org/trac/ghc/ticket/3808 which will be resolved by the new function 'hGetBufSome' (in ghc-6.14) that blocks only when there's no data to read: http://hackage.haskell.org/trac/ghc/ticket/4046 That function might be useful for your package, though it's not portable to other implementations or older GHC versions.

Best, -Judah

On Sat, Aug 21, 2010 at 11:58, Judah Jacobson
You should note that in ghc>=6.12, hWaitForInput tries to decode the next character of input based on to the Handle's encoding. As a result, it will block if the next multibyte sequence is incomplete, and it will throw an error if a multibyte sequence gets split between two chunks.
I worked around this problem in Haskeline by temporarily setting stdin to BinaryMode; you may want to do something similar.
Also, this issue caused a bug in bytestring with ghc-6.12: http://hackage.haskell.org/trac/ghc/ticket/3808 which will be resolved by the new function 'hGetBufSome' (in ghc-6.14) that blocks only when there's no data to read: http://hackage.haskell.org/trac/ghc/ticket/4046 That function might be useful for your package, though not portable to other implementations or older GHC versions.
You should not be reading bytestrings from text-mode handles. The more I think about it, the more having a single Handle type for both text and binary data causes problems. There should be some separation, so users don't accidentally use a text handle with binary functions, and vice-versa:

openFile :: FilePath -> IOMode -> IO TextHandle
openBinaryFile :: FilePath -> IOMode -> IO BinaryHandle

hGetBuf :: BinaryHandle -> Ptr a -> Int -> IO Int
Data.ByteString.hGet :: BinaryHandle -> IO ByteString
-- etc

then the enumerators would simply require the correct handle type:

Data.Enumerator.IO.enumHandle :: BinaryHandle -> Enumerator SomeException ByteString IO b
Data.Enumerator.Text.enumHandle :: TextHandle -> Enumerator SomeException Text IO b

I suppose the enumerators could verify the handle mode and throw an exception if it's incorrect -- at least that way, it will fail consistently rather than only in rare occasions.

On 21/08/10 18:58, John Millikin wrote:
I think the docs are wrong, or perhaps we're misunderstanding them. Magnus is correct.
Attached is a test program which listens on two ports, 42000 (blocking IO) and 42001 (non-blocking). You can use netcat, telnet, etc, to send it data. The behavior is as Magnus describes: bytes from hGetNonBlocking are available immediately, while hGet waits for a full buffer (or EOF) before returning.
This behavior obviously makes hGet unsuitable for enumHandle; my apologies for not understanding the problem sooner.
Thanks, but I suspect that it was my bad description that made the issue harder to understand. Anyway, it's good we now understand each other, and even better that we agree :-) As an aside, has anyone written the code necessary to convert a parser, such as e.g. attoparsec, into an enumerator-iteratee[1]? /M [1] Similar to how attoparsec-iteratee does it for iteratee-iteratee. -- Magnus Therning (OpenPGP: 0xAB4DFBA4) magnus@therning.org Jabber: magnus@therning.org http://therning.org/magnus identi.ca|twitter: magthe

On Sat, Aug 21, 2010 at 12:44, Magnus Therning
As an aside, has anyone written the code necessary to convert a parser, such as e.g. attoparsec, into an enumerator-iteratee[1]?
This sort of conversion is trivial. For an example, I've uploaded the attoparsec-enumerator package at < http://hackage.haskell.org/package/attoparsec-enumerator > -- iterParser is about 20 lines, excluding the module header and imports.
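For the curious, the conversion has roughly this shape (a sketch of the idea, not the package's exact code; it assumes the Step type has an Error constructor, and 'parseError' is a local helper of mine):

--------------------------------------------------------------------------------
import Control.Exception (SomeException, ErrorCall(..), toException)
import qualified Data.Attoparsec as A
import qualified Data.ByteString as B

iterParser :: Monad m => A.Parser a -> Iteratee SomeException B.ByteString m a
iterParser p = continue (step (A.parse p)) where
    step parse (Chunks []) = continue (step parse)  -- no data yet; keep waiting
    step parse (Chunks xs) = check (parse (B.concat xs))
    step parse EOF = case parse B.empty of  -- an empty chunk signals end of input
        A.Done extra a -> yield a (Chunks [extra])
        _              -> parseError "iterParser: unexpected end of input"
    check (A.Done extra a)   = yield a (Chunks [extra])
    check (A.Partial parse') = continue (step parse')
    check (A.Fail _ _ err)   = parseError err
    parseError = returnI . Error . toException . ErrorCall
--------------------------------------------------------------------------------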

On Sat, Aug 21, 2010 at 3:36 PM, John Millikin
This sort of conversion is trivial. For an example, I've uploaded the attoparsec-enumerator package at < http://hackage.haskell.org/package/attoparsec-enumerator > -- iterParser is about 20 lines, excluding the module header and imports.
Cool, but is there a reason it won't work with version 0.2 you just released? build-depends: [...] , enumerator >= 0.1 && < 0.2 I noticed that when installing it. Paulo

On Sat, Aug 21, 2010 at 14:17, Paulo Tanimoto
Cool, but is there a reason it won't work with version 0.2 you just released?
build-depends: [...] , enumerator >= 0.1 && < 0.2
I noticed that when installing it.
Hah ... forgot to save the vim buffer. Corrected version uploaded. Sorry about that.

On Sat, 21 Aug 2010 13:36:08 -0700, John Millikin
On Sat, Aug 21, 2010 at 12:44, Magnus Therning
wrote: As an aside, has anyone written the code necessary to convert a parser, such as e.g. attoparsec, into an enumerator-iteratee[1]?
This sort of conversion is trivial. For an example, I've uploaded the attoparsec-enumerator package at < http://hackage.haskell.org/package/attoparsec-enumerator > -- iterParser is about 20 lines, excluding the module header and imports.
< A.Done extra a -> E.yield a (E.Chunks [extra]) Maybe it would be better to check if extra is empty to produce an empty list of chunks? -- Nicolas Pouillard http://nicolaspouillard.fr

On Aug 21, 2010, at 4:12 AM, John Millikin wrote:
This thought occurred to me, but really, how often are you going to have a 10 GiB **text** file with no newlines?
When you have a file developed on a system that follows a different new-line convention. I haven't seen a file that big, but I'm sadly used to seeing humanly large files display as single lines. Of course if getLine/hGetLine accept *any* of CR, LF, CR+LF as end-of-line (as opposed to using the platform native convention), there's no problem. That's a darned good idea anyway.

Very nice. It looks a lot like the iteratee replacement I've been working on for an as-of-yet unreleased project of mine. The main differences:

- my chunks only have a single ByteString, not a list of them (I was experimenting with a list at a point where I thought it'd be useful to be able to insert stuff into the stream, but abandoned it for the moment)
- my enumHandle uses hGetNonBlocking to make it easier to deal with early termination when reading data off the network
- I have a set of unit tests that cover the implementation fairly well

My reasons for writing my own were basically the same as yours. I'll definitely have a look at ditching my implementation at some point, in favour of yours :-)

/M -- Magnus Therning (OpenPGP: 0xAB4DFBA4) magnus@therning.org Jabber: magnus@therning.org http://therning.org/magnus identi.ca|twitter: magthe

Just released version 0.1.1, which includes:

* Michael Snoyman's improved 'consume'
* (>==>) and (<==<) operators, for composing enumerators (sort of like (>=>) for monads)
* ($$) operator, an alias for (==<<), which matches Oleg's operator and makes reading 'run' statements a bit easier
* catchError, for handling errors
* liftTrans, for lifting an iteratee's inner monad to a monad transformer
* liftFoldL and liftFoldL', for building iteratees out of pure left folds. An example of their use is 'iterLines' in Examples/wc.hs
* liftFoldM, like liftFoldL but for monadic left folds

Just released 0.2. It has the text IO and codecs module, with support for ASCII, ISO-8859-1, UTF-8, UTF-16, and UTF-32. It should be relatively easy to add support for codec libraries like libicu or libiconv in the future. Both encoding and decoding are incremental, so you can (for example) process million-line logfiles in constant space.

Examples/wc.hs has been updated to use this decoding module for its "character count" mode, which should allow users to see how it's used. Basically, you use 'joinI' to flatten the iteratees returned from enumeratees. The joinI / enumeratee style is used for implementing nested streams.

This also changes the binary enumHandle to use non-blocking IO, as recommended by Magnus Therning. I'm embarrassed to admit I still don't understand the improvement, exactly, but three people so far have told me it's a good idea.

As always, API docs and a literate PDF are available at:

http://ianen.org/haskell/enumerator/api-docs/
http://ianen.org/haskell/enumerator/enumerator.pdf
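For example, counting the decoded characters of a file looks roughly like this (a sketch; the module layout and the result type of 'length' here are my assumptions, not documented API):

--------------------------------------------------------------------------------
import Control.Exception (SomeException)
import qualified Data.Enumerator as E
import Data.Enumerator.IO (enumFile)
import Data.Enumerator.Text (decode, utf8)

-- the byte enumerator feeds the utf8 enumeratee; joinI flattens the
-- nested iteratee so plain 'run' can drive the whole pipeline
countChars :: FilePath -> IO (Either SomeException Integer)
countChars path = E.run (enumFile path E.$$ E.joinI (decode utf8 E.$$ E.length))
--------------------------------------------------------------------------------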

Hi John, What do you think of putting those parsing functions like head, last, length, etc, under another module or, alternatively, putting the main definitions under another module (say, Base or Core)? I wouldn't mind if they all get re-exported. I say that because since the library aims to be minimalistic, it would be nice to import the core parts only. That makes it easy to avoid some name clashes as well. Take care, Paulo

On Fri, Aug 20, 2010 at 21:02, Paulo Tanimoto
Hi John,
What do you think of putting those parsing functions like head, last, length, etc, under another module or, alternatively, putting the main definitions under another module (say, Base or Core)? I wouldn't mind if they all get re-exported.
I say that because since the library aims to be minimalistic, it would be nice to import the core parts only. That makes it easy to avoid some name clashes as well.
My goal isn't to be "minimalistic", necessarily, I just don't want to drag in huge dependencies like haskell98. Ideally, the API would be something like that of "bytestring" or "text" -- easy to understand, comprehensive, and with large dependencies factored out to related modules (like "text-icu" or "bytestring-mmap"). Having lots (and lots and lots) of exports is OK, as long as it's easy for users to understand how they work. Regarding my comment for the parsing functions: I'm starting to think that's wrong. They're not for "parsing" so much as general data manipulation. Especially cases like dropWhile or peek, which are useful to all sorts of data types which can't really be "parsed". The next release will probably see an expansion and documentation of that section, though some of the more useless ones (length, last) will be removed unless anybody speaks up.

On Sat, Aug 21, 2010 at 12:30 AM, John Millikin
Just released 0.2. It has the text IO and codecs module, with support for ASCII, ISO-8859-1, UTF-8, UTF-16, and UTF-32. It should be relatively easy to add support for codec libraries like libicu or libiconv in the future. Both encoding and decoding are incremental, so you can (for example) process million-line logfiles in constant space.
I think it would be nice to say in the docs that a constant sized buffer isn't used. Alas, Data.Text.IO.hGetLine internally uses Data.Text.concat. This means that you need to do an additional copy whenever a newline is not found in the first buffer. So there's a performance reason to have an hGet as well =).
This also changes the binary enumHandle to use non-blocking IO, as recommended by Magnus Therning. I'm embarrassed to admit I still don't understand the improvement, exactly, but three people so far have told me it's a good idea.
Me neither =). Cheers! -- Felipe.

On 21/08/2010 04:30, John Millikin wrote:
This also changes the binary enumHandle to use non-blocking IO, as recommended by Magnus Therning. I'm embarrassed to admit I still don't understand the improvement, exactly, but three people so far have told me it's a good idea.
The issue is that hGet always waits for a complete buffer-full of data before returning. The hWaitForInput/hGetNonBlocking combination fixes that problem, but you have to be careful to make sure that the Handle is in binary mode, otherwise hWaitForInput will not behave the way you expect (it will decode the input byte stream, and wait for a full character). For more information, see http://hackage.haskell.org/trac/ghc/ticket/3808

A better fix is to use hGetBufSome, but (a) it is only available in GHC 6.14 which isn't released yet, and (b) there isn't a bytestring wrapper for it yet.

Cheers, Simon
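In code, the combination looks roughly like this (a sketch; a real enumerator would also need to catch the isEOFError that hWaitForInput raises at end of input):

--------------------------------------------------------------------------------
import qualified Data.ByteString as B
import System.IO (Handle, hWaitForInput)

-- block until at least one byte is available (a negative timeout means
-- wait indefinitely), then take whatever is buffered, up to n bytes
hGetSome :: Handle -> Int -> IO B.ByteString
hGetSome h n = do
    _ <- hWaitForInput h (-1)
    B.hGetNonBlocking h n
--------------------------------------------------------------------------------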

Hello, Simon!
On Mon, Aug 23, 2010 at 8:00 AM, Simon Marlow
The issue is that hGet always waits for a complete buffer-full of data before returning. The hWaitForInput/hGetNonBlocking combination fixes that problem, but you have to be careful to make sure that the Handle is in binary mode, otherwise hWaitForInput will not behave the way you expect (it will decode the input byte stream, and wait for a full character). For more information, see
http://hackage.haskell.org/trac/ghc/ticket/3808
A better fix is to use hGetBufSome, but (a) it is only available in GHC 6.14 which isn't released yet, and (b) there isn't a bytestring wrapper for it yet.
So there really is a problem in the documentation of hGetBuf. I assume it got fixed in HEAD together with hGetBufSome. Cheers! =) -- Felipe.

On 23/08/2010 12:10, Felipe Lessa wrote:
Hello, Simon!
On Mon, Aug 23, 2010 at 8:00 AM, Simon Marlow
wrote: The issue is that hGet always waits for a complete buffer-full of data before returning. The hWaitForInput/hGetNonBlocking combination fixes that problem, but you have to be careful to make sure that the Handle is in binary mode, otherwise hWaitForInput will not behave the way you expect (it will decode the input byte stream, and wait for a full character). For more information, see
http://hackage.haskell.org/trac/ghc/ticket/3808
A better fix is to use hGetBufSome, but (a) it is only available in GHC 6.14 which isn't released yet, and (b) there isn't a bytestring wrapper for it yet.
So there really is a problem in the documentation of hGetBuf. I assume it got fixed in HEAD together with hGetBufSome.
Which documentation are you referring to? This looks ok to me: http://www.haskell.org/ghc/docs/6.12.2/html/libraries/base-4.2.0.1/System-IO... Cheers, Simon

On Mon, Aug 23, 2010 at 8:29 AM, Simon Marlow
Which documentation are you referring to? This looks ok to me:
http://www.haskell.org/ghc/docs/6.12.2/html/libraries/base-4.2.0.1/System-IO...
Indeed, while there isn't a big fat warning, it does say that it reads 'count' bytes. However, both RawIO.read and BufferedIO.fillReadBuffer are a bit misleading. The former says that it doesn't block when there isn't data available, the latter doesn't say anything. http://hackage.haskell.org/packages/archive/base/4.2.0.1/doc/html/src/GHC-IO... http://hackage.haskell.org/packages/archive/base/4.2.0.1/doc/html/src/GHC-IO... Cheers! =) -- Felipe.

On 23/08/2010 12:38, Felipe Lessa wrote:
On Mon, Aug 23, 2010 at 8:29 AM, Simon Marlow
wrote: Which documentation are you referring to? This looks ok to me:
http://www.haskell.org/ghc/docs/6.12.2/html/libraries/base-4.2.0.1/System-IO...
Indeed, while there isn't a big fat warning, it does say that it reads 'count' bytes.
However, both RawIO.read and BufferedIO.fillReadBuffer are a bit misleading. The former says that it doesn't block when there isn't data available, the latter doesn't say anything.
http://hackage.haskell.org/packages/archive/base/4.2.0.1/doc/html/src/GHC-IO... http://hackage.haskell.org/packages/archive/base/4.2.0.1/doc/html/src/GHC-IO...
Hmm, RawIO.read looks ok:

-- | Read up to the specified number of bytes, returning the number
-- of bytes actually read. This function should only block if there
-- is no data available. If there is not enough data available,
-- then the function should just return the available data. A return
-- value of zero indicates that the end of the data stream (e.g. end
-- of file) has been reached.

that seems pretty clear to me. No?

I'll expand the documentation for fillReadBuffer.

Cheers, Simon

On Mon, Aug 23, 2010 at 8:51 AM, Simon Marlow
Hmm, RawIO.read looks ok:
-- | Read up to the specified number of bytes, returning the number
-- of bytes actually read. This function should only block if there
-- is no data available. If there is not enough data available,
-- then the function should just return the available data. A return
-- value of zero indicates that the end of the data stream (e.g. end
-- of file) has been reached.
that seems pretty clear to me. No?
It says that it "should only block if there is no data available". I assumed that fillReadBuffer has the same semantics. If both do not block if there is data, then hGetBuf would not wait for the buffer to be filled, if I am reading its source correctly [1]. Either they do block until the buffer is filled, or I'm misreading hGetBuf/bufRead. =) [1] http://www.haskell.org/ghc/docs/6.12.2/html/libraries/base-4.2.0.1/src/GHC-I... Cheers! =) -- Felipe.

On 23/08/2010 12:57, Felipe Lessa wrote:
On Mon, Aug 23, 2010 at 8:51 AM, Simon Marlow
wrote: Hmm, RawIO.read looks ok:
-- | Read up to the specified number of bytes, returning the number -- of bytes actually read. This function should only block if there -- is no data available. If there is not enough data available, -- then the function should just return the available data. A return -- value of zero indicates that the end of the data stream (e.g. end -- of file) has been reached.
that seems pretty clear to me. No?
It says that it "should only block if there is no data available". I assumed that fillReadBuffer has the same semantics. If both do not block if there is data, then hGetBuf would not wait for the buffer to be filled, if I am reading its source correctly [1]. Either they do block until the buffer is filled, or I'm misreading hGetBuf/bufRead. =)
I think it's the latter. bufRead loops until it has read the full amount of data requested, or EOF is reached. Cheers, Simon
[1] http://www.haskell.org/ghc/docs/6.12.2/html/libraries/base-4.2.0.1/src/GHC-I...
Cheers! =)

On Mon, Aug 23, 2010 at 9:00 AM, Simon Marlow
I think it's the latter. bufRead loops until it has read the full amount of data requested, or EOF is reached.
Hmmm... sorry about the noise then =). Cheers, -- Felipe.
participants (14)

- Conrad Parker
- Felipe Lessa
- Gregory Collins
- Ivan Lazar Miljenovic
- Jason Dagit
- John Millikin
- Judah Jacobson
- Magnus Therning
- Michael Snoyman
- Nicolas Pouillard
- Paulo Tanimoto
- Richard O'Keefe
- Simon Marlow
- wren ng thornton