
On Wed, 17 Sep 2008, Mitchell, Neil wrote:
I tend to use openFile, hGetContents, hClose - your initial readFile-like call should be openFile/hGetContents, which gives you a lazy stream, and on a parse error call hClose.
I could use a function like

    withReadFile :: FilePath -> (Handle -> IO a) -> IO a
    withReadFile name action = bracket (openFile name ReadMode) hClose action
Then, if 'action' fails, the file can be properly closed. However, there is still a problem: Say, 'action' is a parser which produces a data structure lazily. Then further processing of that data structure of type 'a' may again stop before completing the whole structure, which would also leave the file open. We have to force users to do all processing within 'action' and to only return strict values. But how to do this?
I used rnf from Control.Parallel.Strategies when dealing with a similar problem. Would it work in your case?

To merge discussion from a related thread:

IMO, the question is how much a language/library should prevent the user from shooting himself in the foot. The biggest problem with lazy IO, IMO, is that it presents several opportunities to do so. The three biggest causes I've dealt with are handle leaks, insufficiently evaluated data structures, and problems with garbage collection, as in the naive 'mean xs = sum xs / length xs' implementation.

There are some idioms that can help with the first two cases, namely the 'with*' paradigm and 'rnf', but the third problem requires that the programmer know how things work in order to avoid poor implementations. While that's not bad per se, in some cases I think it's far too easy for the unwitting, or even the slightly distracted, to get caught in traps.

I'm facing a related design decision at the moment. I can use something like lazy bytestrings, in which the chunking and laziness are reified into the data structure, or an Iterator-style fold for consuming data. The advantage of the former approach is that it's well understood by most users and has proven good performance; on the downside, I could see it easily leading to memory exhaustion. The problem with lazy bytestrings in particular is that the foldChunks is so well hidden from most consumers that it's easy to hold references that prevent consumed chunks from being reclaimed by the GC. When dealing with data in the hundreds-of-megabytes or gigabyte range, this is a problem.

An Enumerator, on the other hand, makes the fold explicit, so users are required to think about the best way to consume data. It's much harder to unintentionally hold references, which is quite appealing. Based on my own tests so far, performance is certainly competitive, and assuming a good implementation, handle leaks can also be prevented. On the downside, it's a poor model if random access is required, users aren't as familiar with it, and there are the questions Don raises.

Back onto the topic at hand: 'action' could be a parser passed into an enumerator. Since it would read strictly, the action could end the read when it has enough data. That's another approach that I think would work with your problem.

Well, that's my two cents.

John Lato
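To make the rnf suggestion concrete, here is a minimal sketch of a withReadFile that forces the action's result to normal form before the handle is closed. The helper names and the test file are illustrative, and force/NFData from Control.DeepSeq stand in for rnf, which lived in Control.Parallel.Strategies at the time:

    import Control.DeepSeq (NFData, force)
    import Control.Exception (bracket, evaluate)
    import System.IO (Handle, IOMode (ReadMode), hClose, hGetContents, openFile)

    -- Open the file, run the action, and close the handle even if the
    -- action throws.  The result is forced to normal form *inside* the
    -- bracket, so no unevaluated thunk can still depend on the handle
    -- after hClose runs.
    withReadFile :: NFData a => FilePath -> (Handle -> IO a) -> IO a
    withReadFile name action =
      bracket (openFile name ReadMode) hClose $ \h -> do
        x <- action h
        evaluate (force x)

    -- Hypothetical lazy parser standing in for 'action'.
    parseLengths :: Handle -> IO [Int]
    parseLengths h = fmap (map length . lines) (hGetContents h)

    main :: IO ()
    main = withReadFile "input.txt" parseLengths >>= print

And for the garbage-collection problem mentioned above, a sketch (again only an illustration, not code from the thread) of why the naive mean retains the whole list and how a single strict pass avoids it:

    -- Naive: 'sum' traverses the list while 'length' still holds a
    -- reference to it, so every cell stays live until both passes finish.
    meanNaive :: [Double] -> Double
    meanNaive xs = sum xs / fromIntegral (length xs)

    -- One strict pass: the running sum and count are forced at each step,
    -- so consumed cells can be reclaimed as the list is produced.
    mean :: [Double] -> Double
    mean = go 0 (0 :: Int)
      where
        go s n []       = s / fromIntegral n
        go s n (x : xs) = s `seq` n `seq` go (s + x) (n + 1) xs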

jwlato:
An Enumerator, on the other hand, makes the fold explicit, so users are required to think about the best way to consume data. It's much harder to unintentionally hold references. This is quite appealing. Based on my own tests so far performance is certainly competitive. Assuming a good implementation, handle leaks can also be prevented. On the downside, it's a very poor model if random access is required, and users aren't as familiar with it, in addition to some of the questions Don raises.
Yes, I'm certain we can reach the performance of, or outperform, lazy (cache-sized chunk) bytestrings using enumerators on chunks, but the model is somewhat unfamiliar. Structuring the API so that people can write programs in this style will be the challenge. -- Don
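A minimal sketch of the explicit-fold idea being discussed: the enumerator owns the handle and the read loop, feeds strict chunks to a step function, and lets the consumer stop early. The names (enumFile, Step, countFirstMeg) and the fixed chunk size are illustrative assumptions, not the API of any particular iteratee library:

    import Control.Exception (bracket)
    import qualified Data.ByteString as B
    import System.IO (IOMode (ReadMode), hClose, openFile)

    -- A step either continues with a new accumulator or stops early.
    data Step a = Continue a | Done a

    -- The enumerator drives the fold: the consumer never sees a lazy
    -- stream, so it cannot leak the handle or accidentally retain chunks.
    enumFile :: Int -> FilePath -> (a -> B.ByteString -> Step a) -> a -> IO a
    enumFile chunkSize path step start =
      bracket (openFile path ReadMode) hClose (go start)
      where
        go acc h = do
          chunk <- B.hGet h chunkSize
          if B.null chunk
            then return acc                      -- EOF: fold is finished
            else case step acc chunk of
                   Done acc'     -> return acc'  -- consumer stopped early
                   Continue acc' -> go acc' h

    -- Example consumer: count bytes, stopping after the first megabyte.
    countFirstMeg :: FilePath -> IO Int
    countFirstMeg path = enumFile 32768 path step 0
      where
        step n chunk
          | n' >= 1024 * 1024 = Done n'
          | otherwise         = Continue n'
          where n' = n + B.length chunk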
participants (2):
- Don Stewart
- John Lato