
On 22 April 2005 21:56, Glynn Clements wrote:
David Brown wrote:
And hGet/PutWord8 are fast enough for most situations.
Are you certain? Which interesting applications do you know that read and write one byte at a time? I can't speak for "most" situations, but in those situations where I needed binary I/O this API would have been impracticable.
I'm going to chime in agreement with the disagreement.
I have one application which uses single-character binary I/O. Because it works on small files, it is just barely useful. As time goes on the size of these inputs grows, and now the tool has become nearly worthless.
I seem to run into this kind of thing a lot with Haskell. Dominic's ASN.1 library is useful to me; however, I'm finding I'll probably just have to use it as a template for new code. It works fine for what it was designed for, parsing keys and such, but I'm looking to use it to represent gigabytes of data. Processing data like that one Word8 at a time isn't practical.
Personally, I doubt that Haskell will ever be practical for processing very large amounts of data (e.g. larger than your system's RAM).
I hope that's not true - whatever techniques you use in other languages for handling large amounts of data should translate straightforwardly into Haskell.

Cheers,
Simon

On Fri, 2005-04-22 at 22:07 +0100, Simon Marlow wrote:
Personally, I doubt that Haskell will ever be practical for processing very large amounts of data (e.g. larger than your system's RAM).
I hope that's not true - whatever techniques you use in other languages for handling large amounts of data should translate straightforwardly into Haskell.
Aren't there some issues, though, with the fact that regular garbage collection touches most of the heap (even if it doesn't modify it), so that very little of it can be paged out of physical RAM? Of course we can use more heavyweight techniques like maintaining serialised data in large mmapped areas, etc.

Duncan

On Apr 22, 2005, at 5:33 PM, Duncan Coutts wrote:
Aren't there some issues, though, with the fact that regular garbage collection touches most of the heap (even if it doesn't modify it), so that very little of it can be paged out of physical RAM?
This is a common misconception about garbage collection in general. There are only two reasons for a garbage collector to walk through a given piece of memory:

* The memory is live, and may contain pointers; those pointers must be found and traced.
* A copying/compacting collector needs to move the data.

Most collectors keep a special large object area which contains big arrays. Even if copying collection is used for other objects, these large objects never move. Furthermore, if an array contains no pointers (because, for example, it's a byte array read from a file) it does not need to be scanned by the garbage collector.

So I wouldn't worry about having your huge binary objects walked by the garbage collector. Whatever GC may do to a heap chock-full of tiny objects, a single large pointer-free object should be left alone.

-Jan-Willem Maessen
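To make that concrete in Haskell terms, here is a minimal sketch of reading a whole file into a single unboxed, pointer-free byte array using the existing Data.Array.IO interface (readWholeFile is just an illustrative name, not part of any proposal):

  import Data.Array.IO (IOUArray, hGetArray)
  import Data.Array.MArray (newArray_)
  import Data.Word (Word8)
  import System.IO (openBinaryFile, hClose, hFileSize, IOMode(ReadMode))

  -- Read an entire file into one big IOUArray Int Word8.  The array's
  -- payload contains no pointers, so the collector never needs to scan
  -- it; at worst it is treated as a single large object.
  readWholeFile :: FilePath -> IO (IOUArray Int Word8)
  readWholeFile path = do
    h   <- openBinaryFile path ReadMode
    n   <- fmap fromIntegral (hFileSize h)
    arr <- newArray_ (0, n - 1)
    _   <- hGetArray h arr n      -- a robust version would check the count
    hClose h
    return arr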

On Sat, 2005-04-23 at 14:10 -0400, Jan-Willem Maessen wrote:
On Apr 22, 2005, at 5:33 PM, Duncan Coutts wrote:
Aren't there some issues, though, with the fact that regular garbage collection touches most of the heap (even if it doesn't modify it), so that very little of it can be paged out of physical RAM?
This is a common misconception about garbage collection in general.
There are only two reasons for a garbage collector to walk through a given piece of memory: * The memory is live, and may contain pointers; those pointers must be found and traced. * A copying/compacting collector needs to move the data.
Most collectors keep a special large object area which contains big arrays. Even if copying collection is used for other objects, these large objects never move.
Yes, indeed.
Furthermore, if an array contains no pointers (because, for example, it's a byte array read from a file) it does not need to be scanned by the garbage collector.
Like these unboxed array types.
So I wouldn't worry about having your huge binary objects walked by the garbage collector. Whatever GC may do to a heap chock-full of tiny objects, a single large pointer-free object should be left alone.
Sadly the case I had in mind is exactly the former: large syntax trees and large symbol tables. About 400Mb of seldom-accessed, mostly read-only and yet unpageable data.

Then to make things worse we've got some nasty little piece of code which walks the AST and for some inexplicable reason generates vast amounts of garbage. To make things work on normal machines we have to set the heap limit as low as possible, so the garbage collector has to run very frequently, reclaiming very little each time, and yet it has to touch all of the rest of the 400Mb dataset, which prevents it from being paged out. My tests indicate that 3/4 of the running time is spent doing GC. </grumble> :-)

Duncan

On Apr 23, 2005, at 2:54 PM, Duncan Coutts wrote:
On Sat, 2005-04-23 at 14:10 -0400, Jan-Willem Maessen wrote:
So I wouldn't worry about having your huge binary objects walked by the garbage collector. Whatever GC may do to a heap chock-full of tiny objects, a single large pointer-free object should be left alone.
Sadly the case I had in mind is exactly the former: large syntax trees and large symbol tables. About 400Mb of seldom-accessed, mostly read-only and yet unpageable data.
Ah. Now that's another kettle of fish entirely... However, generational GC *ought* to help here. If you're using GHC, I assume you've turned on compacting GC to avoid doubling your memory, and have set an appropriate upper bound on the heap size.
Then to make things worse we've got some nasty little piece of code which walks the AST and for some inexplicable reason generates vast amounts of garbage. To make things work on normal machines we have to set the heap limit as low as possible, so the garbage collector has to run very frequently, reclaiming very little each time, and yet it has to touch all of the rest of the 400Mb dataset, which prevents it from being paged out. My tests indicate that 3/4 of the running time is spent doing GC. </grumble> :-)
Hmm; this sounds like a lot of full-heap collections, which is exactly what generational GC is trying to avoid. A very large old generation (like, say, 500+Mb) might help a lot in this instance; I have no idea how GHC decides generation sizes. It might also help to set a very large allocation area to reduce the promotion rate to the second generation and give the gobs of transient data some time to die, or, similarly, to increase the number of generations to increase the time it takes things to get to the old generation. Fundamentally, though, when you run really close to your memory limits GC tends to be unhappy.

-Jan-Willem Maessen
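For reference, the knobs discussed above all map onto GHC runtime-system flags; a sketch of an invocation (the program name and sizes are made up for illustration):

  ./myprog +RTS -c -M800m -A64m -G3 -RTS

  -c       compact the oldest generation instead of copying it
  -M800m   cap the total heap at 800Mb
  -A64m    use a 64Mb allocation area (nursery), giving short-lived data more time to die
  -G3      use three generations instead of the default two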

Hello, I am just going to summarize what we are proposing so far:
  hGetByte :: Handle -> IO Word8
  hPutByte :: Handle -> Word8 -> IO ()

I like the "byte" terminology even though it may not be absolutely accurate.
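These could probably be emulated today on top of the buffer-based I/O that GHC already provides (hGetBuf/hPutBuf); a minimal sketch, with simplified end-of-file handling:

  import System.IO (Handle, hGetBuf, hPutBuf)
  import Foreign.Marshal.Alloc (allocaBytes)
  import Foreign.Storable (peek, poke)
  import Data.Word (Word8)

  -- Read a single byte by filling a one-byte temporary buffer.
  hGetByte :: Handle -> IO Word8
  hGetByte h =
    allocaBytes 1 $ \p -> do
      n <- hGetBuf h p 1
      if n == 1
        then peek p
        else ioError (userError "hGetByte: end of file")

  -- Write a single byte via a one-byte temporary buffer.
  hPutByte :: Handle -> Word8 -> IO ()
  hPutByte h w =
    allocaBytes 1 $ \p -> do
      poke p w
      hPutBuf h p 1

A real implementation would presumably poke the handle's own buffer directly rather than go through a temporary one.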
Simon suggested that these should only work on handles that are in binary mode, and that seems reasonable to me. Other people would like them to work on files opened in text mode. I am not sure why we need that: text mode suggests that we want to think of the file as containing characters and not bytes (and should ideally support different character encodings, etc.). If a programmer needs to muck around with the raw bytes in a "text" file, they can always implement that on top of the binary file interface.

I also don't think that '\n' should flush buffers in binary mode; I would like a binary file to simply contain bytes. I don't think that line buffering makes sense in that situation, but perhaps treating line buffering as block buffering is reasonable (i.e. lines are as big as the buffer).
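For what it's worth, binary mode and block buffering are already expressible with the current System.IO API; a small sketch (the file name is made up):

  import System.IO

  main :: IO ()
  main = do
    -- openBinaryFile avoids any text-mode translation on the handle
    h <- openBinaryFile "data.bin" ReadWriteMode
    -- with raw bytes, line buffering makes little sense, so ask for block buffering
    hSetBuffering h (BlockBuffering Nothing)
    -- ... byte-oriented I/O would go here ...
    hClose h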
There was also a request for operations that get multiple bytes:

  readBinaryFile  :: FilePath -> IO [Word8]
  writeBinaryFile :: FilePath -> [Word8] -> IO ()

I think something like that might be useful, but I would prefer if they were based on handles and arrays, e.g. something like:

  hGetBytes :: Handle -> Int -> IO (UArray Int Word8)
  hPutBytes :: Handle -> UArray Int Word8 -> IO ()

With the current interface we can kind of implement something like this using hGetArray and unsafeFreeze, but it would be nicer not to have to use the unsafe freeze. The Int argument to the first function says how many bytes to read. Using these it seems easy to implement readBinaryFile and writeBinaryFile.

-Iavor
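For concreteness, here is roughly what that emulation on top of the current interface could look like; a sketch only, assuming unsafeFreeze as exported from Data.Array.MArray, with no error checking:

  import Data.Array.IO (IOUArray, hGetArray, hPutArray)
  import Data.Array.MArray (newArray_, unsafeFreeze, thaw)
  import Data.Array.Unboxed (UArray, bounds)
  import Data.Word (Word8)
  import System.IO (Handle)

  -- Read n bytes into a mutable unboxed array, then (unsafely) freeze it.
  hGetBytes :: Handle -> Int -> IO (UArray Int Word8)
  hGetBytes h n = do
    marr <- newArray_ (0, n - 1) :: IO (IOUArray Int Word8)
    _ <- hGetArray h marr n       -- a real version would check the byte count
    unsafeFreeze marr

  -- Thaw (copy) the immutable array and hand it to hPutArray.
  hPutBytes :: Handle -> UArray Int Word8 -> IO ()
  hPutBytes h arr = do
    marr <- thaw arr :: IO (IOUArray Int Word8)
    let (lo, hi) = bounds arr
    hPutArray h marr (hi - lo + 1)

The unsafeFreeze here is exactly the wart mentioned above; a built-in hGetBytes could fill an immutable array directly.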
participants (4):
- Duncan Coutts
- Iavor Diatchki
- Jan-Willem Maessen
- Simon Marlow