
I'm trying to implement some file I/O where I can read in a file to an array, but do so without having to know how much space I will need. (Before you suggest it, lists are too slow/space consuming.) I was thinking that one possible solution is to use a strategy used in dynamic arrays, where every time you run out of space you double the capacity of the array. The trouble is, most arrays in Haskell would require you to traverse the entire array and copy each individual cell, which would make it worthless.

Are there any low-level array types (MutableByteArray#, for example) that support a memcpy-like function where I could copy the entire array into the new one instead of copying it one value at a time? Is there another solution that I'm missing?

--
Tom Harper
MSc Computer Science '08
University of Oxford
Mobile: +44 (0)7533 998 591
Skype: +1 949 273 4627 (harpertom)

Hello Tom, Thursday, May 29, 2008, 8:21:25 PM, you wrote:
Are there any low-level array types (MutableByteArray#, for example) that support a memcpy-like function
Yes, MBA# allows the use of memcpy. You can find examples of its usage right in the array package, for example:

thawSTUArray :: Ix i => UArray i e -> ST s (STUArray s i e)
thawSTUArray (UArray l u arr#) = ST $ \s1# ->
    case sizeofByteArray# arr#                  of { n# ->
    case newByteArray# n# s1#                   of { (# s2#, marr# #) ->
    case unsafeCoerce# memcpy marr# arr# n# s2# of { (# s3#, () #) ->
    (# s3#, STUArray l u marr# #) }}}

foreign import ccall unsafe "memcpy"
    memcpy :: MutableByteArray# RealWorld -> ByteArray# -> Int# -> IO ()

I can give you further explanations if required.

--
Best regards,
Bulat                            mailto:Bulat.Ziganshin@gmail.com
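Bulat's point can also be seen from the portable Data.Array API: thaw on a UArray goes through exactly this memcpy path, copying the whole array in one block rather than cell by cell. A minimal sketch (bumpHead is an illustrative name, not an existing function):

```haskell
import Data.Array.ST (runSTUArray, thaw, writeArray)
import Data.Array.Unboxed (UArray, elems, listArray, (!))

-- Copy a whole unboxed array in one shot via thaw (memcpy underneath),
-- then mutate a single cell of the fresh copy; the original is untouched.
bumpHead :: UArray Int Int -> UArray Int Int
bumpHead a = runSTUArray $ do
  m <- thaw a                -- whole-array block copy, not cell-by-cell
  writeArray m 0 (a ! 0 + 1)
  return m
```

For example, elems (bumpHead (listArray (0,2) [10,20,30])) is [11,20,30].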

Define "too slow / space-consuming"? I've dealt with this problem a lot
at this point, and I've been able to slurp up CSV files of several
gigabytes in the same amount of time as I generally do in C. I have a
feeling an array solution is just going to bog you down more. However,
if you insist, I would also suggest writing your I/O in C, returning a
ForeignPtr to something, and working from there.
-- Jeff
On Thu, May 29, 2008 at 12:21 PM, Tom Harper wrote:
Are there any low-level array types (MutableByteArray#, for example) that support a memcpy-like function where I could copy the entire array into the new one instead of copying it one value at a time? Is there another solution that I'm missing?
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
-- I try to take things like a crow; war and chaos don't always ruin a picnic, they just mean you have to be careful what you swallow. -- Jessica Edwards
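Jeff's C-style approach can be sketched in plain Haskell too: read the whole file into a malloc'd buffer behind a ForeignPtr with hGetBuf, which does block reads straight into the pointer. A minimal sketch (slurp is an illustrative name, not an existing API):

```haskell
import Data.Word (Word8)
import Foreign.ForeignPtr (ForeignPtr, mallocForeignPtrBytes, withForeignPtr)
import System.IO (IOMode (ReadMode), hFileSize, hGetBuf, withBinaryFile)

-- Read an entire file into one contiguous buffer; no per-byte traversal.
-- Returns the buffer and the number of bytes read into it.
slurp :: FilePath -> IO (ForeignPtr Word8, Int)
slurp path = withBinaryFile path ReadMode $ \h -> do
  n  <- fmap fromIntegral (hFileSize h)
  fp <- mallocForeignPtrBytes n
  _  <- withForeignPtr fp $ \p -> hGetBuf h p n
  return (fp, n)
```

This sizes the buffer from hFileSize up front, so it only suits regular files, not pipes or sockets.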

Isn't fast IO what ByteStrings were invented for?

Adrian

Tom Harper wrote:
Are there any low-level array types (MutableByteArray#, for example) that support a memcpy-like function where I could copy the entire array into the new one instead of copying it one value at a time? Is there another solution that I'm missing?

Exactly. Someone on the list gave me this example a while back for
reading CSV files. I can process a gigabyte (simple unpack and print
to /dev/null for IO testing purposes) in about two and a half seconds
using this code.
import qualified Data.ByteString.Lazy.Char8 as C
import qualified Data.Map as M
import Data.List (foldl', transpose)

-- | Read a datafile and turn it into lists of columns
readDatafileAndTranspose name = do
  sheet <- (transpose . map (C.split '\t') . C.lines) `fmap` C.readFile name
  return $ foldl' go M.empty sheet
  where go m (x:xs) = M.insert (C.unpack x) xs m
        go m []     = m
2008/5/29 Adrian Neumann
Isn't fast IO what ByteStrings were invented for?
Adrian
Tom Harper wrote:
Are there any low-level array types (MutableByteArray#, for example) that support a memcpy-like function where I could copy the entire array into the new one instead of copying it one value at a time? Is there another solution that I'm missing?

Tom Harper wrote:
I'm trying to implement some file I/O where I can read in a file to an array, but do so without having to know how much space I will need.
Why not just read it into a lazy ByteString? Are you looking to use an array with elements of a different type? You could then convert it to a strict ByteString.

Why not just read it into a lazy ByteString? Are you looking to use an array with elements of a different type? You could then convert it to a strict ByteString.
Because I'm writing the Unicode-friendly ByteString =p

--
Tom Harper
MSc Computer Science '08
University of Oxford
Mobile: +44 (0)7533 998 591
Skype: +1 949 273 4627 (harpertom)

On Thu, 2008-05-29 at 11:25 -0700, Bryan O'Sullivan wrote:
Tom Harper wrote:
Because I'm writing the Unicode-friendly ByteString =p
Perhaps I'm not understanding. Why wouldn't you use ByteString for I/O, even if you're writing a different library? After all, ByteString's own internals use memcpy.
Just to clarify Tom's question...

He's designing a proper Unicode type along the lines of ByteString. However, it does not use a ByteString underneath, so we cannot re-use the IO operations from ByteString, though obviously we can steal ideas.

The reason we do not want to re-use ByteString as the underlying representation is that it is not good for short strings, and we expect that for Unicode text (more than for arbitrary blobs of binary data) people will want efficient short strings.

So instead of using a ForeignPtr to pinned heap memory (read: slow allocation, lots of fragmentation) we're looking at using unpinned heap arrays. That means MutableByteArray# (or STUArray for the first prototype). We cannot use memcpy because it operates on raw pointers, and that's no good for a movable heap object like a ByteArray#.

So that's the background to the question. I think the answer is either that there's a primitive to do this copying in the GHC.* libs or we'll need to add one.

Duncan
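For what it's worth, the copy Duncan describes can be sketched with the primitive package (an assumption; that package wraps the relevant GHC byte-array primops), which block-copies unpinned byte arrays without ever exposing a raw pointer to GC-movable memory:

```haskell
import Control.Monad.ST (ST)
import Data.Primitive.ByteArray
  (MutableByteArray, copyMutableByteArray, getSizeofMutableByteArray,
   newByteArray)

-- Double the capacity of a mutable (unpinned) byte array, block-copying
-- the first `used` bytes into the fresh, larger array. This is the
-- doubling step of Tom's dynamic-array strategy.
grow :: MutableByteArray s -> Int -> ST s (MutableByteArray s)
grow old used = do
  cap <- getSizeofMutableByteArray old
  new <- newByteArray (max 1 (cap * 2))
  copyMutableByteArray new 0 old 0 used  -- one block copy, GC-safe
  return new
```

Because copyMutableByteArray works on heap objects directly, the garbage collector may move either array freely around the call.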

Hello Duncan, Thursday, May 29, 2008, 10:57:02 PM, you wrote:
We cannot use memcpy because it operates on raw pointers but that's no good for a movable heap object like a ByteArr#.
That's a false assumption - memcpy is used for copying non-pinned arrays, as in the code from Data.Array.Base I've cited. The idea is that GC shouldn't occur between getting the array address and the actual memcpy call. You may consult Simon Marlow.

--
Best regards,
Bulat                            mailto:Bulat.Ziganshin@gmail.com

duncan.coutts:
I think the answer is either that there's a primitive to do this copying in the GHC.* libs or we'll need to add one.
There are these guys,

foreign import ccall unsafe "__hscore_memcpy_dst_off"
    memcpy_baoff_ba :: RawBuffer -> CInt -> RawBuffer -> CSize -> IO (Ptr ())

foreign import ccall unsafe "__hscore_memcpy_dst_off"
    memcpy_baoff_ptr :: RawBuffer -> CInt -> Ptr a -> CSize -> IO (Ptr ())

type RawBuffer = MutableByteArray# RealWorld

Duncan Coutts
Because I'm writing the Unicode-friendly ByteString =p
He's designing a proper Unicode type along the lines of ByteString.
So - storing 22.5 bit code points instead of 8-bit quantities? Or storing whatever representation from the input, and providing a nice interface on top?
Perhaps I'm not understanding. Why wouldn't you use ByteString for I/O,
Like everybody else, my first reaction is to put a layer (like Char8) on top of lazy bytestrings. For variable-length encodings, you lose direct indexing, but I think this is not very common, and if you need it, you should convert to a fixed length encoding instead. Since a BS is basically a (pointer to array,offset,length) triple, it should be relatively easy to ensure that you don't break a wide char between chunks by adjusting the length (which doesn't have to match the actual array length).
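The length adjustment described above can be sketched for UTF-8 (assuming that encoding; splitComplete and isCont are illustrative names): find where the last complete sequence in a chunk ends, and carry the incomplete tail over to the next chunk.

```haskell
import qualified Data.ByteString as B
import Data.Word (Word8)

-- Is this byte a UTF-8 continuation byte (10xxxxxx)?
isCont :: Word8 -> Bool
isCont b = b >= 0x80 && b < 0xC0

-- Split a chunk so the first part ends on a complete UTF-8 sequence;
-- the (possibly empty) second part is the partial tail to prepend to
-- the next chunk.
splitComplete :: B.ByteString -> (B.ByteString, B.ByteString)
splitComplete bs = B.splitAt (go (B.length bs - 1)) bs
  where
    go j
      | j < 0                 = 0          -- empty, or only continuations
      | isCont (B.index bs j) = go (j - 1) -- scan back to the lead byte
      | seqLen (B.index bs j) <= B.length bs - j = B.length bs  -- complete
      | otherwise             = j          -- incomplete tail starts here
    seqLen b                               -- expected sequence length
      | b < 0x80  = 1
      | b >= 0xF0 = 4
      | b >= 0xE0 = 3
      | b >= 0xC0 = 2
      | otherwise = 1                      -- stray byte; treat as length 1
```

A chunked reader would run this on each block and prepend the returned tail to the following block before decoding.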
The reason we do not want to re-use ByteString as the underlying representation is because they're not good for short strings and we expect that for Unicode text (more than arbitrary blobs of binary data) people will want efficient short strings.
I guess this is where I don't follow: why would you need more short strings for Unicode text than for ASCII or 8-bit Latin text?

-k
--
If I haven't seen further, it is by standing in the footprints of giants

Hello Ketil, Friday, May 30, 2008, 11:00:10 AM, you wrote:
I guess this is where I don't follow: why would you need more short strings for Unicode text than for ASCII or 8-bit latin text?
The BS package requires too much memory for storing short strings. An alternative package that minimizes memory usage would be useful.

--
Best regards,
Bulat                            mailto:Bulat.Ziganshin@gmail.com

On Fri, May 30, 2008 at 9:00 AM, Ketil Malde
Duncan Coutts writes:
The reason we do not want to re-use ByteString as the underlying representation is because they're not good for short strings and we expect that for Unicode text (more than arbitrary blobs of binary data) people will want efficient short strings.
I guess this is where I don't follow: why would you need more short strings for Unicode text than for ASCII or 8-bit latin text?
But ByteStrings are neither ASCII nor 8-bit Latin text! The latter might be internally represented using an 8-bit encoding, but saying that they are the same would be to confuse representation with intended use.

The use case for ByteString is representing a sequence of bytes used in e.g. fast binary socket and file I/O. The intent of the not-yet-existing Unicode string is to represent text, not bytes. These are different use cases, so having to make different trade-offs shouldn't come as a surprise. To give just one example, short (Unicode) strings are common as keys in associative data structures like maps, where fast allocation, small memory footprint and fast comparison are important.

That being said, it's entirely possible that I didn't get your point. Can I also here insert a plea for keeping lazy I/O out of the new Unicode module?

-- Johan

"Johan Tibell"
I guess this is where I don't follow: why would you need more short strings for Unicode text than for ASCII or 8-bit latin text?
But ByteStrings are neither ASCII nor 8-bit Latin text! [...] The intent of the not-yet-existing Unicode string is to represent text not bytes.
Right, so this will replace the .Char8 modules as well? What confused me was my misunderstanding Duncan to mean that Unicode text would somehow imply shorter strings than non-Unicode (i.e. 8-bit) text.
To give just one example, short (Unicode) strings are common as keys in associative data structures like maps
I guess typically, you'd break things down to words, so strings of length 4-10 or so. BS uses three words and LBS four (IIRC), so the cost of sharing typically outweighs the benefit.
Can I also here insert a plea for keeping lazy I/O out of the new Unicode module?
I use ByteString.Lazy almost exclusively. I realize there's a penalty in time and space, but the ability to write applications that stream over multi-GB files is essential. Of course, these applications couldn't care less about Unicode, so perhaps the usage is different.

-k
--
If I haven't seen further, it is by standing in the footprints of giants

Hi!
On Fri, May 30, 2008 at 10:38 AM, Ketil Malde
"Johan Tibell" writes:
The intent of the not-yet-existing Unicode string is to represent text, not bytes.
Right, so this will replace the .Char8 modules as well? What confused me was my misunderstanding Duncan to mean that Unicode text would somehow imply shorter strings than non-Unicode (i.e. 8-bit) text.
Yes.
To give just one example, short (Unicode) strings are common as keys in associative data structures like maps
I guess typically, you'd break things down to words, so strings of length 4-10 or so. BS uses three words and LBS four (IIRC), so the cost of sharing typically outweighs the benefit.
I'm not sure if you would have much sharing in a map as the keys will be unique.
Can I also here insert a plea for keeping lazy I/O out of the new Unicode module?
I use ByteString.Lazy almost exclusively. I realize there's a penalty in time and space, but the ability to write applications that stream over multi-GB files is essential.
Lazy I/O comes with a penalty in terms of correctness! Pretending that I/O and the underlying resource allocations (e.g. file handles) aren't observable is bad. Lazy I/O is kinda, maybe usable for small scripts that read a file or two and spit out a result, but for servers it doesn't work at all.

Lazy I/O requires unsafe* functions and is therefore unsafe. The finalizers required can be arbitrarily complex depending on what kind of resources need to be allocated. The simple case is a file handle, but there's no reason we might not need sockets, locks, etc. to create the lazy ByteString.

Here are two possible interfaces for safe I/O. One is stream-based with explicit close, and the other is fold-based (i.e. inversion of control):
import qualified Data.ByteString as S
-- Stream based I/O.
class InputStream s where
    read  :: s -> IO Word8
    readN :: s -> Int -> IO S.ByteString  -- efficient block reads
    close :: s -> IO ()
openBinaryFile :: InputStream s => FilePath -> IO s
or a left fold over the file's content. The 'foldBytes' function can close the file at EOF.
-- Left fold/callback based I/O.
foldBytes  :: FilePath -> (seed -> Word8 -> Either seed seed) -> seed -> IO seed

-- Efficient block reads.
foldChunks :: FilePath -> (seed -> S.ByteString -> Either seed seed) -> seed -> IO seed
On top of this you might want monadic versions of the above two functions. The case for a Unicode type is analogous.
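A sketch of how Johan's foldChunks could be implemented over strict block reads, under the assumption that Left means "stop early" and Right means "continue" (his message doesn't pin that down); the file handle is closed at EOF or early exit by withBinaryFile:

```haskell
import qualified Data.ByteString as S
import System.IO (IOMode (ReadMode), withBinaryFile)

-- Left fold over a file's contents in chunks; no lazy I/O, and the
-- handle never escapes, so resource lifetime is fully deterministic.
foldChunks :: FilePath -> (seed -> S.ByteString -> Either seed seed)
           -> seed -> IO seed
foldChunks path f z0 = withBinaryFile path ReadMode (go z0)
  where
    go z h = do
      chunk <- S.hGet h 65536            -- one block read (64 KiB)
      if S.null chunk
        then return z                    -- EOF
        else case f z chunk of
               Left  z' -> return z'     -- caller asked to stop early
               Right z' -> go z' h
```

For example, foldChunks path (\n c -> Right (n + S.length c)) 0 computes a file's size in bytes while touching each chunk exactly once.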
Of course, these applications couldn't care less about Unicode, so perhaps the usage is different.
The issue of lazy I/O is orthogonal to ByteString vs Unicode(String).

Cheers,
Johan

"Johan Tibell"
Lazy I/O comes with a penalty in terms of correctness!
Is there a page describing this in more detail? I believe my programs to be correct, but would like to know if there are additional pitfalls, beyond hClosing a handle with outstanding (unevaluated) reads.
Lazy I/O is kinda, maybe usable for small scripts that read a file or two and spit out a result, but for servers it doesn't work at all.
So don't use it for servers, then?
Here are two possible interfaces for safe I/O. One is stream-based with explicit close, and the other is fold-based (i.e. inversion of control):
Say I have two files with variable-size records that need to be merged. In other words, I can do something like:

do xs <- readFile "X"
   ys <- readFile "Y"
   return $ zipWith combine (parseX xs) (parseY ys)

Clearly in the "small script" category and not "servers", but is it still unsafe?
-- Stream based I/O. -- Left fold/callback based I/O.
Perhaps I don't quite understand the concepts, but I don't see how to implement the example as elegantly and simply with these approaches?

-k
--
If I haven't seen further, it is by standing in the footprints of giants

On Fri, 2008-05-30 at 10:38 +0200, Ketil Malde wrote:
"Johan Tibell" writes:
I guess this is where I don't follow: why would you need more short strings for Unicode text than for ASCII or 8-bit Latin text?
But ByteStrings are neither ASCII nor 8-bit Latin text! [...] The intent of the not-yet-existing Unicode string is to represent text not bytes.
Right, so this will replace the .Char8 modules as well?
No, there is still a use-case for that, namely all these network protocols and file formats that mix binary and ASCII data. The only bad thing about the .Char8 modules is how people have picked them up to represent text exclusively which is a step back from [Char] in terms of Unicode stuff.
What confused me was my misunderstanding Duncan to mean that Unicode text would somehow imply shorter strings than non-Unicode (i.e. 8-bit) text.
ByteString is already not a good choice for short ASCII text. Duncan

On Fri, May 30, 2008 at 12:11 PM, Duncan Coutts
On Fri, 2008-05-30 at 10:38 +0200, Ketil Malde wrote:
"Johan Tibell" writes:
But ByteStrings are neither ASCII nor 8-bit Latin text! [...] The intent of the not-yet-existing Unicode string is to represent text, not bytes.
Right, so this will replace the .Char8 modules as well?
No, there is still a use-case for that, namely all these network protocols and file formats that mix binary and ASCII data.
After implementing an HTTP parser using ByteString in my web server, I can say that the only thing you need from the .Char8 module is `pack' to be able to write byte literals. Matching bytes is better done using something like a `ByteSet' rather than the .Char8 functions, as it gives better performance.

-- Johan
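The `ByteSet' Johan mentions is not an existing module; a minimal sketch of the idea as a 256-entry unboxed lookup table (names are illustrative):

```haskell
import qualified Data.Array.Unboxed as A
import Data.Word (Word8)

-- A set of bytes backed by a 256-entry unboxed Bool table, so membership
-- is a single array index rather than a chain of comparisons.
newtype ByteSet = ByteSet (A.UArray Word8 Bool)

-- Build the table once, up front.
fromList :: [Word8] -> ByteSet
fromList ws =
  ByteSet (A.accumArray (\_ b -> b) False (0, 255) [(w, True) | w <- ws])

-- O(1) membership test.
member :: Word8 -> ByteSet -> Bool
member w (ByteSet a) = a A.! w
```

A parser would build one set per character class (e.g. token separators) and test each input byte against it in its inner loop.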

It makes me wonder: can we support concatenation with sharing (e.g. the "rope" data structure)?
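A rope along these lines can be sketched in a few lines (a toy for illustration, not an existing library): concatenation is O(1) and shares both arguments, and flattening is deferred until the contents are actually needed.

```haskell
import qualified Data.ByteString.Char8 as S

-- A toy rope: leaves hold chunks; Cat nodes cache the total length so
-- length queries stay O(1) per node.
data Rope = Leaf S.ByteString
          | Cat Int Rope Rope

ropeLength :: Rope -> Int
ropeLength (Leaf b)    = S.length b
ropeLength (Cat n _ _) = n

-- O(1) concatenation: both sides are shared, nothing is copied yet.
cat :: Rope -> Rope -> Rope
cat a b = Cat (ropeLength a + ropeLength b) a b

-- Flatten once at the end; S.concat does a single allocation and copy.
toByteString :: Rope -> S.ByteString
toByteString = S.concat . chunks
  where chunks (Leaf b)    = [b]
        chunks (Cat _ l r) = chunks l ++ chunks r
```

A production rope would also rebalance and merge small leaves, which this sketch omits.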

Ketil Malde wrote:
I guess this is where I don't follow: why would you need more short strings for Unicode text than for ASCII or 8-bit latin text?
Because ByteString is optimised for dealing with big blobs of binary data, its performance when you split a big ByteString into a pile of words is quite poor. Also, people use ByteStrings for dealing with text mostly because their native character sets let them get away with it, not because it's an intrinsically good idea.

On Thu 2008-05-29 19:19, Tom Harper wrote:
Why not just read it into a lazy ByteString? Are you looking to use an array with elements of a different type? You could then convert it to a strict ByteString.
Because I'm writing the Unicode-friendly ByteString =p
Uh, ByteString is Unicode-agnostic. ByteString.Char8 is not. So why not do IO with lazy ByteString and parse into your own representation (which might look a lot like StorableVector)? Jed

Jed Brown
Uh, ByteString is Unicode-agnostic. ByteString.Char8 is not. So why not do IO with lazy ByteString and parse into your own representation (which might look a lot like StorableVector)?
One problem you might run into doing it this way is if a wide character is split between two different arrays. In that case you have to do some post-processing to put the pieces back together. More efficient, I think, if you could force a given alignment when reading in the lazy bytestring. But there's not a way to do that, is there?

I hope this makes sense. It's the problem I ran into when I tried once to use lazy bytestrings instead of a storable vector, reasoning that the more recent fusion work in bytestring would give a speed boost. But then I was doing numerical stuff, and I don't know much about unicode.

Chad

On Thu 2008-05-29 18:45, Chad Scherrer wrote:
Jed Brown writes:
Uh, ByteString is Unicode-agnostic. ByteString.Char8 is not. So why not do IO with lazy ByteString and parse into your own representation (which might look a lot like StorableVector)?
One problem you might run into doing it this way is if a wide character is split between two different arrays. In that case you have to do some post-processing to put the pieces back together. More efficient, I think, if you could force a given alignment when reading in the lazy bytestring. But there's not a way to do that, is there?
Unless you are reading UTF-32, you won't know what alignment you want until you get there. If I remember correctly, the default block size is nicely aligned so that in practice you shouldn't have to worry about a chunk ending with weird alignment. However, such alignment issues shouldn't affect you unless you are using the internal interface. If you want fast indexing, you have to parse one character at a time anyway so you won't gain anything by unsafe casting (or memcpy) into your data structure. Jed
participants (12)
- Adrian Neumann
- Bryan O'Sullivan
- Bulat Ziganshin
- Chad Scherrer
- Don Stewart
- Duncan Coutts
- Isaac Dupree
- Jed Brown
- Jefferson Heard
- Johan Tibell
- Ketil Malde
- Tom Harper