
I'm trying to implement some file I/O where I can read in a file to an array, but do so without having to know how much space I will need. (Before you suggest it, lists are too slow/space consuming.) I was thinking that one possible solution is to use a strategy used in dynamic arrays, where every time you run out of space you double the capacity of the array. The trouble is, most arrays in Haskell would require you to traverse the entire array and copy each individual cell, which would make it worthless.

Are there any low-level array types (MutableByteArray#, for example) that support a memcpy-like function where I could copy the entire array into the new one instead of copying it one value at a time? Is there another solution that I'm missing?

--
Tom Harper
MSc Computer Science '08
University of Oxford
Mobile: +44 (0)7533 998 591
Skype: +1 949 273 4627 (harpertom)

Hello Tom, Thursday, May 29, 2008, 8:21:25 PM, you wrote:
Are there any low-level array types (MutableByteArray#, for example) that support a memcpy-like function
Yes, MBA# allows the use of memcpy. You can find examples of its usage right in the array package, for example:

thawSTUArray :: Ix i => UArray i e -> ST s (STUArray s i e)
thawSTUArray (UArray l u arr#) = ST $ \s1# ->
    case sizeofByteArray# arr#                  of { n# ->
    case newByteArray# n# s1#                   of { (# s2#, marr# #) ->
    case unsafeCoerce# memcpy marr# arr# n# s2# of { (# s3#, () #) ->
    (# s3#, STUArray l u marr# #) }}}

foreign import ccall unsafe "memcpy"
    memcpy :: MutableByteArray# RealWorld -> ByteArray# -> Int# -> IO ()

I can give you further explanations if required.

--
Best regards,
Bulat                            mailto:Bulat.Ziganshin@gmail.com
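Bulat's point can also be seen from the portable Data.Array API: thaw on a UArray goes through exactly this memcpy path, copying the whole array in one block rather than cell by cell. A minimal sketch (bumpHead is an illustrative name, not an existing function):

```haskell
import Data.Array.ST (runSTUArray, thaw, writeArray)
import Data.Array.Unboxed (UArray, elems, listArray, (!))

-- Copy a whole unboxed array in one shot via thaw (memcpy underneath),
-- then mutate a single cell of the fresh copy; the original is untouched.
bumpHead :: UArray Int Int -> UArray Int Int
bumpHead a = runSTUArray $ do
  m <- thaw a                -- whole-array block copy, not cell-by-cell
  writeArray m 0 (a ! 0 + 1)
  return m
```

For example, elems (bumpHead (listArray (0,2) [10,20,30])) is [11,20,30].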

Define "too slow / space-consuming"? I've dealt with this problem a lot
at this point, and I've been able to slurp up CSV files of several
gigabytes in the same amount of time as I generally do in C. I have a
feeling an array solution is just going to bog you down more. However,
if you insist, I would also suggest writing your I/O in C, returning a
ForeignPtr to something, and working from there.
-- Jeff
On Thu, May 29, 2008 at 12:21 PM, Tom Harper wrote:
Are there any low-level array types (MutableByteArray#, for example) that support a memcpy-like function where I could copy the entire array into the new one instead of copying it one value at a time? Is there another solution that I'm missing?
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
-- I try to take things like a crow; war and chaos don't always ruin a picnic, they just mean you have to be careful what you swallow. -- Jessica Edwards
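Jeff's C-style approach can be sketched in plain Haskell too: read the whole file into a malloc'd buffer behind a ForeignPtr with hGetBuf, which does block reads straight into the pointer. A minimal sketch (slurp is an illustrative name, not an existing API):

```haskell
import Data.Word (Word8)
import Foreign.ForeignPtr (ForeignPtr, mallocForeignPtrBytes, withForeignPtr)
import System.IO (IOMode (ReadMode), hFileSize, hGetBuf, withBinaryFile)

-- Read an entire file into one contiguous buffer; no per-byte traversal.
-- Returns the buffer and the number of bytes read into it.
slurp :: FilePath -> IO (ForeignPtr Word8, Int)
slurp path = withBinaryFile path ReadMode $ \h -> do
  n  <- fmap fromIntegral (hFileSize h)
  fp <- mallocForeignPtrBytes n
  _  <- withForeignPtr fp $ \p -> hGetBuf h p n
  return (fp, n)
```

This sizes the buffer from hFileSize up front, so it only suits regular files, not pipes or sockets.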

Isn't fast IO what ByteStrings were invented for?

Adrian

Tom Harper wrote:
Are there any low-level array types (MutableByteArray#, for example) that support a memcpy-like function where I could copy the entire array into the new one instead of copying it one value at a time? Is there another solution that I'm missing?

Exactly. Someone on the list gave me this example a while back for
reading CSV files. I can process a gigabyte (simple unpack and print
to /dev/null for IO testing purposes) in about two and a half seconds
using this code.
import qualified Data.ByteString.Lazy.Char8 as C
import qualified Data.Map as M
import Data.List (foldl', transpose)

-- | Read a datafile and turn it into lists of columns
readDatafileAndTranspose name = do
  sheet <- (transpose . map (C.split '\t') . C.lines) `fmap` C.readFile name
  return $ foldl' go M.empty sheet
  where go m (x:xs) = M.insert (C.unpack x) xs m
        go m []     = m
2008/5/29 Adrian Neumann
Isn't fast IO what ByteStrings were invented for?
Adrian
Tom Harper wrote:
Are there any low-level array types (MutableByteArray#, for example) that support a memcpy-like function where I could copy the entire array into the new one instead of copying it one value at a time? Is there another solution that I'm missing?

Tom Harper wrote:
I'm trying to implement some file I/O where I can read in a file to an array, but do so without having to know how much space I will need.
Why not just read it into a lazy ByteString? Are you looking to use an array with elements of a different type? You could then convert it to a strict ByteString.

Why not just read it into a lazy ByteString? Are you looking to use an array with elements of a different type? You could then convert it to a strict ByteString.
Because I'm writing the Unicode-friendly ByteString =p

--
Tom Harper
MSc Computer Science '08
University of Oxford
Mobile: +44 (0)7533 998 591
Skype: +1 949 273 4627 (harpertom)

On Thu, 2008-05-29 at 11:25 -0700, Bryan O'Sullivan wrote:
Tom Harper wrote:
Because I'm writing the Unicode-friendly ByteString =p
Perhaps I'm not understanding. Why wouldn't you use ByteString for I/O, even if you're writing a different library? After all, ByteString's own internals use memcpy.
Just to clarify Tom's question...

He's designing a proper Unicode type along the lines of ByteString. However, it does not use a ByteString underneath, so we cannot re-use the IO operations from ByteString, though obviously we can steal ideas.

The reason we do not want to re-use ByteString as the underlying representation is that it is not good for short strings, and we expect that for Unicode text (more than for arbitrary blobs of binary data) people will want efficient short strings.

So instead of using a ForeignPtr to pinned heap memory (read: slow allocation, lots of fragmentation) we're looking at using unpinned heap arrays. That means MutableByteArray# (or STUArray for the first prototype). We cannot use memcpy because it operates on raw pointers, and that's no good for a movable heap object like a ByteArray#.

So that's the background to the question. I think the answer is either that there's a primitive to do this copying in the GHC.* libs or we'll need to add one.

Duncan
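For what it's worth, the copy Duncan describes can be sketched with the primitive package (an assumption; that package wraps the relevant GHC byte-array primops), which block-copies unpinned byte arrays without ever exposing a raw pointer to GC-movable memory:

```haskell
import Control.Monad.ST (ST)
import Data.Primitive.ByteArray
  (MutableByteArray, copyMutableByteArray, getSizeofMutableByteArray,
   newByteArray)

-- Double the capacity of a mutable (unpinned) byte array, block-copying
-- the first `used` bytes into the fresh, larger array. This is the
-- doubling step of Tom's dynamic-array strategy.
grow :: MutableByteArray s -> Int -> ST s (MutableByteArray s)
grow old used = do
  cap <- getSizeofMutableByteArray old
  new <- newByteArray (max 1 (cap * 2))
  copyMutableByteArray new 0 old 0 used  -- one block copy, GC-safe
  return new
```

Because copyMutableByteArray works on heap objects directly, the garbage collector may move either array freely around the call.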

Hello Duncan, Thursday, May 29, 2008, 10:57:02 PM, you wrote:
We cannot use memcpy because it operates on raw pointers but that's no good for a movable heap object like a ByteArr#.
That's a false assumption - memcpy is used for copying non-pinned arrays, as in the code from Data.Array.Base I've cited. The idea is that GC shouldn't occur between getting the array address and the actual memcpy call. You may consult Simon Marlow.

--
Best regards,
Bulat                            mailto:Bulat.Ziganshin@gmail.com

duncan.coutts:
I think the answer is either that there's a primitive to do this copying in the GHC.* libs or we'll need to add one.
There are these guys,

foreign import ccall unsafe "__hscore_memcpy_dst_off"
    memcpy_baoff_ba :: RawBuffer -> CInt -> RawBuffer -> CSize -> IO (Ptr ())

foreign import ccall unsafe "__hscore_memcpy_dst_off"
    memcpy_baoff_ptr :: RawBuffer -> CInt -> Ptr a -> CSize -> IO (Ptr ())

type RawBuffer = MutableByteArray# RealWorld

Duncan Coutts
Because I'm writing the Unicode-friendly ByteString =p
He's designing a proper Unicode type along the lines of ByteString.
So - storing 22.5 bit code points instead of 8-bit quantities? Or storing whatever representation from the input, and providing a nice interface on top?
Perhaps I'm not understanding. Why wouldn't you use ByteString for I/O,
Like everybody else, my first reaction is to put a layer (like Char8) on top of lazy bytestrings. For variable-length encodings, you lose direct indexing, but I think this is not very common, and if you need it, you should convert to a fixed length encoding instead. Since a BS is basically a (pointer to array,offset,length) triple, it should be relatively easy to ensure that you don't break a wide char between chunks by adjusting the length (which doesn't have to match the actual array length).
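The length adjustment described above can be sketched for UTF-8 (assuming that encoding; splitComplete and isCont are illustrative names): find where the last complete sequence in a chunk ends, and carry the incomplete tail over to the next chunk.

```haskell
import qualified Data.ByteString as B
import Data.Word (Word8)

-- Is this byte a UTF-8 continuation byte (10xxxxxx)?
isCont :: Word8 -> Bool
isCont b = b >= 0x80 && b < 0xC0

-- Split a chunk so the first part ends on a complete UTF-8 sequence;
-- the (possibly empty) second part is the partial tail to prepend to
-- the next chunk.
splitComplete :: B.ByteString -> (B.ByteString, B.ByteString)
splitComplete bs = B.splitAt (go (B.length bs - 1)) bs
  where
    go j
      | j < 0                 = 0          -- empty, or only continuations
      | isCont (B.index bs j) = go (j - 1) -- scan back to the lead byte
      | seqLen (B.index bs j) <= B.length bs - j = B.length bs  -- complete
      | otherwise             = j          -- incomplete tail starts here
    seqLen b                               -- expected sequence length
      | b < 0x80  = 1
      | b >= 0xF0 = 4
      | b >= 0xE0 = 3
      | b >= 0xC0 = 2
      | otherwise = 1                      -- stray byte; treat as length 1
```

A chunked reader would run this on each block and prepend the returned tail to the following block before decoding.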
The reason we do not want to re-use ByteString as the underlying representation is because they're not good for short strings and we expect that for Unicode text (more than arbitrary blobs of binary data) people will want efficient short strings.
I guess this is where I don't follow: why would you need more short strings for Unicode text than for ASCII or 8-bit Latin text?

-k
--
If I haven't seen further, it is by standing in the footprints of giants

Hello Ketil, Friday, May 30, 2008, 11:00:10 AM, you wrote:
I guess this is where I don't follow: why would you need more short strings for Unicode text than for ASCII or 8-bit latin text?
The BS package requires too much memory for storing short strings. An alternative package that minimizes memory usage would be useful.

--
Best regards,
Bulat                            mailto:Bulat.Ziganshin@gmail.com

On Fri, May 30, 2008 at 9:00 AM, Ketil Malde
Duncan Coutts writes:
The reason we do not want to re-use ByteString as the underlying representation is because they're not good for short strings and we expect that for Unicode text (more than arbitrary blobs of binary data) people will want efficient short strings.
I guess this is where I don't follow: why would you need more short strings for Unicode text than for ASCII or 8-bit latin text?
But ByteStrings are neither ASCII nor 8-bit Latin text! The latter might be internally represented using an 8-bit encoding, but saying that they are the same would be to confuse representation with intended use.

The use case for ByteString is representing a sequence of bytes used in e.g. fast binary socket and file I/O. The intent of the not-yet-existing Unicode string is to represent text, not bytes. These are different use cases, so having to make different trade-offs shouldn't come as a surprise. To give just one example, short (Unicode) strings are common as keys in associative data structures like maps, where fast allocation, small memory footprint and fast comparison are important.

That being said, it's entirely possible that I didn't get your point. Can I also here insert a plea for keeping lazy I/O out of the new Unicode module?

-- Johan

"Johan Tibell"
I guess this is where I don't follow: why would you need more short strings for Unicode text than for ASCII or 8-bit latin text?
But ByteStrings are neither ASCII nor 8-bit Latin text! [...] The intent of the not-yet-existing Unicode string is to represent text not bytes.
Right, so this will replace the .Char8 modules as well? What confused me was my misunderstanding Duncan to mean that Unicode text would somehow imply shorter strings than non-Unicode (i.e. 8-bit) text.
To give just one example, short (Unicode) strings are common as keys in associative data structures like maps
I guess typically, you'd break things down to words, so strings of length 4-10 or so. BS uses three words and LBS four (IIRC), so the cost of sharing typically outweighs the benefit.
Can I also here insert a plea for keeping lazy I/O out of the new Unicode module?
I use ByteString.Lazy almost exclusively. I realize there's a penalty in time and space, but the ability to write applications that stream over multi-GB files is essential. Of course, these applications couldn't care less about Unicode, so perhaps the usage is different.

-k
--
If I haven't seen further, it is by standing in the footprints of giants

Hi!
On Fri, May 30, 2008 at 10:38 AM, Ketil Malde
"Johan Tibell" writes:
The intent of the not-yet-existing Unicode string is to represent text, not bytes.
Right, so this will replace the .Char8 modules as well? What confused me was my misunderstanding Duncan to mean that Unicode text would somehow imply shorter strings than non-Unicode (i.e. 8-bit) text.
Yes.
To give just one example, short (Unicode) strings are common as keys in associative data structures like maps
I guess typically, you'd break things down to words, so strings of length 4-10 or so. BS uses three words and LBS four (IIRC), so the cost of sharing typically outweighs the benefit.
I'm not sure if you would have much sharing in a map as the keys will be unique.
Can I also here insert a plea for keeping lazy I/O out of the new Unicode module?
I use ByteString.Lazy almost exclusively. I realize there's a penalty in time and space, but the ability to write applications that stream over multi-GB files is essential.
Lazy I/O comes with a penalty in terms of correctness! Pretending that I/O and the underlying resource allocations (e.g. file handles) aren't observable is bad. Lazy I/O is kinda, maybe usable for small scripts that read a file or two and spit out a result, but for servers it doesn't work at all.

Lazy I/O requires unsafe* functions and is therefore unsafe. The finalizers required can be arbitrarily complex depending on what kind of resources need to be allocated. The simple case is a file handle, but there's no reason we might not need sockets, locks, etc. to create the lazy ByteString.

Here are two possible interfaces for safe I/O. One is stream-based with explicit close, and the other is fold-based (i.e. inversion of control):
import qualified Data.ByteString as S
-- Stream based I/O.
class InputStream s where
    read  :: s -> IO Word8
    readN :: s -> Int -> IO S.ByteString  -- efficient block reads
    close :: s -> IO ()
openBinaryFile :: InputStream s => FilePath -> IO s
or a left fold over the file's content. The 'foldBytes' function can close the file at EOF.
-- Left fold/callback based I/O.
foldBytes  :: FilePath -> (seed -> Word8 -> Either seed seed) -> seed -> IO seed

-- Efficient block reads.
foldChunks :: FilePath -> (seed -> S.ByteString -> Either seed seed) -> seed -> IO seed
On top of this you might want monadic versions of the above two functions. The case for a Unicode type is analogous.
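A sketch of how Johan's foldChunks could be implemented over strict block reads, under the assumption that Left means "stop early" and Right means "continue" (his message doesn't pin that down); the file handle is closed at EOF or early exit by withBinaryFile:

```haskell
import qualified Data.ByteString as S
import System.IO (IOMode (ReadMode), withBinaryFile)

-- Left fold over a file's contents in chunks; no lazy I/O, and the
-- handle never escapes, so resource lifetime is fully deterministic.
foldChunks :: FilePath -> (seed -> S.ByteString -> Either seed seed)
           -> seed -> IO seed
foldChunks path f z0 = withBinaryFile path ReadMode (go z0)
  where
    go z h = do
      chunk <- S.hGet h 65536            -- one block read (64 KiB)
      if S.null chunk
        then return z                    -- EOF
        else case f z chunk of
               Left  z' -> return z'     -- caller asked to stop early
               Right z' -> go z' h
```

For example, foldChunks path (\n c -> Right (n + S.length c)) 0 computes a file's size in bytes while touching each chunk exactly once.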
Of course, these applications couldn't care less about Unicode, so perhaps the usage is different.
The issue of lazy I/O is orthogonal to ByteString vs Unicode(String).

Cheers,
Johan

"Johan Tibell"
Lazy I/O comes with a penalty in terms of correctness!
Is there a page describing this in more detail? I believe my programs to be correct, but would like to know if there are additional pitfalls, beyond hClosing a handle with outstanding (unevaluated) reads.
Lazy I/O is kinda, maybe usable for small scripts that read a file or two and spit out a result, but for servers it doesn't work at all.
So don't use it for servers, then?
Here are two possible interfaces for safe I/O. One is stream-based with explicit close, and the other is fold-based (i.e. inversion of control):
Say I have two files with variable-size records that need to be merged. In other words, I can do something like:

do xs <- readFile "X"
   ys <- readFile "Y"
   return $ zipWith combine (parseX xs) (parseY ys)

Clearly in the "small script" category and not "servers", but is it still unsafe?
-- Stream based I/O. -- Left fold/callback based I/O.
Perhaps I don't quite understand the concepts, but I don't see how to implement the example as elegantly and simply with these approaches?

-k
--
If I haven't seen further, it is by standing in the footprints of giants

On Fri, 2008-05-30 at 10:38 +0200, Ketil Malde wrote:
"Johan Tibell" writes:
I guess this is where I don't follow: why would you need more short strings for Unicode text than for ASCII or 8-bit Latin text?
But ByteStrings are neither ASCII nor 8-bit Latin text! [...] The intent of the not-yet-existing Unicode string is to represent text not bytes.
Right, so this will replace the .Char8 modules as well?
No, there is still a use-case for that, namely all these network protocols and file formats that mix binary and ASCII data. The only bad thing about the .Char8 modules is how people have picked them up to represent text exclusively which is a step back from [Char] in terms of Unicode stuff.
What confused me was my misunderstanding Duncan to mean that Unicode text would somehow imply shorter strings than non-Unicode (i.e. 8-bit) text.
ByteString is already not a good choice for short ASCII text. Duncan

On Fri, May 30, 2008 at 12:11 PM, Duncan Coutts
On Fri, 2008-05-30 at 10:38 +0200, Ketil Malde wrote:
"Johan Tibell" writes:
But ByteStrings are neither ASCII nor 8-bit Latin text! [...] The intent of the not-yet-existing Unicode string is to represent text, not bytes.
Right, so this will replace the .Char8 modules as well?
No, there is still a use-case for that, namely all these network protocols and file formats that mix binary and ASCII data.
After implementing an HTTP parser using ByteString in my web server, I can say that the only thing you need from the .Char8 module is `pack' to be able to write byte literals. Matching bytes is better done using something like a `ByteSet' rather than the .Char8 functions, as it gives better performance.

-- Johan
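The `ByteSet' Johan mentions is not an existing module; a minimal sketch of the idea as a 256-entry unboxed lookup table (names are illustrative):

```haskell
import qualified Data.Array.Unboxed as A
import Data.Word (Word8)

-- A set of bytes backed by a 256-entry unboxed Bool table, so membership
-- is a single array index rather than a chain of comparisons.
newtype ByteSet = ByteSet (A.UArray Word8 Bool)

-- Build the table once, up front.
fromList :: [Word8] -> ByteSet
fromList ws =
  ByteSet (A.accumArray (\_ b -> b) False (0, 255) [(w, True) | w <- ws])

-- O(1) membership test.
member :: Word8 -> ByteSet -> Bool
member w (ByteSet a) = a A.! w
```

A parser would build one set per character class (e.g. token separators) and test each input byte against it in its inner loop.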

It makes me wonder: can we support concatenation with sharing (e.g. the "rope" data structure)?
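A rope along these lines can be sketched in a few lines (a toy for illustration, not an existing library): concatenation is O(1) and shares both arguments, and flattening is deferred until the contents are actually needed.

```haskell
import qualified Data.ByteString.Char8 as S

-- A toy rope: leaves hold chunks; Cat nodes cache the total length so
-- length queries stay O(1) per node.
data Rope = Leaf S.ByteString
          | Cat Int Rope Rope

ropeLength :: Rope -> Int
ropeLength (Leaf b)    = S.length b
ropeLength (Cat n _ _) = n

-- O(1) concatenation: both sides are shared, nothing is copied yet.
cat :: Rope -> Rope -> Rope
cat a b = Cat (ropeLength a + ropeLength b) a b

-- Flatten once at the end; S.concat does a single allocation and copy.
toByteString :: Rope -> S.ByteString
toByteString = S.concat . chunks
  where chunks (Leaf b)    = [b]
        chunks (Cat _ l r) = chunks l ++ chunks r
```

A production rope would also rebalance and merge small leaves, which this sketch omits.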

Ketil Malde wrote:
I guess this is where I don't follow: why would you need more short strings for Unicode text than for ASCII or 8-bit latin text?
Because ByteString is optimised for dealing with big blobs of binary data, its performance when you split a big ByteString into a pile of words is quite poor. Also, people use ByteStrings for dealing with text mostly because their native character sets let them get away with it, not because it's an intrinsically good idea.

On Thu 2008-05-29 19:19, Tom Harper wrote:
Why not just read it into a lazy ByteString? Are you looking to use an array with elements of a different type? You could then convert it to a strict ByteString.
Because I'm writing the Unicode-friendly ByteString =p
Uh, ByteString is Unicode-agnostic. ByteString.Char8 is not. So why not do IO with lazy ByteString and parse into your own representation (which might look a lot like StorableVector)? Jed

Jed Brown
Uh, ByteString is Unicode-agnostic. ByteString.Char8 is not. So why not do IO with lazy ByteString and parse into your own representation (which might look a lot like StorableVector)?
One problem you might run into doing it this way is if a wide character is split between two different arrays. In that case you have to do some post-processing to put the pieces back together. More efficient, I think, if you could force a given alignment when reading in the lazy bytestring. But there's not a way to do that, is there?

I hope this makes sense. It's the problem I ran into when I tried once to use lazy bytestrings instead of a storable vector, reasoning that the more recent fusion work in bytestring would give a speed boost. But then I was doing numerical stuff, and I don't know much about unicode.

Chad

On Thu 2008-05-29 18:45, Chad Scherrer wrote:
Jed Brown writes:
Uh, ByteString is Unicode-agnostic. ByteString.Char8 is not. So why not do IO with lazy ByteString and parse into your own representation (which might look a lot like StorableVector)?
One problem you might run into doing it this way is if a wide character is split between two different arrays. In that case you have to do some post-processing to put the pieces back together. More efficient, I think, if you could force a given alignment when reading in the lazy bytestring. But there's not a way to do that, is there?
Unless you are reading UTF-32, you won't know what alignment you want until you get there. If I remember correctly, the default block size is nicely aligned so that in practice you shouldn't have to worry about a chunk ending with weird alignment. However, such alignment issues shouldn't affect you unless you are using the internal interface. If you want fast indexing, you have to parse one character at a time anyway so you won't gain anything by unsafe casting (or memcpy) into your data structure. Jed
participants (12)
- Adrian Neumann
- Bryan O'Sullivan
- Bulat Ziganshin
- Chad Scherrer
- Don Stewart
- Duncan Coutts
- Isaac Dupree
- Jed Brown
- Jefferson Heard
- Johan Tibell
- Ketil Malde
- Tom Harper