Array, Vector, Bytestring

Hi everyone,

Every time I want to use an array in Haskell, I find myself having to look up in the docs how they are used and which modules exactly I have to import ... and I am a bit tired of staring at type signatures for 10 minutes to figure out how these arrays work every time I use them (it's even worse when you have to write the signatures yourself). I wonder how other people perceive this issue and what possible solutions could be.

E.g. look at the type signature for changing a single entry in an array:

* vector package: write :: PrimMonad m => MVector (PrimState m) a -> Int -> a -> m ()
* array package: writeArray :: (MArray a e m, Ix i) => a i e -> i -> e -> m ()
* bytestring package: not available
* a reasonable type signature would be: write :: MVector a -> Int -> a -> ST s a

# Too many different implementations

Why do we need so many different implementations of the same thing? In the GHC libraries alone we have the vector, array and bytestring packages, all of which do the same thing, as demonstrated for instance by the vector-bytestring package. To make matters worse, the Haskell 2010 standard includes a watered-down version of array.

I personally prefer the vector package, but if I want to use libraries for strings, I am forced back to bytestring. And now that Array was added to Haskell 2010, it will start to pop up more and more. The multitude of already existing libraries is also why I am writing this to haskell-cafe instead of spinning yet another implementation.

# Index

I don't really see a reason for having an index of a type other than Int that starts anywhere else than at 0. While there might be corner cases where such a thing is useful, it is only confusing and irritating for the (presumably) rest of us.

# Storable vs Unboxed

Is there really a difference between Storable and Unboxed arrays, and if so, can't this be fixed in the compiler rather than exposing the problem to the programmer?
# ST s vs IO

This is probably the hardest issue to resolve. The easiest solution is probably to just have a module for each of them, as in the array package. I find the PrimState approach a bit complicated and circuitous. The ideal solution would be to have

type IO a = ST RealWorld# a

in the next Haskell standard. If I understand it correctly, this would let you declare the types as follows and still use them in an IO context:

write :: MVector a -> Int -> a -> ST s a

Would this work? I know it would probably screw up all instances

instance ... IO where

On the other hand, this is not only an issue with arrays but with all procedurally accelerated data structures, like hash tables.

Silvio
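For what it's worth, the PrimMonad signature quoted above does already let a single write-helper run in both ST and IO; a minimal sketch, assuming the vector package (setFirst, pureVersion and ioVersion are made-up names for illustration):

```haskell
import Control.Monad.Primitive (PrimMonad, PrimState)
import Control.Monad.ST (runST)
import qualified Data.Vector.Unboxed as V
import qualified Data.Vector.Unboxed.Mutable as M

-- One helper, usable from both ST and IO thanks to PrimMonad.
setFirst :: (V.Unbox a, PrimMonad m) => M.MVector (PrimState m) a -> a -> m ()
setFirst v x = M.write v 0 x

-- Used purely, via ST:
pureVersion :: V.Vector Int
pureVersion = runST $ do
  v <- M.replicate 3 0   -- [0,0,0]
  setFirst v 42          -- [42,0,0]
  V.freeze v

-- The same helper, used from IO:
ioVersion :: IO (V.Vector Int)
ioVersion = do
  v <- M.replicate 3 (0 :: Int)
  setFirst v 42
  V.freeze v
```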

silvio
Hi everyone,
Every time I want to use an array in Haskell, I find myself having to look up in the doc how they are used, which exactly are the modules I have to import ... and I am a bit tired of staring at type signatures for 10 minutes to figure out how these arrays work every time I use them (It's even worse when you have to write the signatures). I wonder how other people perceive this issue and what possible solutions could be.
Recently I’ve started to perceive this issue as “hooray, we have lenses now, a generic interface for all the different messy stuff we have”. But yes, the inability to have One Common API for All Data Structures is bothering me as well.
Why do we need so many different implementations of the same thing? In the GHC libraries alone we have the vector, array and bytestring packages, all of which do the same thing, as demonstrated for instance by the vector-bytestring package. To make matters worse, the Haskell 2010 standard includes a watered-down version of array.
Indeed. What we need is `text` for strings (and stop using `bytestring`) and reworked `vector` for arrays (with added code from `StorableVector` — basically a lazy ByteString-like chunked array).
# Index
I don't really see a reason for having an index of a type other than Int and that starts somewhere else than at 0.
It’s a bad idea. I, for one, don’t really see how writing `Vector (Vector (Vector Int))` can be considered even remotely satisfying by anyone. And if you’re considering 3D arrays “a corner case”, then I’m afraid I can’t agree with you. Also, arrays which allow negative indexing can save a lot of headache and prevent mistakes which generally occur when a programmer is forced to constantly keep in mind that index 2000 is actually 0 and 0 is −2000.
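To make that concrete, here is what the array package's flexible Ix indices give you out of the box (a small sketch using only Data.Array; the names are made up):

```haskell
import Data.Array

-- An array indexed from -3 to 3: negative indices just work.
centered :: Array Int String
centered = listArray (-3, 3) ["a", "b", "c", "d", "e", "f", "g"]

-- A 2x2x2 array indexed by tuples: no Vector (Vector (Vector Int)) needed.
cube :: Array (Int, Int, Int) Int
cube = listArray ((0, 0, 0), (1, 1, 1)) [0 .. 7]

-- Index 0 sits in the middle of 'centered'; the tuple index is row-major.
demo :: (String, Int)
demo = (centered ! 0, cube ! (1, 0, 1))
```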
# Storable vs Unboxed
Is there really a difference between Storable and Unboxed arrays and if so can't this be fixed in the complier rather than having to expose this problem to the programmer?
Storable seems to be mainly for marshalling, and most people who need it are (probably) library writers. I don’t know for sure, though, but it doesn’t appear to be a big issue.
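A sketch of that marshalling use case: a Storable vector's payload lives in a plain buffer that C-level code can read directly, so you can hand out the raw pointer. Here the buffer is walked with peekElemOff, the way a foreign function given that pointer would read it (sumViaPtr is a made-up name; assumes the vector package):

```haskell
import qualified Data.Vector.Storable as VS
import Foreign.Storable (peekElemOff)

-- Sum the elements by reading the underlying buffer through the raw
-- pointer, exactly as foreign code handed this pointer would.
sumViaPtr :: VS.Vector Double -> IO Double
sumViaPtr v = VS.unsafeWith v $ \ptr ->
  let go acc i
        | i >= VS.length v = return acc
        | otherwise        = do x <- peekElemOff ptr i
                                go (acc + x) (i + 1)
  in go 0 0
```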
# ST s vs IO
This is probably the hardest to resolve issue. The easiest solution is probably to just have a module for each of them as in the array package. I find the PrimState a bit complicated and circuitous.
The ideal solution would be to have
type IO a = ST RealWorld# a
in the next haskell standard.
Sure, except that IO is actually *not* ST+Realworld, and only happens to be implemented like that in GHC (not in JHC, for instance). It has been discussed before: http://haskell.1045720.n5.nabble.com/IO-ST-RealWorld-td3190075.html . (Not to mention people attempting to rewrite RealWorld# values and create havoc and fire missiles everywhere expecting them to disappear the very moment they smile knowingly and `put` the original RealWorld# back.)
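Note that GHC already exposes the direction that is safe today: Control.Monad.ST.stToIO runs an ST RealWorld computation inside IO, so an ST-typed function can be reused from IO without any RealWorld# appearing in user code. A small sketch using only base (counter and demo are illustrative names):

```haskell
import Control.Monad.ST (ST, runST, stToIO)
import Data.STRef (modifySTRef, newSTRef, readSTRef)

-- An ST computation, polymorphic in s.
counter :: ST s Int
counter = do
  r <- newSTRef 0
  mapM_ (\_ -> modifySTRef r (+ 1)) [1 .. 5 :: Int]
  readSTRef r

-- Run it purely with runST, and in IO with stToIO (s ~ RealWorld).
demo :: IO (Int, Int)
demo = do
  n <- stToIO counter
  return (runST counter, n)
```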

Artyom Kazak
silvio
wrote in their message of Mon, 03 Jun 2013 22:16:08 +0300: Hi everyone,
Every time I want to use an array in Haskell, I find myself having to look up in the doc how they are used, which exactly are the modules I have to import ... and I am a bit tired of staring at type signatures for 10 minutes to figure out how these arrays work every time I use them (It's even worse when you have to write the signatures). I wonder how other people perceive this issue and what possible solutions could be.
Recently I’ve started to perceive this issue as “hooray, we have lenses now, a generic interface for all the different messy stuff we have”. But yes, the inability to have One Common API for All Data Structures is bothering me as well.
Why do we need so many different implementations of the same thing? In the GHC libraries alone we have the vector, array and bytestring packages, all of which do the same thing, as demonstrated for instance by the vector-bytestring package. To make matters worse, the Haskell 2010 standard includes a watered-down version of array.
Indeed. What we need is `text` for strings (and stop using `bytestring`) and reworked `vector` for arrays (with added code from `StorableVector` — basically a lazy ByteString-like chunked array).
To be perfectly clear, ByteString and Text target very different use cases and are hardly interchangeable. While ByteString is, as the name suggests, a string of bytes, Text is a string of characters in a Unicode encoding. When you are talking about unstructured binary data, you should most certainly be using ByteString.

Cheers,
- Ben

Oops.
Ben Gamari
To be perfectly clear, ByteString and Text target very different use cases and are hardly interchangeable. While ByteString is, as the name suggests, a string of bytes, Text is a string of characters in a Unicode encoding. When you are talking about unstructured binary data, you should most certainly be using ByteString.
Why create a special case? Right now you should use ByteString, yes, but I wish I could just use a generic array of Word8.

On Mon, 03 Jun 2013 19:16:08 +0000
silvio
Hi everyone,
Every time I want to use an array in Haskell, I find myself having to look up in the doc how they are used, which exactly are the modules I have to import ... and I am a bit tired of staring at type signatures for 10 minutes to figure out how these arrays work every time I use them (It's even worse when you have to write the signatures). I wonder how other people perceive this issue and what possible solutions could be.
My opinion: it's every bit as bad as you say it is... Not a clue as to what can be done about it. Probably yet another vector module.

How is this a problem?

If you're representing text, use 'text'. If you're representing a string of bytes, use 'bytestring'. If you want an "array" of values, think C++ and use 'vector'. If you want to mutate arrays, first, make sure you do. You probably don't. If you're sure, use MVector.

Don't use String, except to interface with legacy code. You probably want 'text'. Don't use Array. Anything it can be used for can be done with 'vector'.

This covers all the use-cases that I can think of.

- Clark

On Monday, June 3, 2013, wrote:
On Mon, 03 Jun 2013 19:16:08 +0000 silvio
wrote: Hi everyone,
Every time I want to use an array in Haskell, I find myself having to look up in the doc how they are used, which exactly are the modules I have to import ... and I am a bit tired of staring at type signatures for 10 minutes to figure out how these arrays work every time I use them (It's even worse when you have to write the signatures). I wonder how other people perceive this issue and what possible solutions could be.
My opinion: it's every bit as bad as you say it is... Not a clue as to what can be done about it.
Probably yet another vector module.
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

On Mon, Jun 3, 2013 at 7:45 PM, Clark Gaebel
How is this a problem?
If you're representing text, use 'text'. If you're representing a string of bytes, use 'bytestring'. If you want an "array" of values, think c++ and use 'vector'. If you want to mutate arrays, first, make sure you do. You probably don't. If you're sure, use MVector.
Don't use String, except to interface with legacy code. You probably want 'text'. Don't use Array. Anything it can be used for, can be done with 'vector'.
You have to build multidimensional accessors for vector yourself. Array supports them out of the box. I still prefer vector, but it's only fair to note that multidimensional data is a weak spot of vector. Jason
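A minimal sketch of the kind of wrapper meant here, flattening a 2D index onto one flat unboxed Vector (Matrix, fromLists and at are made-up names; assumes the vector package):

```haskell
import qualified Data.Vector.Unboxed as V

-- A 2D matrix stored row-major in one flat unboxed vector.
data Matrix a = Matrix
  { rows  :: !Int
  , cols  :: !Int
  , cells :: !(V.Vector a)
  }

fromLists :: V.Unbox a => [[a]] -> Matrix a
fromLists xss =
  Matrix (length xss) (length (head xss)) (V.fromList (concat xss))

-- Row-major 2D indexing on the flat vector.
at :: V.Unbox a => Matrix a -> Int -> Int -> a
at (Matrix _ c v) i j = v V.! (i * c + j)
```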

That's absolutely true. Wrappers around vector for your multidimensional access are probably best, but Vectors of Vectors are usually easier.

But again, you're right. Multidimensional access is a pain. If it's a "matrix" of numerical values, you could take a look at 'hmatrix'.

- Clark

On Monday, June 3, 2013, Jason Dagit wrote:
On Mon, Jun 3, 2013 at 7:45 PM, Clark Gaebel
wrote: How is this a problem?
If you're representing text, use 'text'. If you're representing a string of bytes, use 'bytestring'. If you want an "array" of values, think c++ and use 'vector'. If you want to mutate arrays, first, make sure you do. You probably don't. If you're sure, use MVector.
Don't use String, except to interface with legacy code. You probably want 'text'. Don't use Array. Anything it can be used for, can be done with 'vector'.
You have to build multidimensional accessors for vector yourself. Array supports them out of the box. I still prefer vector, but it's only fair to note that multidimensional data is a weak spot of vector.
Jason

On Mon, 3 Jun 2013 23:19:38 -0400
Clark Gaebel
That's absolutely true. Wrappers around vector for your multidimensional access is probably best, but Vectors of Vectors are usually easier.
But again, you're right. Multidimensional access is a pain. If it's a "matrix" of numerical values, you could take a look at 'hmatrix'.
or repa

Hi Clark,
How is this a problem?
If you're representing text, use 'text'. If you're representing a string of bytes, use 'bytestring'. If you want an "array" of values, think c++ and use 'vector'.
the problem is that all those packages implement the exact same data type from scratch, instead of re-using an implementation of a general-purpose array internally. That is neither desirable nor necessary.

Take care,
Peter

On Tue, Jun 04, 2013 at 04:01:37PM +0200, Peter Simons wrote:
How is this a problem?
If you're representing text, use 'text'. If you're representing a string of bytes, use 'bytestring'. If you want an "array" of values, think c++ and use 'vector'.
the problem is that all those packages implement the exact same data type from scratch, instead of re-using an implementation of a general-purpose array internally. That is hardly desirable, nor is it necessary.
Just to clarify for those on the sidelines, the issue is duplication of implementation details, rather than duplication of functionality? Tom

Just to clarify for those on the sidelines, the issue is duplication of implementation details, rather than duplication of functionality?
Well, to me that is not the main issue. The main issue is that you have to study all of them, and depending on which libraries you want to use, you have to convert between them, which can be expensive and is definitely annoying.

I made a few simple benchmarks comparing the three libraries; you can find the code attached. This is compiled with -O2.

# simple sum of 1000000 Word8 elements

Unboxed Vector 1.114060 ms
Storable Vector 795.1207 us
Primitive Vector 1.116145 ms
ByteString 9.076256 ms
array library has no fold or sum function

# simple sum of 1000000 more or less randomly chosen elements

Unboxed Vector (unsafe) 33.74364 ms
Storable Vector (unsafe) 50.27273 ms
Storable Vector (safe) 67.01634 ms
Primitive Vector (unsafe) 56.29919 ms
ByteString (unsafe) 19.29611 ms
ByteString (safe) 18.29065 ms
UArray (safe) 46.88719 ms
unsafe does not exist for array

So Unboxed can be better than Storable, but doesn't need to be. Also, which implementation is faster depends very much on the problem at hand. And array is just missing half the needed features.

Silvio
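The attached code is not reproduced here, but the first benchmark boils down to comparing expressions like the following (a reconstruction from the numbers above, not the actual attachment; assumes the vector and bytestring packages):

```haskell
import qualified Data.ByteString as B
import qualified Data.Vector.Storable as S
import qualified Data.Vector.Unboxed as U
import Data.Word (Word8)

-- Sum n ones, once per representation; a benchmark harness such as
-- criterion would time each of these at n = 1000000.
sumUnboxed, sumStorable, sumByteString :: Int -> Word8
sumUnboxed    n = U.sum (U.replicate n 1)
sumStorable   n = S.sum (S.replicate n 1)
sumByteString n = B.foldl' (+) 0 (B.replicate n 1)
```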

I really don't understand this concern. These libraries are tuned for wildly different workloads and use cases, so these sorts of micro-benchmarks are an apples-to-frogs comparison. (Even aside from the fact that you'll get very different performance if you use -fllvm and set things up so the array indexing and associated loop code get inlined and fused together!)

What is the actual concern? Strawman micro-benchmarks that don't even compare the respective libraries for their intended use cases seem... weird.
On Tue, Jun 4, 2013 at 12:49 PM, silvio
Just to clarify for those on the sidelines, the issue is duplication of
implementation details, rather than duplication of functionality?
Well, to me that is not the main issue. The main issue is that you have to study all of them, and depending on which libraries you want to use, you have to convert between them, which can be expensive and is definitely annoying.
I made a few simple benchmarks comparing the three libraries you can find the code attached.
this is compiled with -O2
# simple sum of 1000000 Word8 elements
Unboxed Vector 1.114060 ms
Storable Vector 795.1207 us
Primitive Vector 1.116145 ms
ByteString 9.076256 ms
array library has no fold or sum function
# simple sum of 1000000 more or less randomly chosen elements
Unboxed Vector (unsafe) 33.74364 ms
Storable Vector (unsafe) 50.27273 ms
Storable Vector (safe) 67.01634 ms
Primitive Vector (unsafe) 56.29919 ms
ByteString (unsafe) 19.29611 ms
ByteString (safe) 18.29065 ms
UArray (safe) 46.88719 ms
unsafe does not exist for array
So Unboxed can be better than Storable but doesn't need to be. Also, which implementation is faster depends very much on the problem at hand. And array is just missing half the needed features.
Silvio

These libraries are tuned for wildly different workloads and use cases, so these sorts of micro benchmarks are an Apples to Frogs comparisons.
You can argue that for any benchmark, but sometimes the choice is between Apples and Frogs. If you have some more extensive benchmarks I'm happy to have a look at them.
(even aside from the fact that you'll get very different perf if you used -fllvm and set things up so the array indexing and associated loop code get inlined and fused together!)
I think LLVM should be the default by now. In any case, if you write code, it is important to know how well something works out of the box, without having to spend hours optimizing it.
what is the actual concern? Strawman micro benchmarks that don't even compare the respective libraries for their intended use cases seeems.... weird.
Perhaps I should have explained that better. If one library were clearly superior to the others, that would have made it easier to choose. Also, I wanted to check whether Unboxed is usually better than Storable, as they are semantically the same (correct me if I'm wrong). It is, in one example. Still, I think Storable could be improved so that we don't need Unboxed at all.

I would have said that sum/fold (i.e. consecutive access of the elements) is a reasonable use case for how we typically use bytestring, and the random-access sum is a reasonable use case for how we typically use array/vector. Interestingly enough, the performance was exactly opposed to these reasonable use cases.

Silvio

On 05/06/13 02:49, silvio wrote:
Just to clarify for those on the sidelines, the issue is duplication of implementation details, rather than duplication of functionality?
Well to me, that is not the main issue. The main issue is that you have to study all of them and depending on which libraries you want to use have to convert between them, which could be expensive and is definitely annoying.
I made a few simple benchmarks comparing the three libraries you can find the code attached.
this is compiled with -O2
# simple sum of 1000000 Word8 elements
Unboxed Vector 1.114060 ms
Storable Vector 795.1207 us
Primitive Vector 1.116145 ms
ByteString 9.076256 ms
array library has no fold or sum function
# simple sum of 1000000 more or less randomly chosen elements
Unboxed Vector (unsafe) 33.74364 ms
Storable Vector (unsafe) 50.27273 ms
Storable Vector (safe) 67.01634 ms
Primitive Vector (unsafe) 56.29919 ms
ByteString (unsafe) 19.29611 ms
ByteString (safe) 18.29065 ms
UArray (safe) 46.88719 ms
unsafe does not exist for array
So Unboxed can be better than Storable but doesn't need to be. Also, which implementation is faster depends very much on the problem at hand. And array is just missing half the needed features.
Silvio
array does provide folding functions, found in its Foldable and Traversable instances.

On 05/06/13 07:01, silvio wrote:
array does provide folding functions, found in its Foldable and Traversable instances.
Where can I find this? I can find it neither in the array package, nor with Google, nor with Hoogle.
Silvio
Data.Foldable and Data.Traversable; if you Hoogle "Foldable" or "Traversable" you'll find their modules' docs.
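Concretely, the class methods come from Data.Foldable in base and work on any Array via its Foldable instance, so the fold silvio was missing is just:

```haskell
import Data.Array (Array, listArray)
import qualified Data.Foldable as F

arr :: Array Int Int
arr = listArray (0, 4) [1 .. 5]

-- sum, foldr, etc. from Data.Foldable work on (Array i) directly.
total :: Int
total = F.sum arr
```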

Hi Tom,
On Tue, Jun 04, 2013 at 04:01:37PM +0200, Peter Simons wrote:
How is this a problem?
If you're representing text, use 'text'. If you're representing a string of bytes, use 'bytestring'. If you want an "array" of values, think c++ and use 'vector'.
the problem is that all those packages implement the exact same data type from scratch, instead of re-using an implementation of a general-purpose array internally. That is hardly desirable, nor is it necessary.
Just to clarify for those on the sidelines, the issue is duplication of implementation details, rather than duplication of functionality?
I am not sure what the terms "duplication of implementation details" and "duplication of functionality" mean in this context. Could you please explain how these two concepts differ in your opinion? Take care, Peter

On Tue, Jun 04, 2013 at 11:23:16PM +0200, Peter Simons wrote:
On Tue, Jun 04, 2013 at 04:01:37PM +0200, Peter Simons wrote:
If you're representing text, use 'text'. If you're representing a string of bytes, use 'bytestring'. If you want an "array" of values, think c++ and use 'vector'.
the problem is that all those packages implement the exact same data type from scratch, instead of re-using an implementation of a general-purpose array internally. That is hardly desirable, nor is it necessary.
Just to clarify for those on the sidelines, the issue is duplication of implementation details, rather than duplication of functionality?
I am not sure what the terms "duplication of implementation details" and "duplication of functionality" mean in this context. Could you please explain how these two concepts differ in your opinion?
Hi Peter, When I say "duplication of implementation details" I believe I mean something like your implementing "the exact same data type from scratch". By "duplication of functionality", on the other hand, I mean providing two libraries with similar APIs which essentially serve the same purpose. I believe you are suggesting that there is redundancy in the implementation details of these libraries, not in the APIs they expose. Then again, I was just trying to understand the discussion at hand. I don't have an opinion on it. Tom

Hi Tom, thank you for the explanation.
I believe you are suggesting that there is redundancy in the implementation details of these libraries, not in the APIs they expose.
I meant to say that there is redundancy in *both*. The libraries mentioned in this thread re-implement the same type internally and expose APIs to the user that are largely identical. Take care, Peter

On 5 June 2013 11:50, Peter Simons
I meant to say that there is redundancy in *both*. The libraries mentioned in this thread re-implement the same type internally and expose APIs to the user that are largely identical.
I agree. I hope that ByteStrings will be replaced by a Storable.Vector of Word8 at some point in the future. To make the transition easier I have an experimental library which defines a ByteString as a type synonym of a Storable.Vector of Word8 and provides the same interface as the bytestring package: https://github.com/basvandijk/vector-bytestring

It includes a comprehensive benchmark suite which compares it to bytestring. IIRC some functions are way faster in vector than their bytestring equivalents, and they have the potential to fuse. However, some functions are still way slower, so more work has to be done in vector before it beats bytestring across the board.

Bas
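The core idea of vector-bytestring can be sketched in a few lines (heavily simplified; the real package re-exports a full bytestring-compatible API, and pack/unpack here are just illustrative helpers):

```haskell
import Data.Char (ord)
import Data.Word (Word8)
import qualified Data.Vector.Storable as VS

-- The whole trick: ByteString becomes a plain Storable vector of bytes.
type ByteString = VS.Vector Word8

pack :: String -> ByteString
pack = VS.fromList . map (fromIntegral . ord)

unpack :: ByteString -> String
unpack = map (toEnum . fromIntegral) . VS.toList
```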

To make the transition easier I have an experimental library which defines a ByteString as a type synonym of a Storable.Vector of Word8 and provides the same interface as the bytestring package:
That's interesting, Bas. What bothers me about ByteStrings is that they need to be "pinned" inside the heap, preventing the GC from collecting them. This is even more of an issue, I think, if a program uses them massively and they need to be allocated persistently (example: a long-lived constant). I know it is still a marginal case, but knowing that a part of my heap is pinned makes my sleep quality degrade :( I assume that working with vector removes the problem, correct?

A.

On 10 July 2013 08:57, Alfredo Di Napoli
To make the transition easier I have an experimental library which defines a ByteString as a type synonym of a Storable.Vector of Word8 and provides the same interface as the bytestring package:
That's interesting Bas. What bothers me about ByteStrings is that they need to be "pinned" inside the heap, preventing the GC from collecting them.
Being "pinned" doesn't prevent an object from being garbage collected. It just means that the GC won't move the object around so that foreign code can reliably reference the object while the GC is running: http://ghc.haskell.org/trac/ghc/wiki/Commentary/Rts/Storage/GC/Pinned
I assume that working with vector remove the problem, correct?
There wasn't a problem in the first place, but note that a Storable Vector is implemented in the same way as a ByteString: a ForeignPtr and a length.*

I hope I have now improved your sleep quality ;-)

Cheers,
Bas

* A ByteString also contains an offset, but vector modifies the pointer in the ForeignPtr instead, so we save an Int there.
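For reference, the two representations being compared look roughly like this (constructor and field layout simplified from the internals of the bytestring and vector packages; treat the exact field order as an assumption):

```haskell
import Foreign.ForeignPtr (ForeignPtr, mallocForeignPtrBytes)
import Data.Word (Word8)

-- bytestring: buffer, offset into the buffer, and length.
data ByteString = PS !(ForeignPtr Word8) !Int !Int

-- Storable vector: only buffer and length; the offset is folded
-- into the ForeignPtr itself, saving one Int per value.
data Vector a = Vector !(ForeignPtr a) !Int

-- Build both over the same 8-byte buffer, just to show the shapes.
demo :: IO (Int, Int)
demo = do
  fp <- mallocForeignPtrBytes 8
  let PS _ off len  = PS fp 0 8
      Vector _ vlen = Vector fp 8
  return (off + len, vlen)
```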

Hello Bas,
sorry for being unclear. What you say is correct; I realised after posting that the really annoying thing is memory fragmentation. Because the GC can't move those objects, if we have long-running processes, our memory will become more and more fragmented, correct? :(

A.
On 10 July 2013 08:25, Bas van Dijk
On 10 July 2013 08:57, Alfredo Di Napoli
wrote: To make the transition easier I have an experimental library which defines a ByteString as a type synonym of a Storable.Vector of Word8 and provides the same interface as the bytestring package:
That's interesting Bas. What bothers me about ByteStrings is that they need to be "pinned" inside the heap, preventing the GC from collecting them.
Being "pinned" doesn't prevent an object from being garbage collected. It just means that the GC won't move the object around so that foreign code can reliably reference the object while the GC is running:
http://ghc.haskell.org/trac/ghc/wiki/Commentary/Rts/Storage/GC/Pinned
I assume that working with vector remove the problem, correct?
There wasn't a problem in the first place, but note that a Storable Vector is implemented in the same way as a ByteString: a ForeignPtr and a length*
I hope I have now improved your sleep quality ;-)
Cheers,
Bas
* A ByteString also contains an offset, but vector modifies the pointer in the ForeignPtr instead, so we save an Int there.
participants (12)
-
Alfredo Di Napoli
-
Artyom Kazak
-
Bas van Dijk
-
Ben Gamari
-
briand@aracnet.com
-
Carter Schonwald
-
Clark Gaebel
-
Jason Dagit
-
Mike Ledger
-
Peter Simons
-
silvio
-
Tom Ellis