bytestring vs. uvector

Hi all, For a while now, we have had Data.ByteString[.Lazy][.Char8] for our fast strings. Now we also have Data.Text, which does the same for Unicode. These seem to be the standard for dealing with lists of bytes and characters. Now we also have the storablevector, uvector, and vector packages. These seem to be also useful for unpacked data, *including* Char and Word8 values. What is the difference between bytestring and these new "fast array" libraries? Are the latter just generalizations of the former? Thanks for any insight anyone can give on this. Alex

On Sat, Mar 7, 2009 at 10:23 PM, Alexander Dunlap < alexander.dunlap@gmail.com> wrote:
Hi all,
For a while now, we have had Data.ByteString[.Lazy][.Char8] for our fast strings. Now we also have Data.Text, which does the same for Unicode. These seem to be the standard for dealing with lists of bytes and characters.
Now we also have the storablevector, uvector, and vector packages. These seem to be also useful for unpacked data, *including* Char and Word8 values.
What is the difference between bytestring and these new "fast array" libraries? Are the latter just generalizations of the former?
There are quite a few overlaps and differences among them. bytestring is mature and useful for low-level byte buffer manipulations, and also for efficient I/O. This is in part because it uses pinned pointers that can interoperate easily with foreign code. It used to have an early fusion rewriting framework, but that was abandoned. So it will not fuse multiple ByteString traversals into single loops. This library is widely used, and also somewhat abused for text I/O. storablevector is not mature (I'm not even sure if it's actually used) and is a derivative of an old version of the bytestring library, and so has similar characteristics for interacting with foreign code. It contains some old fusion code that is sketchy in nature and somewhat likely to be broken. I'm not sure I would recommend using this library. uvector is, if my memory serves me correctly, a fork of the vector library. It uses modern stream fusion, but is under active development and is a little scary. I'm a little unclear on the exact difference between uvector and vector. Both use arrays that are not pinned, so they can't be readily used with foreign code. If you want to use either library, understand that you're embarking on a bracing adventure. text is not mature, and is based on the same modern fusion framework as uvector and vector. It uses unpinned arrays, but provides functions for dealing with foreign code. It uses a denser encoding than uvector for text, and provides text-oriented functions like splitting on word and line boundaries. Although it's intended for use with Unicode text, it does not yet provide proper Unicode-aware functions for things like case conversion. It interacts with bytestring to perform conversion to and from standard representations like UTF-8, and (via the text-icu package) ICU for others (SJIS, KOI-8, etc). If you want to use this library, understand that you're embarking on a bracing adventure.

Excerpts from Bryan O'Sullivan's message of Sun Mar 08 00:45:03 -0600 2009:
uvector is, if my memory serves me correctly, a fork of the vector library. It uses modern stream fusion, but is under active development and is a little scary. I'm a little unclear on the exact difference between uvector and vector. Both use arrays that are not pinned, so they can't be readily used with foreign code. If you want to use either library, understand that you're embarking on a bracing adventure.
vector and uvector are roughly based on the same technology; uvector is - as far as I remember - a fork of some of the old DPH code which uses stream fusion which Don cleaned up and worked on (and it's proven pretty useful, and people are still hacking on it.) vector however, has the notion of 'recycling arrays' when it does array operations. The technique is in fact quite similar to stream fusion. Roman L. built this from scratch I think, so it's quite a bit more unused and less stable than even uvector is maybe, but I suppose you could say it's kind of a superset of uvector. Hopefully though it should mature a little, and the plans are to have the technology from both of these folded into the Data Parallel Haskell project so we get fast array operations+automatic parallelisation. For info, see Roman's paper, 'Recycle your arrays!' http://www.cse.unsw.edu.au/~rl/publications/recycling.html Austin

uvector is, if my memory serves me correctly, a fork of the vector library. It uses modern stream fusion, but is under active development and is a little scary. I'm a little unclear on the exact difference between uvector and vector. Both use arrays that are not pinned, so they can't be readily used with foreign code. If you want to use either library, understand that you're embarking on a bracing adventure.
vector and uvector are roughly based on the same technology; uvector is - as far as I remember - a fork of some of the old DPH code which uses stream fusion which Don cleaned up and worked on (and it's proven pretty useful, and people are still hacking on it.)
vector however, has the notion of 'recycling arrays' when it does array operations. The technique is in fact quite similar to stream fusion. Roman L. built this from scratch I think, so it's quite a bit more unused and less stable than even uvector is maybe, but I suppose you could say it's kind of a superset of uvector. Hopefully though it should mature a little, and the plans are to have the technology from both of these folded into the Data Parallel Haskell project so we get fast array operations+automatic parallelisation.
For info, see Roman's paper, 'Recycle your arrays!'
Given the close relationship between uvector and vector, it would be very helpful if both package descriptions on hackage could point to a common haskell wiki page, starting out with the text and link above, plus a link to the stream fusion paper (I hadn't been aware that vector incorporates the recycling work, and had often wondered about the precise relationship between those two packages). Apart from saving others from similar confusion, that would also provide a place to record experience with those two alternatives. Btw, have any of the Haskell array optimization researchers considered fixpoints yet? Both fusion and recycling are based on rewrite rules of the kind "in . out --> id". Now, given a loop like this: loop a = if c a then loop (out (action (in a))) else a loop a these rules don't apply. Unrolling the loop a fixed number of times would enable some rule applications, but still some would remain in the loop body. But with a little rewriting loop a = if c a then loop (out (action (in a))) else out (id (in a)) loop a loop a = if c a then loop (out (action (in a))) else out (id (in a)) (if c a then loop (out (action (in a))) else out (id (in a))) we can now push the out into the next iteration of the loop or, if there is no next iteration, into the loop epilogue loop a = if c (out a) then loop (action (in (out a))) else id (in (out a)) out (if c a then loop (action (in a)) else a) making the rewrite rule applicable loop a = if c (out a) then loop (action a) else id a out (if c a then loop (action (in a)) else a) leading (modulo bugs, omissions, and oversights;-) to a fused/ recycled loop body, with potentially substantial benefit. Claus

On 09/03/2009, at 11:47, Claus Reinke wrote:
Btw, have any of the Haskell array optimization researchers considered fixpoints yet?
This, for instance, is a very nice paper: http://www.pllab.riec.tohoku.ac.jp/~ohori/research/OhoriSasanoPOPL07.pdf However, in the context of high-performance array programming explicit recursion is bad because it is very hard if not impossible to parallelise automatically except in fairly trivial cases. And if your array program is not parallelisable then you don't really care about performance all that much :-) Roman

On Mon, 9 Mar 2009, Claus Reinke wrote:
Given the close relationship between uvector and vector, it would be very helpful if both package descriptions on hackage could point to a common haskell wiki page, starting out with the text and link above, plus a link to the stream fusion paper (I hadn't been aware that vector incorporates the recycling work, and had often wondered about the precise relationship between those two packages). Apart from saving others from similar confusion, that would also provide a place to record experience with those two alternatives.
I have at least started a page which mentions the existing alternatives: http://www.haskell.org/haskellwiki/Storable_Vector

On Mon, Mar 9, 2009 at 3:12 AM, Henning Thielemann
On Mon, 9 Mar 2009, Claus Reinke wrote:
Given the close relationship between uvector and vector, it would be very helpful if both package descriptions on hackage could point to a common haskell wiki page, starting out with the text and link above, plus a link to the stream fusion paper (I hadn't been aware that vector incorporates the recycling work, and had often wondered about the precise relationship between those two packages). Apart from saving others from similar confusion, that would also provide a place to record experience with those two alternatives.
I have at least started a page which mentions the existing alternatives: http://www.haskell.org/haskellwiki/Storable_Vector _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Thanks for all of the responses! So let me see if my summary is accurate here: - ByteString is for just that: strings of bytes, generally read off of a disk. The Char8 version just interprets the Word8s as Chars but doesn't do anything special with that. - Data.Text/text library is a higher-level library that deals with "text," abstracting over Unicode details and treating each element as a potentially-multibye "character." - utf8-string is a wrapper over ByteString that interprets the bytes in the bytestring as potentially-multibye unicode "characters." - uvector, storablevector and vector are all designed for dealing with arrays. They *can* be used for characters/word8s but are not specialized for that purpose, do not deal with Unicode at all, and are probably worse at it. They are better for dealing with things that you would generally use arrays for. If that seems accurate, I'll put it on the wiki. Alex

Alexander Dunlap
- uvector, storablevector and vector are all designed for dealing with arrays. They *can* be used for characters/word8s but are not specialized for that purpose, do not deal with Unicode at all, and are probably worse at it. They are better for dealing with things that you would generally use arrays for.
I think uvector only works with certain types that can be unboxed, while storablevector works with all types that instantiate Foreign.Storable.Storable. I don't know about vector. From the description of vector, I have the impression that it is quite unstable. How is it compared to uvector and storablevector? I need one of those to work with my code to possibly improve the efficiency. But I can't use uvector, because I can't use unboxed types. Should I use storablevector? Or vector? Thanks, Xiao-Yong -- c/* __o/* <\ * (__ */\ <

xj2106:
Alexander Dunlap
writes: - uvector, storablevector and vector are all designed for dealing with arrays. They *can* be used for characters/word8s but are not specialized for that purpose, do not deal with Unicode at all, and are probably worse at it. They are better for dealing with things that you would generally use arrays for.
I think uvector only works with certain types that can be unboxed, while storablevector works with all types that instantiate Foreign.Storable.Storable. I don't know about vector. From the description of vector, I have the
That's interesting. I'd expect Storable and UA to have the same set of inhabitants. Is there any difference?
impression that it is quite unstable. How is it compared to uvector and storablevector? I need one of those to work with my code to possibly improve the efficiency. But I can't use uvector, because I can't use unboxed types.
Hmm? If you can write a Storable instance, you can write a UA instance. -- Don

Hello Don, Tuesday, March 10, 2009, 10:40:30 PM, you wrote:
I think uvector only works with certain types that can be unboxed, while storablevector works with all types that instantiate Foreign.Storable.Storable. I don't know about vector. From the description of vector, I have the
That's interesting. I'd expect Storable and UA to have the same set of inhabitants. Is there any difference?
if uavector use ghc's built-in unboxed array operations (as Data.Array.Unboxed does) then it's necessarily bounded to types supported by those operations btw, i have extended this set in my own library by making a class which has implementation using those operations but may be extended further. OTOH, nowadays when ForeignArrays becomes as fast as built-in ones, i don't see any reasons for these built-in operations at all -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

bulat.ziganshin:
Hello Don,
Tuesday, March 10, 2009, 10:40:30 PM, you wrote:
I think uvector only works with certain types that can be unboxed, while storablevector works with all types that instantiate Foreign.Storable.Storable. I don't know about vector. From the description of vector, I have the
That's interesting. I'd expect Storable and UA to have the same set of inhabitants. Is there any difference?
if uavector use ghc's built-in unboxed array operations (as Data.Array.Unboxed does) then it's necessarily bounded to types supported by those operations
And what is Storable limited to? Ultimately they're all limited to the primops for reading and writing, and to what types we can encode in those. So: primop ReadOffAddrOp_Char "readCharOffAddr#" GenPrimOp primop ReadOffAddrOp_WideChar "readWideCharOffAddr#" GenPrimOp primop ReadOffAddrOp_Int "readIntOffAddr#" GenPrimOp primop ReadOffAddrOp_Word "readWordOffAddr#" GenPrimOp primop ReadOffAddrOp_Addr "readAddrOffAddr#" GenPrimOp primop ReadOffAddrOp_Float "readFloatOffAddr#" GenPrimOp primop ReadOffAddrOp_Double "readDoubleOffAddr#" GenPrimOp primop ReadOffAddrOp_StablePtr "readStablePtrOffAddr#" GenPrimOp primop ReadOffAddrOp_Int8 "readInt8OffAddr#" GenPrimOp primop ReadOffAddrOp_Int16 "readInt16OffAddr#" GenPrimOp primop ReadOffAddrOp_Int32 "readInt32OffAddr#" GenPrimOp primop ReadOffAddrOp_Int64 "readInt64OffAddr#" GenPrimOp primop ReadOffAddrOp_Word8 "readWord8OffAddr#" GenPrimOp primop ReadOffAddrOp_Word16 "readWord16OffAddr#" GenPrimOp primop ReadOffAddrOp_Word32 "readWord32OffAddr#" GenPrimOp primop ReadOffAddrOp_Word64 "readWord64OffAddr#" GenPrimOp {- instance Storable Double instance Storable Bool instance Storable Char instance Storable Int instance Storable Float ... -} {- instance UA () instance (UA a, UA b) => UA (a :*: b) instance UA Bool instance UA Char instance UA Int instance UA Float instance UA Double ... -} So what's a type that's Storable, but not writable in UA (or UArray or ..) -- Don

Don Stewart
And what is Storable limited to?
Ultimately they're all limited to the primops for reading and writing, and to what types we can encode in those. So:
primop ReadOffAddrOp_Char "readCharOffAddr#" GenPrimOp ... {- instance Storable Double instance Storable Bool instance Storable Char instance Storable Int instance Storable Float ... -}
{-
instance UA () instance (UA a, UA b) => UA (a :*: b) instance UA Bool instance UA Char instance UA Int instance UA Float instance UA Double ... -}
So what's a type that's Storable, but not writable in UA (or UArray or ..)
So it's me who understand it wrong. If I want some high performance array with elements of custom data type, I'm stuck with Array, anyway? Is it possible to make instance UA UserDefinedDataType I'm not sure how to do that. Can you give me some clarification? Thanks, Xiao-Yong -- c/* __o/* <\ * (__ */\ <

Hello Xiao-Yong, Tuesday, March 10, 2009, 11:52:50 PM, you wrote:
So it's me who understand it wrong. If I want some high performance array with elements of custom data type, I'm stuck with Array, anyway?
ForeignArray will be the best here. just make you type instance of Storable. if you need *movable* arrays, you will have to implement your own array datatype or use ArrayRef library and declare your type as instance of Unboxed -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

xj2106:
Don Stewart
writes: And what is Storable limited to?
Ultimately they're all limited to the primops for reading and writing, and to what types we can encode in those. So:
primop ReadOffAddrOp_Char "readCharOffAddr#" GenPrimOp ... {- instance Storable Double instance Storable Bool instance Storable Char instance Storable Int instance Storable Float ... -}
{-
instance UA () instance (UA a, UA b) => UA (a :*: b) instance UA Bool instance UA Char instance UA Int instance UA Float instance UA Double ... -}
So what's a type that's Storable, but not writable in UA (or UArray or ..)
So it's me who understand it wrong. If I want some high performance array with elements of custom data type, I'm stuck with Array, anyway?
Is it possible to make
instance UA UserDefinedDataType
I'm not sure how to do that. Can you give me some clarification?
Yes, you can do that. This is the case for most of the new array libraries. -- Don

Don Stewart
instance UA UserDefinedDataType
I'm not sure how to do that. Can you give me some clarification?
Yes, you can do that. This is the case for most of the new array libraries.
It goes beyond my current knowledge, now. How do you define a custom data type as an instance of UA or Storable? Xiao-Yong -- c/* __o/* <\ * (__ */\ <

Hello Xiao-Yong, Wednesday, March 11, 2009, 12:28:45 AM, you wrote:
It goes beyond my current knowledge, now. How do you define a custom data type as an instance of UA or Storable?
just look at existing instances. basically, for complex data type, you just use instances for its basic types, plus you need to calculate offset of second and following fields (using sizeOf in Storable class) -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

Hi, thanks for the hint. I'll see what I can do with it.
Xiao-Yong
Bulat Ziganshin
Hello Xiao-Yong,
Wednesday, March 11, 2009, 12:28:45 AM, you wrote:
It goes beyond my current knowledge, now. How do you define a custom data type as an instance of UA or Storable?
just look at existing instances. basically, for complex data type, you just use instances for its basic types, plus you need to calculate offset of second and following fields (using sizeOf in Storable class)
-- c/* __o/* <\ * (__ */\ <

Hello Don, Tuesday, March 10, 2009, 11:01:31 PM, you wrote:
if uavector use ghc's built-in unboxed array operations (as Data.Array.Unboxed does) then it's necessarily bounded to types supported by those operations
And what is Storable limited to?
Ultimately they're all limited to the primops for reading and writing,
the full story: ghc up to 6.6 has slow access to ForeignArrays, as you may recall therefore, those primitives was added. ByteArra# plus those primitives was the only way to have unboxed arrays with fast access starting with 6.6, ForeignArray access is no-op, so we can just use obvious Ptr operations (via Storable class) to get unboxed arrays fast access. so, no more need for those special ByteArray# access operations but Array library still old, so any effort based on its spources, got the same restrictions also, ByteArray# may be unpinned, but afaik, this isn't really important - it can be coerced to Ptr for the period of one operation -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

bulat.ziganshin:
Hello Don,
Tuesday, March 10, 2009, 11:01:31 PM, you wrote:
if uavector use ghc's built-in unboxed array operations (as Data.Array.Unboxed does) then it's necessarily bounded to types supported by those operations
And what is Storable limited to?
Ultimately they're all limited to the primops for reading and writing,
the full story:
ghc up to 6.6 has slow access to ForeignArrays, as you may recall
therefore, those primitives was added. ByteArra# plus those primitives was the only way to have unboxed arrays with fast access
starting with 6.6, ForeignArray access is no-op, so we can just use obvious Ptr operations (via Storable class) to get unboxed arrays fast access. so, no more need for those special ByteArray# access operations
but Array library still old, so any effort based on its spources, got the same restrictions
also, ByteArray# may be unpinned, but afaik, this isn't really important - it can be coerced to Ptr for the period of one operation
Right, so my point stands: there's no difference now. If you can write a Storable instance, you can write a UA et al instance. And GHC 6.6 was released what, 11 October 2006? So this has been the case for a long time. -- Don

Hello Don, Wednesday, March 11, 2009, 12:12:07 AM, you wrote:
Right, so my point stands: there's no difference now. If you can write a Storable instance, you can write a UA et al instance.
yes, if there is some class provided for this and not just hard-coded 4 or so base types
And GHC 6.6 was released what, 11 October 2006? So this has been the case for a long time.
unfortunately, Array library unboxed arrays still aren't based on any Unboxable *class* -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

bulat.ziganshin:
Hello Don,
Wednesday, March 11, 2009, 12:12:07 AM, you wrote:
Right, so my point stands: there's no difference now. If you can write a Storable instance, you can write a UA et al instance.
yes, if there is some class provided for this and not just hard-coded 4 or so base types
That's right. For example (supporting even pairs): instance (RealFloat a, UA a) => UA (Complex a) where newtype UArr (Complex a) = UAComplex (UArr (a :*: a)) newtype MUArr (Complex a) s = MUAComplex (MUArr (a :*: a) s)
And GHC 6.6 was released what, 11 October 2006? So this has been the case for a long time.
unfortunately, Array library unboxed arrays still aren't based on any Unboxable *class*
Hmm. Aren't all the array library types based on MArray and IArray?
So I can define my own say, new STUArray element type by writing an instance of
MArray for it. Like so:
{-# LANGUAGE MultiParamTypeClasses #-}
{-# LANGUAGE MagicHash #-}
{-# LANGUAGE UnboxedTuples #-}
{-# LANGUAGE CPP #-}
-- get at low level representation stuff
import Data.Array.Base
import GHC.IOBase
import GHC.ST ( ST(..), runST )
import GHC.Prim
import GHC.Base
import GHC.Word
import GHC.Ptr
import GHC.Float
import GHC.Stable
import GHC.Int
import GHC.Word
import Data.Array.Unboxed
-- helpers
import Data.Bits
import Text.Printf
import System.Environment
import Control.Monad
-- portable to 32 bit or 64 bit
#include

Hello Don, Wednesday, March 11, 2009, 12:48:35 AM, you wrote:
unfortunately, Array library unboxed arrays still aren't based on any Unboxable *class*
Hmm. Aren't all the array library types based on MArray and IArray?
So I can define my own say, new STUArray element type by writing an instance of MArray for it. Like so:
yes, you can, just this definition duplicates too much code -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

bulat.ziganshin:
Hello Don,
Wednesday, March 11, 2009, 12:48:35 AM, you wrote:
unfortunately, Array library unboxed arrays still aren't based on any Unboxable *class*
Hmm. Aren't all the array library types based on MArray and IArray?
So I can define my own say, new STUArray element type by writing an instance of MArray for it. Like so:
yes, you can, just this definition duplicates too much code
Ah yes. Then in that case that is indeed true. Array / UArray aren't as extensible as the newer systems modelled on Storable (like UA in vector/uvector).

Don Stewart ha scritto:
bulat.ziganshin:
Hello Don,
Wednesday, March 11, 2009, 12:12:07 AM, you wrote:
Right, so my point stands: there's no difference now. If you can write a Storable instance, you can write a UA et al instance. yes, if there is some class provided for this and not just hard-coded 4 or so base types
That's right. For example (supporting even pairs):
instance (RealFloat a, UA a) => UA (Complex a) where
newtype UArr (Complex a) = UAComplex (UArr (a :*: a)) newtype MUArr (Complex a) s = MUAComplex (MUArr (a :*: a) s)
You also have to add instance for UIO: instance (RealFloat a, UIO a) => UIO (Complex a) where hPutU h (UAComplex arr) = hPutU h arr hGetU h = do arr <- hGetU h return (UAComplex arr) With Storable, this should not be required; you just have to write an instance for the Storable class. Regards Manlio

manlio_perillo:
Don Stewart ha scritto:
bulat.ziganshin:
Hello Don,
Wednesday, March 11, 2009, 12:12:07 AM, you wrote:
Right, so my point stands: there's no difference now. If you can write a Storable instance, you can write a UA et al instance. yes, if there is some class provided for this and not just hard-coded 4 or so base types
That's right. For example (supporting even pairs):
instance (RealFloat a, UA a) => UA (Complex a) where
newtype UArr (Complex a) = UAComplex (UArr (a :*: a)) newtype MUArr (Complex a) s = MUAComplex (MUArr (a :*: a) s)
You also have to add instance for UIO:
instance (RealFloat a, UIO a) => UIO (Complex a) where hPutU h (UAComplex arr) = hPutU h arr hGetU h = do arr <- hGetU h return (UAComplex arr)
With Storable, this should not be required; you just have to write an instance for the Storable class.
Though you get no IO operations with Storable... UIO is entirely separate.

Don Stewart ha scritto:
[...]
You also have to add instance for UIO:
instance (RealFloat a, UIO a) => UIO (Complex a) where hPutU h (UAComplex arr) = hPutU h arr hGetU h = do arr <- hGetU h return (UAComplex arr)
With Storable, this should not be required; you just have to write an instance for the Storable class.
Though you get no IO operations with Storable... UIO is entirely separate.
Yes, but using Storable and Foreign.Ptr, IO is rather simple. From storablevector package: -- | Outputs a 'Vector' to the specified 'Handle'. hPut :: (Storable a) => Handle -> Vector a -> IO () hPut h v = if null v then return () else let (fptr, s, l) = toForeignPtr v in withForeignPtr fptr $ \ ptr -> let ptrS = advancePtr ptr s ptrE = advancePtr ptrS l -- use advancePtr and minusPtr in order to respect -- alignment in hPutBuf h ptrS (minusPtr ptrE ptrS) -- | Read a 'Vector' directly from the specified 'Handle'. This -- is far more efficient than reading the characters into a list -- and then using 'pack'. -- hGet :: (Storable a) => Handle -> Int -> IO (Vector a) hGet _ 0 = return empty hGet h i = createAndTrim i $ \p -> let elemType :: Ptr a -> a elemType _ = undefined sizeOfElem = sizeOf (elemType p) in fmap (flip div sizeOfElem) $ hGetBuf h p (i * sizeOfElem) Regards Manlio

On Tue, 2009-03-10 at 23:55 +0300, Bulat Ziganshin wrote:
starting with 6.6, ForeignArray access is no-op, so we can just use obvious Ptr operations (via Storable class) to get unboxed arrays fast access. so, no more need for those special ByteArray# access operations
but Array library still old, so any effort based on its spources, got the same restrictions
also, ByteArray# may be unpinned, but afaik, this isn't really important - it can be coerced to Ptr for the period of one operation
Actually the fact that they may be unpinned is really important for small allocations like short strings. Pinned allocations are slow and lead to heap fragmentation. Duncan

Don Stewart ha scritto:
[...]
I think uvector only works with certain types that can be unboxed, while storablevector works with all types that instantiate Foreign.Storable.Storable. I don't know about vector. From the description of vector, I have the
One of the nice feature of uvector is the support for UArr (a :*: b). An UArr (a :*: b) can be easily (with fstU and sndU) transformed in UArr a and UArr b. uvector package also suppors Complex and Rational, however the support for these type is "hard written", using a UAProd class, and requires some boiler plate code (IMHO). I find StorableVector implementation much more simple; I would like to see it in the Haskell Platform. As for Data.Parallel, uvector and vector, it seems there is some code duplication. Both Data.Parallel and uvector, make us of a strict pair type. Such a type is also implemented in the strict package [1]. The authors are the same, so I don't understand the reason of code replication. There is also replication in the definition of the Stream data type. [1] there seems to be an error in the documentation: http://hackage.haskell.org/packages/archive/strict/0.3.2/doc/html/Data-Stric... In the description, there is: "Same as regular Haskell pairs, but (x :*: _|_) = (_|_ :*: y) = _|_" but in the synopsis, the data constructor is :!:, not :*:. Regards Manlio Perillo

Manlio Perillo ha scritto:
[...] uvector package also suppors Complex and Rational, however the support for these type is "hard written", using a UAProd class, and requires some boiler plate code (IMHO).
Correction: UAProd is not a class, sorry. It is the UA constructor overloaded for a:*:b, and Complex and Rational just reuse this specialization.
[...]
Manlio

On Mon, 9 Mar 2009, Alexander Dunlap wrote:
- uvector, storablevector and vector are all designed for dealing with arrays. They *can* be used for characters/word8s but are not specialized for that purpose, do not deal with Unicode at all, and are probably worse at it. They are better for dealing with things that you would generally use arrays for.
Since the Storable instance of Char stores 32 bit values, I expect they store full Unicode codes and thus StorableVector can be used for Unicode strings without loss. They will require more memory than UTF8 strings, but whether they are faster or slower will depend on how expensive UTF8 decoding/encoding is.

On Mon, 2009-03-09 at 18:29 -0700, Alexander Dunlap wrote:
Thanks for all of the responses!
So let me see if my summary is accurate here:
- ByteString is for just that: strings of bytes, generally read off of a disk. The Char8 version just interprets the Word8s as Chars but doesn't do anything special with that.
Right. So it's only suitable for binary or ASCII (or mixed) formats.
- Data.Text/text library is a higher-level library that deals with "text," abstracting over Unicode details and treating each element as a potentially-multibye "character."
If you're writing about this on the wiki for people, it's best not to confuse the issue by talking about multibyte anything. We do not describe [Char] as a multibyte encoding of Unicode. We say it is a Unicode string. The abstraction is at the level of Unicode code points. The String type *is* a sequence of Unicode code points. This is exactly the same for Data.Text. It is a sequence of Unicode code points. It is not an encoding. It is not UTF-anything. It does not abstract over Unicode. The Text type is just like the String type but with different strictness and performance characteristics. Both are just sequences of Unicode code points. There is a reasonably close correspondence between Unicode code points and what people normally think of as characters.
- utf8-string is a wrapper over ByteString that interprets the bytes in the bytestring as potentially-multibye unicode "characters."
This on the other hand is an encoding. ByteString is a sequence of bytes and when we interpret that as UTF-8 then we are looking at an encoding of a sequence of Unicode code points. Clear as mud? :-) Duncan

bos:
On Sat, Mar 7, 2009 at 10:23 PM, Alexander Dunlap
wrote: Hi all,
For a while now, we have had Data.ByteString[.Lazy][.Char8] for our fast strings. Now we also have Data.Text, which does the same for Unicode. These seem to be the standard for dealing with lists of bytes and characters.
Now we also have the storablevector, uvector, and vector packages. These seem to be also useful for unpacked data, *including* Char and Word8 values.
What is the difference between bytestring and these new "fast array" libraries? Are the latter just generalizations of the former?
There are quite a few overlaps and differences among them.
bytestring is mature and useful for low-level byte buffer manipulations, and also for efficient I/O. This is in part because it uses pinned pointers that can interoperate easily with foreign code. It used to have an early fusion rewriting framework, but that was abandoned. So it will not fuse multiple ByteString traversals into single loops. This library is widely used, and also somewhat abused for text I/O.
storablevector is not mature (I'm not even sure if it's actually used) and is a derivative of an old version of the bytestring library, and so has similar characteristics for interacting with foreign code. It contains some old fusion code that is sketchy in nature and somewhat likely to be broken. I'm not sure I would recommend using this library.
uvector is, if my memory serves me correctly, a fork of the vector library. It uses modern stream fusion, but is under active development and is a little scary. I'm a little unclear on the exact difference between uvector and vector. Both use arrays that are not pinned, so they can't be readily used with foreign code. If you want to use either library, understand that you're embarking on a bracing adventure.
text is not mature, and is based on the same modern fusion framework as uvector and vector. It uses unpinned arrays, but provides functions for dealing with foreign code. It uses a denser encoding than uvector for text, and provides text-oriented functions like splitting on word and line boundaries. Although it's intended for use with Unicode text, it does not yet provide proper Unicode-aware functions for things like case conversion. It interacts with bytestring to perform conversion to and from standard representations like UTF-8, and (via the text-icu package) ICU for others (SJIS, KOI-8, etc). If you want to use this library, understand that you're embarking on a bracing adventure.
I endorse this message.

On Sat, 7 Mar 2009, Bryan O'Sullivan wrote:
On Sat, Mar 7, 2009 at 10:23 PM, Alexander Dunlap
wrote: Hi all,
For a while now, we have had Data.ByteString[.Lazy][.Char8] for our fast strings. Now we also have Data.Text, which does the same for Unicode. These seem to be the standard for dealing with lists of bytes and characters.
Now we also have the storablevector, uvector, and vector packages. These seem to be also useful for unpacked data, *including* Char and Word8 values.
What is the difference between bytestring and these new "fast array" libraries? Are the latter just generalizations of the former?
storablevector is not mature (I'm not even sure if it's actually used) and is a derivative of an old version of the bytestring library, and so has similar characteristics for interacting with foreign code. It contains some old fusion code that is sketchy in nature and somewhat likely to be broken. I'm not sure I would recommend using this library.
As maintainer of storablevector I can tell that I use it for realtime audio signal processing. Indeed, I expected more of the fusion mechanism than it can do. It is hard to get correct fusion on lazy storablevectors at all, because of non-matching chunk sizes. Thus I didn't follow that path anymore. For audio signal processing I use a list type, like that of the Streams approach. When you convert those streams into storable vectors you get efficient inner loops without any fusion. It would certainly be worth to fuse with those Stream lists, but I haven't tried that so far.

Bryan O'Sullivan ha scritto:
[...] text is not mature, and is based on the same modern fusion framework as uvector and vector. It uses unpinned arrays, but provides functions for dealing with foreign code.
What is the reason why you have decided to use unpinned arrays (ByteArray#) instead of pinned arrays (Foreign.Ptr)?
[...]
Thanks Manlio

manlio_perillo:
Bryan O'Sullivan ha scritto:
[...] text is not mature, and is based on the same modern fusion framework as uvector and vector. It uses unpinned arrays, but provides functions for dealing with foreign code.
What is the reason why you have decided to use unpinned arrays (ByteArray#) instead of pinned arrays (Foreign.Ptr)?
They prevent heap fragmentation (and in general are faster). -- Don

Hello Don, Friday, March 13, 2009, 8:08:57 PM, you wrote:
What is the reason why you have decided to use unpinned arrays (ByteArray#) instead of pinned arrays (Foreign.Ptr)?
They prevent heap fragmentation (and in general are faster).
you probably mean faster alloc/gc operations, everything else should be the same -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

On Fri, 2009-03-13 at 20:30 +0300, Bulat Ziganshin wrote:
Hello Don,
Friday, March 13, 2009, 8:08:57 PM, you wrote:
What is the reason why you have decided to use unpinned arrays (ByteArray#) instead of pinned arrays (Foreign.Ptr)?
They prevent heap fragmentation (and in general are faster).
you probably mean faster alloc/gc operations, everything else should be the same
Right. Access times are the same. Both are just pointers internally. It is just the allocation time, GC time and extra memory use and lower cache utilisation caused by heap fragmentation. For big arrays it doesn't make much difference. Big ByteArray# allocations get pinned anyway. For small ones, like strings I expect the difference is much more noticeable, though I have not measured it. Using ByteArray# also means we can use ST/runST rather than IO/unsafePerformIO. In general we should prefer heap allocated byte arrays and ST unless we really really want to always use pinned allocations to interact with C libs easily. Duncan

Excerpts from Alexander Dunlap's message of Sun Mar 08 00:23:01 -0600 2009:
For a while now, we have had Data.ByteString[.Lazy][.Char8] for our fast strings. Now we also have Data.Text, which does the same for Unicode. These seem to be the standard for dealing with lists of bytes and characters.
Now we also have the storablevector, uvector, and vector packages. These seem to be also useful for unpacked data, *including* Char and Word8 values.
What is the difference between bytestring and these new "fast array" libraries? Are the latter just generalizations of the former?
Thanks for any insight anyone can give on this.
Alex
Data.Text provides functions for unicode over bytestrings, with several encoding/decoding methods. So, I think that bytestring+text now solves the general problem with the slow String type - we get various international encodings, and fast, efficient packed strings. (It's also worth mentioning utf8-string, which gives you utf8 over bytestrings. text gives you more encodings and is probably still quite efficient, however.) But this is pretty much a separate effort to that of packages like uvector/vector etc. etc.. To clarify, uvector and vector are likely to be merged in the future I think - vector is based on the idea of 'recycling arrays' so that array operations are still very efficient, while uvector only has the tested stream fusion technique behind it. Actually, I think the inevitable plan is to merge the technology behind both vector and uvector into the Data Parallel Haskell project. Array recylcing and stream fusion goes into creating extremely efficient sequential code, while the vectorisation pass turns that into efficient multicore code at the same time. In any case, I suppose that hypothetically if someone wanted to use a package like uvector to create an efficient string type, they could, but if they want that, why not just use bytestring? It's already optimized, battle tested and in extremely wide use. I think some library proliferation is good; in this case, the libraries mentioned here are really for some different purposes, and that's great, because they all lead to some nice, fast code with low conceptual overhead when put together (hopefully...) But I'm not even going to begin examining/comparing the different array interfaces or anything, because that's been done many times here, so you best check the archives if you want the 'in-depth' on the matter. Austin
participants (11)
-
Alexander Dunlap
-
Austin Seipp
-
Bryan O'Sullivan
-
Bulat Ziganshin
-
Claus Reinke
-
Don Stewart
-
Duncan Coutts
-
Henning Thielemann
-
Manlio Perillo
-
Roman Leshchinskiy
-
Xiao-Yong Jin