Re: [Haskell] ANNOUNCE: FPS - FastPackedStrings 0.2

Hello Donald, Wednesday, April 19, 2006, 1:23:25 PM, you wrote:
I'm pleased to announce version 0.2 of FPS, the fast, packed string library for Haskell.
i want to write some critique about current state of your lib. something of this may be interesting only for me 1. it don't support Unicode. there are at least two libs (Simon's and from JHC) that uses UTF-8 to do this. of course, they will be not so efficient on some operations. i think that it is essential to general-purpose library and Data.PackedString replacement. Simon's lib already implements utf-8, latin-1, ucs-2 and ucs-4 encoding. may be it's possible to join them all together in one lib that uses prerocessing or some other technique to implement differences between utf-8 and fixed-width encoding 2. it uses ForeignPtr, that is slow in ghc 6.4 and require more memory than ByteArray# in any ghc version. in addition, you place two Ints here. it's question of taste, of course - whether we want to have better speed or memory requirements. i personally prefer to have minimal memory overhead that will be accomplished by using ByteArray# plus one Int to hold string's size. it's possible to write lib so it will use any unboxed array and this will work both with ForeignPtr and ByteArray# implementations. It will also allow to implement general list-like interface for any immutable arrays what should be useful, in particular for collections framework that is now written by JP Bernardy. of course, using general Array instead of ForeignPtr will make some functions (such as mmap) unimplementable for common case 3. i'm also have two libs that is intended to be included in GHC 6.6. one of them is reworking of Array facilities, another replaces the Handles functionality. there are some common interests between us here - adding common ListEmulation interface to Arrays, then implementing FastStrings on top of it, then implementing really fast I/O for these FastStrings. we definitely should join our work at some moment to provide consistent features set for ghc 6.6. currently i can say that i also tried to implement fast I/O using packed strings from your lib and from jhc's module and - yes, i found that without prior `findEOL` it's hard to make it fast with your (ForeignPtr-based) implementation. it seems that you encounter the same problem disclaimer: i know that i say about huge changes in the library structure, so it's just thoughts on this issue. the only really required thing is to make my streams I/O work efficient with your lib if both libs will go into 6.6 -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

On Wed, Apr 19, 2006 at 06:04:58PM +0400, Bulat Ziganshin wrote:
1. it don't support Unicode. there are at least two libs (Simon's and from JHC) that uses UTF-8 to do this. of course, they will be not so efficient on some operations. i think that it is essential to general-purpose library and Data.PackedString replacement. Simon's lib already implements utf-8, latin-1, ucs-2 and ucs-4 encoding. may be it's possible to join them all together in one lib that uses prerocessing or some other technique to implement differences between utf-8 and fixed-width encoding
Indeed. I was excited about the prospect of using FastPackedString until I saw it didn't support the full character range. that is too bad. I'd recommend just always using utf8 under the hood (since it shouldn't matter what encoding is used internally) and have two integers stored with the pointer, the number of bytes and the number of characters. when these are the same you know you have straight ASCII, plus it gives you O(1) length for free. I have very optimized utf8 fold operators in the jhc version of PackedString you can steal. they get speed by assuming the data is always valid UTF8 so don't do error checking, which the constructors always enforce. (yay for ADTs) John -- John Meacham - ⑆repetae.net⑆john⑈

John Meacham
I'd recommend just always using utf8 under the hood
Or have two cases of the representation: an array of bytes if every character is U+00FF or below, or an array of 32-bit words otherwise. My Kogut compiler does that internally, and makes use of that when interfacing with C (the default encoding is assumed to preserve ASCII; if the narrow case has only ASCII chars, a pointer to its internals is passed to a C function; for this reason it has an otherwise unused '\0' after the last character). AFAIK CLISP has 16-bit words as the third case. -- __("< Marcin Kowalczyk \__/ qrczak@knm.org.pl ^^ http://qrnik.knm.org.pl/~qrczak/

On Thu, Apr 20, 2006 at 12:47:49AM +0200, Marcin 'Qrczak' Kowalczyk wrote:
I'd recommend just always using utf8 under the hood
Or have two cases of the representation: an array of bytes if every character is U+00FF or below, or an array of 32-bit words otherwise.
The complexity of multiple cases and encodings never seemed worth it to me. the code gets bigger and you have to have switches depending on the representation that slows things down. Just plain old utf8 always seems the best for a FastPackedString library at least. But others opinions differ on the matter. John -- John Meacham - ⑆repetae.net⑆john⑈

Hello John, Thursday, April 20, 2006, 3:03:23 AM, you wrote:
On Thu, Apr 20, 2006 at 12:47:49AM +0200, Marcin 'Qrczak' Kowalczyk wrote:
I'd recommend just always using utf8 under the hood
Or have two cases of the representation: an array of bytes if every character is U+00FF or below, or an array of 32-bit words otherwise.
The complexity of multiple cases and encodings never seemed worth it to me. the code gets bigger and you have to have switches depending on the representation that slows things down. Just plain old utf8 always seems the best for a FastPackedString library at least. But others opinions differ on the matter.
i'm not sure. utf-8 encoding has his own drawbacks (i mean complexity of code and slowness). moreover if Donald will use Simon's technique for implementing ucs1..ucs4 packed strings, this module will just call existing functions. one problem, though, is what we will lose utf-8 support for mapped files and any other C strings -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

Hello John, Thursday, April 20, 2006, 1:52:53 AM, you wrote:
Indeed. I was excited about the prospect of using FastPackedString until I saw it didn't support the full character range. that is too bad.
this lib should be slower than your on small strings due to ForeignPtr inefficiency and i think that Donald should mention in doc/announcement that his lib is latin-1 only. it's not good that each of us should scan his sources to rediscover this fact -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

bulat.ziganshin:
Hello John,
Thursday, April 20, 2006, 1:52:53 AM, you wrote:
Indeed. I was excited about the prospect of using FastPackedString until I saw it didn't support the full character range. that is too bad.
this lib should be slower than your on small strings due to ForeignPtr inefficiency
Do you have some sketches of ByteArray# represenations? I'm very happy to change the representation if it makes things more time or space efficient. -- Don

Hello Donald, Thursday, April 20, 2006, 1:06:38 PM, you wrote:
this lib should be slower than your on small strings due to ForeignPtr inefficiency
Do you have some sketches of ByteArray# represenations? I'm very happy to change the representation if it makes things more time or space efficient.
you should see at the jhc's PackedString implementation that uses UArray (which is internally ByteArray#). his implementation also don't use additional Ints - array size is stored as part of UArray structure and O(1) substr operation is just not exists (instead, it is O(n) with copying) as i said, this can be extended to support List-like interface for ANY arrays, that is the very useful thing. if you will implement this, then we will only need utf-8 wrapper around UArray Int Word8 for implementing packed Unicode strings (and Latin1/UCS-2/UCS-4 strings will work just as unboxed arrays of Word8/Word16/Word32) you can compare speed and memory requirements of ForeignPtr and ByteArray# solutions by comparing jhc's module with Simon's utf-8 implementation - both are not such greatly optimized as your lib, both uses utf-8. please use large number of typical strings (with size of 10-100) for real-world comparison, comparing on 20mb strings is not indicative i can write disk traversing algorithm (like ls-lR) for some sort of real app, to make testing more realistic -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

On 20.04 10:52, Bulat Ziganshin wrote:
this lib should be slower than your on small strings due to ForeignPtr inefficiency
Actually having O(1) substrings is very nice and can improve performance quite a lot. Another feature is the easy integration with low level libraries that want a Ptr for input/output.
and i think that Donald should mention in doc/announcement that his lib is latin-1 only. it's not good that each of us should scan his sources to rediscover this fact
Actually I am using fps with UTF8 and no problems. The trick is that I care about substrings rather than invidual characters. Usually one ends up doing all the splitting etc on ascii characters and the rest are handled as substrings where character boundaries are meaningless. We can use the UTF8 strings on multiple levels: 1) just bytes + ascii character matching 2) match physical unicode characters one by one 3) match unicode substrings I would argue that in many cases either 1) or 3) is what is really wanted. Composite characters and combining marks make 2) problematic. FPS does 1) quite well and it should be feasible to build separate modules providing 2) or 3) on top of it. Haskell does not support full range of unicode characters for meaningful operations. One cannot do IO with the standard libraries with Chars outside the Latin-1 range. - Einar Karttunen

On Wed, Apr 19, 2006 at 02:52:53PM -0700, John Meacham wrote:
I'd recommend just always using utf8 under the hood (since it shouldn't matter what encoding is used internally) and have two integers stored with the pointer, the number of bytes and the number of characters. when these are the same you know you have straight ASCII, plus it gives you O(1) length for free. I have very optimized utf8 fold operators in the jhc version of PackedString you can steal. they get speed by assuming the data is always valid UTF8 so don't do error checking, which the constructors always enforce. (yay for ADTs)
Nice trick, but you loose constant time indexing operations, don't you? Best regards Tomasz

bulat.ziganshin:
Hello Donald,
Wednesday, April 19, 2006, 1:23:25 PM, you wrote:
I'm pleased to announce version 0.2 of FPS, the fast, packed string library for Haskell.
i want to write some critique about current state of your lib. something of this may be interesting only for me
1. it don't support Unicode. there are at least two libs (Simon's and from JHC) that uses UTF-8 to do this. of course, they will be not so efficient on some operations. i think that it is essential to general-purpose library and Data.PackedString replacement. Simon's lib already implements utf-8, latin-1, ucs-2 and ucs-4 encoding. may be it's possible to join them all together in one lib that uses prerocessing or some other technique to implement differences between utf-8 and fixed-width encoding
I plan to do that at some point, yes, following the structure of Simon's lib. It should be fairly easy to merge Simon's UTF code with fps, as the data structure is the same.
2. it uses ForeignPtr, that is slow in ghc 6.4 and require more memory than ByteArray# in any ghc version. in addition, you place two Ints here. it's question of taste, of course - whether we want to have better speed or memory requirements. i personally prefer to have minimal memory overhead that will be accomplished by using ByteArray# plus one Int to hold string's size. it's possible to write lib so it will use any unboxed array and this will work both with ForeignPtr and ByteArray# implementations. It will also allow to implement general list-like interface for any immutable arrays what should be useful, in particular for collections framework that is now written by JP Bernardy. of course, using general Array instead of ForeignPtr will make some functions (such as mmap) unimplementable for common case
That's quite a big change. I'd like to see a prototype first, and some numbers. As SPJ sometimes says: "show me the code". Cheers, Don

Donald Bruce Stewart wrote:
bulat.ziganshin:
2. it uses ForeignPtr, that is slow in ghc 6.4 and require more memory than ByteArray# in any ghc version. in addition, you place two Ints here. it's question of taste, of course - whether we want to have better speed or memory requirements. i personally prefer to have minimal memory overhead that will be accomplished by using ByteArray# plus one Int to hold string's size. it's possible to write lib so it will use any unboxed array and this will work both with ForeignPtr and ByteArray# implementations. It will also allow to implement general list-like interface for any immutable arrays what should be useful, in particular for collections framework that is now written by JP Bernardy. of course, using general Array instead of ForeignPtr will make some functions (such as mmap) unimplementable for common case
That's quite a big change. I'd like to see a prototype first, and some numbers. As SPJ sometimes says: "show me the code".
Optimising the library for GHC 6.4.x isn't a goal (for me, at least). I'm convinced that ForeignPtr is the right choice. No other representation can give you all of these: - converting arbitrary Ptrs into PackedStrings without copying, including finalizers if necessary (useful for mmap()). - garbage-collectable PackedStrings can be allocated on the heap, no finalizer necessary - PackedStrings can be passed to foreign functions without copying ByteArray# only gives you the second two. You would need two variants of the representation to get all three, and that means extra overhead. Cheers, Simon

Hello Simon, Thursday, April 20, 2006, 1:23:42 PM, you wrote:
it's possible to write lib so it will use any unboxed array and this will work both with ForeignPtr and ByteArray# implementations
- converting arbitrary Ptrs into PackedStrings without copying, including finalizers if necessary (useful for mmap()).
- garbage-collectable PackedStrings can be allocated on the heap, no finalizer necessary
- PackedStrings can be passed to foreign functions without copying
ByteArray# only gives you the second two. You would need two variants of the representation to get all three, and that means extra overhead.
Simon, you skipped the whole idea - the lib should work with any array, including StorableArray too. it should be just Array->ListLike interface converter, with additional code specialized for unmovable arrays. plus UTF-8 support at the top of this: module ArrayToListLike class ListLike head tail ... instance (Array a e) => ListLike (a e) module UTF8PackedString import ArrayToListLike newtype UTF8PackedString = UTF8PackedString (UArray Int Word8) instance ListLike UTF8PackedString well, i'm not sure that this can be implemented with classes. but at least this can be implemented with preprocessor (that is Haskell's module system ;) just like yourself implemented packed strings for 1/2/4-byte wide representations. here, array type and array elements type will be parameters: #define ARRAY UArray #define ELEMENT Word8 #include "Str.hs" #define ARRAY StorableArray #define ELEMENT Word8 #include "Str.hs"
Optimising the library for GHC 6.4.x isn't a goal (for me, at least).
1. i think that we will have really stable release of 6.6 at Fall'2007, not earlier 2. ForeignPtr has the following drawbacks even for 6.6: - need more memory. main goal of using packed string is to reduce memory requirements so adding 10-15 bytes is bad idea - can point only to the unmovable memory blocks that may increase overall memory consumption for some apps. general Array can work with any memory - movable heap, unmovable heap, or C-allocated -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

On the topic of a good name for FPS, I think its fairly widely considered that a packed byte "string" shouldn't be considered a String, and thus FastPackedString is a potentially confusing misnomer. It was always meant as a working title until something replaced Data.PackedString anyway. If this library is to be imported into the base libraries, along with future PackedString.Unicode and so on, layers on top, we should probably get the name right now. I'm disinclined to call it a ByteArray module -- it doesn't really offer array-like operations. And its got nothing much to do with the other Array.* stuff. Instead, how about: Data.ByteString (with a connotation of IntMap) ? That seems to suggest both the stringy-ness of the api, but also the restriction to bytes. -- Don

I agree with Don about the current name. If FastPackString is confusing because it isn't a string, doesn't ByteString suffers from the same problem? Seth On Thu, 20 Apr 2006 13:26:27 +1000 dons@cse.unsw.edu.au (Donald Bruce Stewart) wrote:
On the topic of a good name for FPS, I think its fairly widely considered that a packed byte "string" shouldn't be considered a String, and thus FastPackedString is a potentially confusing misnomer. It was always meant as a working title until something replaced Data.PackedString anyway.
If this library is to be imported into the base libraries, along with future PackedString.Unicode and so on, layers on top, we should probably get the name right now.
I'm disinclined to call it a ByteArray module -- it doesn't really offer array-like operations. And its got nothing much to do with the other Array.* stuff.
Instead, how about: Data.ByteString (with a connotation of IntMap) ?
That seems to suggest both the stringy-ness of the api, but also the restriction to bytes.
-- Don _______________________________________________ Libraries mailing list Libraries@haskell.org http://www.haskell.org/mailman/listinfo/libraries

seth:
I agree with Don about the current name. If FastPackString is confusing because it isn't a string, doesn't ByteString suffers from the same problem?
I'm not sure if you're agreeing with me Seth :) I do think FastPackedString is not a suitable for a standard Haskell library as long as it only works on byte strings. So, to make crystal clear what the story is, I propose ByteString, following IntMap, to state this. I think its important to keep 'String' in the name, since its a string-mangling library.
Seth
On Thu, 20 Apr 2006 13:26:27 +1000 dons@cse.unsw.edu.au (Donald Bruce Stewart) wrote:
On the topic of a good name for FPS, I think its fairly widely considered that a packed byte "string" shouldn't be considered a String, and thus FastPackedString is a potentially confusing misnomer. It was always meant as a working title until something replaced Data.PackedString anyway.
If this library is to be imported into the base libraries, along with future PackedString.Unicode and so on, layers on top, we should probably get the name right now.
I'm disinclined to call it a ByteArray module -- it doesn't really offer array-like operations. And its got nothing much to do with the other Array.* stuff.
Instead, how about: Data.ByteString (with a connotation of IntMap) ?
That seems to suggest both the stringy-ness of the api, but also the restriction to bytes.
-- Don _______________________________________________ Libraries mailing list Libraries@haskell.org http://www.haskell.org/mailman/listinfo/libraries
_______________________________________________ Libraries mailing list Libraries@haskell.org http://www.haskell.org/mailman/listinfo/libraries

dons:
I'm not sure if you're agreeing with me Seth :)
I do think FastPackedString is not a suitable for a standard Haskell library as long as it only works on byte strings.
*cough* "not a suitable _name_ for a standard Haskell library..."

Donald Bruce Stewart wrote:
I do think FastPackedString is not a suitable for a standard Haskell library as long as it only works on byte strings.
How about making it polymorphic then? It could work on everything Storable, with a bunch of specialized funktions for the common Word8/Word16/Word32 and Char variants. Udo. -- When someone says "I want a programming language in which I need only say what I wish done," give him a lollipop.

On Wed, 19 Apr 2006, Seth Kurtzberg wrote:
I agree with Don about the current name. If FastPackString is confusing because it isn't a string, doesn't ByteString suffers from the same problem?
I think more accurately FPS is confusing because most people assume string = string of characters, whereas FPS is a string of bytes? -- flippa@flippac.org Ivanova is always right. I will listen to Ivanova. I will not ignore Ivanova's recomendations. Ivanova is God. And, if this ever happens again, Ivanova will personally rip your lungs out!

Hello Donald, Thursday, April 20, 2006, 7:26:27 AM, you wrote:
Instead, how about: Data.ByteString (with a connotation of IntMap) ?
Data.Str and type Str imho is a best thing. but you should support utf-8. Data.Str.Latin1 .. Data.Str.UCS4 may be used for other forms of packed strings -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

How about ByteSequence ?
This is inline with Ross' Data.Sequence and Data.IntMap.
Cheers,
JP.
On 4/20/06, Donald Bruce Stewart
On the topic of a good name for FPS, I think its fairly widely considered that a packed byte "string" shouldn't be considered a String, and thus FastPackedString is a potentially confusing misnomer. It was always meant as a working title until something replaced Data.PackedString anyway.
If this library is to be imported into the base libraries, along with future PackedString.Unicode and so on, layers on top, we should probably get the name right now.
I'm disinclined to call it a ByteArray module -- it doesn't really offer array-like operations. And its got nothing much to do with the other Array.* stuff.
Instead, how about: Data.ByteString (with a connotation of IntMap) ?
That seems to suggest both the stringy-ness of the api, but also the restriction to bytes.
-- Don _______________________________________________ Libraries mailing list Libraries@haskell.org http://www.haskell.org/mailman/listinfo/libraries

Data.ByteVector? But I would expect a library called Data.Byte* to deal in terms of Word8s, not Chars. The objective of the operation is to provide a speedier [Char], so I'd be happy with Data.PackedString(.Latin1). Perhaps the ideal structure would be something like (taking up Udo Stenzel's suggestion): Data.StorableVector - data {- Storable w => -} StorableVector w - implements all the FPS operations - convert between ForeignPtr, StorableVector, StorableArray Data.PackedString.Latin1 Data.PackedString.UTF8 - newtype PackedString = PS (StorableVector Word8) Making all the appropriate specialisations happen might be a challenge, though. An intermediate position is to provide Data.ByteVector and Data.PackedString.* defined in terms of that. Cheers, Simon Jean-Philippe Bernardy wrote:
How about ByteSequence ?
This is inline with Ross' Data.Sequence and Data.IntMap.
Cheers, JP.
On 4/20/06, Donald Bruce Stewart
wrote: On the topic of a good name for FPS, I think its fairly widely considered that a packed byte "string" shouldn't be considered a String, and thus FastPackedString is a potentially confusing misnomer. It was always meant as a working title until something replaced Data.PackedString anyway.
If this library is to be imported into the base libraries, along with future PackedString.Unicode and so on, layers on top, we should probably get the name right now.
I'm disinclined to call it a ByteArray module -- it doesn't really offer array-like operations. And its got nothing much to do with the other Array.* stuff.
Instead, how about: Data.ByteString (with a connotation of IntMap) ?
That seems to suggest both the stringy-ness of the api, but also the restriction to bytes.
-- Don _______________________________________________ Libraries mailing list Libraries@haskell.org http://www.haskell.org/mailman/listinfo/libraries
participants (11)
-
Bulat Ziganshin
-
dons@cse.unsw.edu.au
-
Einar Karttunen
-
Jean-Philippe Bernardy
-
John Meacham
-
Marcin 'Qrczak' Kowalczyk
-
Philippa Cowderoy
-
Seth Kurtzberg
-
Simon Marlow
-
Tomasz Zielonka
-
Udo Stenzel