FPS/Data.ByteString candidate

Following discussion, I've tagged FPS 0.4, a candidate for the base library. Changes:
* Renamed to Data.ByteString(ByteString)
* Improved documentation
* Tweaks to build under ghc 6.6
* Added: getLine, getContents, putStr, putStrLn, zip, unzip, zipWith
* Much faster: elemIndices, lineIndices, split, replicate
* More automagic benchmarks and QuickCheck tests.
As usual, code is here: http://www.cse.unsw.edu.au/~dons/fps.html -- Don

On Sun, Apr 23, 2006 at 05:27:43PM +1000, Donald Bruce Stewart wrote:
Following discussion, I've tagged FPS 0.4, a candidate for the base library. Changes:
* Renamed to Data.ByteString(ByteString) * Improved documentation * Tweaks to build under ghc 6.6 * Added: getLine, getContents, putStr, putStrLn, zip, unzip, zipWith * Much faster: elemIndices, lineIndices, split, replicate * More automagic benchmarks and QuickCheck tests.
Can we get rid of every reference to 'Char' in the interface? A search and replace setting them to 'Word8' should do it. Casting between Word8 and Char is just very wrong. A Char-based FastString can be built on top of it, but we want to be type-safe in any interface. John -- John Meacham - ⑆repetae.net⑆john⑈

john:
On Sun, Apr 23, 2006 at 05:27:43PM +1000, Donald Bruce Stewart wrote:
Following discussion, I've tagged FPS 0.4, a candidate for the base library. Changes:
* Renamed to Data.ByteString(ByteString) * Improved documentation * Tweaks to build under ghc 6.6 * Added: getLine, getContents, putStr, putStrLn, zip, unzip, zipWith * Much faster: elemIndices, lineIndices, split, replicate * More automagic benchmarks and QuickCheck tests.
Can we get rid of every reference to 'Char' in the interface? a search and replace setting them to 'Word8' should do it. Casting between Word8 and Char is just very wrong. a Char based FastString can be built on top of it, but we want to be typesafe in any interface.
Ok. I appreciate this concern. I'll follow Simon Marlow's library here and partition it into something like:
Data.ByteString -- the core ByteString and Word8 operations
Data.PackedString.Latin1 -- Char-level packed string functions
John (and Ashley?) would this be ok? -- Don

dons:
john:
On Sun, Apr 23, 2006 at 05:27:43PM +1000, Donald Bruce Stewart wrote:
Following discussion, I've tagged FPS 0.4, a candidate for the base library. Changes:
* Renamed to Data.ByteString(ByteString) * Improved documentation * Tweaks to build under ghc 6.6 * Added: getLine, getContents, putStr, putStrLn, zip, unzip, zipWith * Much faster: elemIndices, lineIndices, split, replicate * More automagic benchmarks and QuickCheck tests.
Can we get rid of every reference to 'Char' in the interface? a search and replace setting them to 'Word8' should do it. Casting between Word8 and Char is just very wrong. a Char based FastString can be built on top of it, but we want to be typesafe in any interface.
Ok, here's what I've done: http://www.cse.unsw.edu.au/~dons/fps/new/ The code is in the darcs repo: http://www.cse.unsw.edu.au/~dons/fps.html The code has been partitioned into:
Data.ByteString -- a Word8-only layer. All functions are in terms of Word8.
Data.ByteString.Char -- an ascii/byte-Char layer over the Word8 layer.
So essentially this is the Data.ByteString that John and Ashley were looking for, I think, now with an explicit Char layer for people like Simon, Einar and me, who need the convenience of literal Chars. This separation means there is now an encoding-agnostic Word8 level of high-performance code, which should be generally useful. I'm quite happy with this now; the code is a lot cleaner, and hopefully this will appease the people who disliked the intermingling of Char and Word8 code, which I agree was unsatisfactory. Opinions? -- Don
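As a rough illustration of what such a Char layer amounts to (a sketch only, not the library's actual source; the helper names are my own assumptions), each Char operation is the corresponding Word8 operation composed with a byte/Char conversion:

-- Sketch: a Char layer over the Word8 ByteString. Chars are silently
-- truncated to 8 bits on the way in.
import Data.Char (chr, ord)
import Data.Word (Word8)
import qualified Data.ByteString as B

c2w :: Char -> Word8
c2w = fromIntegral . ord        -- truncates code points > 255

w2c :: Word8 -> Char
w2c = chr . fromIntegral

packChars :: String -> B.ByteString
packChars = B.pack . map c2w

unpackChars :: B.ByteString -> String
unpackChars = map w2c . B.unpack

indexChar :: B.ByteString -> Int -> Char
indexChar s = w2c . B.index s

Whether c2w should truncate silently (the Char8 view) or reject out-of-range Chars (a true Latin1 module) is exactly what the rest of the thread argues about.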

dons@cse.unsw.edu.au (Donald Bruce Stewart) writes: I already voiced this on IRC, but if you'll forgive me, I'll sum up my small minority report.
This separation means there is now an encoding-agnostic Word8 level of high-performance code, which should be generally useful.
I'm very happy with the separation, and I think using the Latin-1 charset as the default is the right choice. The only thing I am unhappy about is the last-minute name change, which means that the interpretation as Latin-1 is no longer explicit to the user. A naive user may think that the anonymously named Char module interprets the locale, for instance, or might disregard the character set issues entirely and be confused when a string literal using characters with code points > 255 doesn't work as expected. In addition, it is natural to extend this with other character sets, but it is no longer obvious where to put sibling modules implementing the same Char functionality with a different (single-byte) encoding. Quite frankly, I don't see any advantage in selecting one particular encoding and then disguising the fact from the user.
Opinions?
Well, you did ask. Thanks for the good work, I'm currently benchmarking my programs to check what the current Char IO costs are, and if my suspicions are corroborated, I'll spend some time this week to switch. -k -- If I haven't seen further, it is by standing in the footprints of giants

On Tue, 2006-04-25 at 21:12 +1000, Donald Bruce Stewart wrote:
Ok, here's what I've done: http://www.cse.unsw.edu.au/~dons/fps/new/
I'm quite happy with this now, the code is a lot cleaner, and hopefully this will appease the people who disliked the intermingling of Char and Word8 code, which I agree was unsatisfactory.
Opinions?
It's looking good, Don. I think the Word8/Char split was the right way to go. On that theme, there are a few more functions for which we should consider which of the Word8/Char modules would be their best home. If we are saying that the base module is encoding-agnostic then I think these functions should probably move to the .Char module, since they make assumptions about the encoding of things like '\n':
breakSpace, dropSpace, dropSpaceEnd, lines, words, unlines, unwords, lines', unlines', linesCRLF', unlinesCRLF', unwords', lineIndices
not sure about: betweenLines
Duncan

Donald Bruce Stewart wrote:
The code has been partioned into: Data.ByteString a Word8 only layer. All functions are in terms of Word8 Data.ByteString.Char provides an ascii/byte-Char layer over the Word8 layer.
Ok, but where would we put a UTF8 version of the Char layer? I'm thinking that "Latin1" would be more correct than "Char", and leaves room for adding UTF8 and other encodings later. Cheers, Simon

On Tue, 2006-04-25 at 13:08 +0100, Simon Marlow wrote:
Donald Bruce Stewart wrote:
The code has been partitioned into: Data.ByteString a Word8 only layer. All functions are in terms of Word8 Data.ByteString.Char provides an ascii/byte-Char layer over the Word8 layer.
Ok, but where would we put a UTF8 version of the Char layer? I'm thinking that "Latin1" would be more correct than "Char", and leaves room for adding UTF8 and other encodings later.
As others have pointed out, it's not strictly Latin1. Don and I reckon it's probably safe to say that the current Data.ByteString.Char layer is ok for any 8-bit fixed-width encoding with ASCII as a subset, so that means it's probably ok for many of the Latin* encodings. How would we distinguish a full fixed-width 4-byte Unicode version? A purist might say that this should be Data.ByteString.Char, since a Char really is a 4-byte Unicode value, and then change the current Data.ByteString.Char to be Data.ByteString.Char8 or something like that. Duncan

On Tue, 2006-04-25 at 13:13 +0100, Duncan Coutts wrote:
On Tue, 2006-04-25 at 13:08 +0100, Simon Marlow wrote:
Donald Bruce Stewart wrote:
The code has been partitioned into: Data.ByteString a Word8 only layer. All functions are in terms of Word8 Data.ByteString.Char provides an ascii/byte-Char layer over the Word8 layer.
Ok, but where would we put a UTF8 version of the Char layer? I'm thinking that "Latin1" would be more correct than "Char", and leaves room for adding UTF8 and other encodings later.
As others have pointed out, it's not strictly Latin1. Don and I reckon it's probably safe to say that the current Data.ByteString.Char layer is ok for any 8-bit fixed-width encoding with ASCII as a subset, so that means it's probably ok for many of the Latin* encodings.
How would we distinguish a full fixed-width 4-byte Unicode version? A purist might say that this should be Data.ByteString.Char since a Char really is a 4-byte Unicode value and then change the current Data.ByteString.Char to be Data.ByteString.Char8 or something like that.
Actually, after further discussion we think that strictly Data.ByteString.Char will only fully work with Latin1, because only for Latin1 will the Chars we get back be genuine Unicode code points (since the first 256 code points of Unicode are the same as Latin1 - or so I am told). For other Latin encodings what you get back will only be a Unicode code point for chars < 127. So for other Latin encodings you'd need different implementations of w2c & c2w that map the 256 chars to/from the correct Unicode code points. So that suggests that we might want to call it Data.ByteString.Latin1. At this point we wish we had parameterisable modules so we could have various other encodings just by parameterising on the w2c/c2w mappings. Most of the time you could use Data.ByteString.Latin1 for other Latin encodings and get away with it (so long as you don't want to use things like isUpper for chars > 127), which is both a blessing and a curse. Duncan

Duncan Coutts wrote:
How would we distinguish a full fixed-width 4-byte Unicode version?
Good point, and that's why using the Data.PackedString hierarchy was nice, because it accommodated various different character widths. I quite like
Data.ByteString
Data.PackedString.Latin1
Data.PackedString.UTF8
Data.PackedString.UCS4
etc. (but this is getting a bit bikeshedish, so I'll try to resist the temptation to comment any further :-) Cheers, Simon

Hello Simon, Tuesday, April 25, 2006, 5:34:20 PM, you wrote:
Good point, and that's why using the Data.PackedString hierarchy was nice, because it accommodated various different character widths. I quite like
Data.ByteString Data.PackedString.Latin1 Data.PackedString.UTF8 Data.PackedString.UCS4 etc.
I think these module names are great - the first works with just Word8, while the Data.PackedString.* modules work with the different Char representations. -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

On Tue, Apr 25, 2006 at 02:34:20PM +0100, Simon Marlow wrote:
Duncan Coutts wrote:
How would we distinguish a full fixed-width 4-byte Unicode version?
Good point, and that's why using the Data.PackedString hierarchy was nice, because it accommodated various different character widths. I quite like
Data.ByteString Data.PackedString.Latin1 Data.PackedString.UTF8 Data.PackedString.UCS4 etc.
Do we really need all of these? UCS4BE? UTF16? If you care intimately about the underlying binary representation, then you should be using ByteString directly, since you are working with binary data. If you just want a fast string replacement, then you don't care about the internal representation, you just want it to be fast. We don't want issues where someone's library takes UTF8 strings but someone else's takes UCS4 strings and you want them to play nice together. I think all we really need are Data.ByteString and Data.PackedString (though, I suppose, Latin1 could be useful). But note: do the people that want Latin1 just need ASCII? Because it should be noted that if we have a UTF8 PackedString, then we can make ASCII-specific access routines that are just as fast as the ones in the Latin1 variety without giving up the ability to store full Unicode values in the string. John -- John Meacham - ⑆repetae.net⑆john⑈

On 25.04 13:46, John Meacham wrote:
I think all we really need are
Data.ByteString Data.PackedString
(Though, I suppose Latin1 could be useful)
Using the Word8 API is not very pleasant, because all character constants etc. are not Word8. As for Latin1 - what semantics do we use for toUpper/toLower and Ord? Using the Unicode ones or the locale seems the sensible thing if the data really is Latin1. Thus a simple wrapper over the Word8 API is desirable. Make it follow a few simple rules:
* c2w . w2c = id (conversion is a bijection)
* ASCII characters translated correctly
* toLower/toUpper for ASCII
* Ord by byte values.
This is very useful for many purposes and does not mean that there should not be a fancy UTF8 module. Rather than arguing about killing this, wouldn't it be more productive to create the UTF8 module?
but note, do the people that want latin1 just need ASCII? because it should be noted that if we have a UTF8 PackedString, then we can make ASCII-specific access routines that are just as fast as the ones in the Latin1 variety without giving up the ability to store full unicode values in the string.
Case conversions and ordering need to be different. Thus we need to newtype things to avoid having two conflicting Ord instances. The UTF8 layer should provide:
* Unicode toUpper/toLower
* Unicode collation (UCA) for Ord
* Graphemes (see Perl6 for good ways to do this)
* Normalisation
- Einar Karttunen

On Wed, Apr 26, 2006 at 02:16:38AM +0300, Einar Karttunen wrote:
Using the Word8 API is not very pleasant, because all character constants etc are not Word8.
Yeah, but the version restricted to Latin1 seems like a rather special case; I can't imagine it will be used in general internally (and I certainly hope it won't be) unless people are already doing low-level stuff. In this day and age, I expect Unicode to work pretty much everywhere.
This is very useful for many purposes and does not mean that there should not be a fancy UTF8 module. Rather than arguing about killing this, wouldn't it be more productive to create the UTF8 module?
I am not saying we should kill the latin1 version, since there is interest in it, just that it doesn't fill the need for a general fast string replacement.
but note, do the people that want latin1 just need ASCII? because it should be noted that if we have a UTF8 PackedString, then we can make ASCII-specific access routines that are just as fast as the ones in the Latin1 variety without giving up the ability to store full unicode values in the string.
Case conversions and ordering need to be different. Thus we need to newtype things to avoid having two conflicting Ord instances. The UTF8 layer should provide:
I don't see why. ASCII is a subset of UTF8; the routines building a PackedString from an ASCII string or a UTF8 string can be identical. If you know your string is ASCII to begin with you can use an optimized routine, but the end result is the same as if you used the general UTF8 version.
* Unicode toUpper/toLower * Unicode collation (UCA) for Ord * Graphemes (see Perl6 for good ways to do this) * Normalisation
Well, none of these are UTF8-specific; we should not worry about the encoding and just think of what 'PackedString' should do. The encoding is unimportant to the API and semantics; the fact that you just happen to be able to quickly convert to/from ASCII and UTF8 should be the only visible difference in behavior. The proper thing for PackedString is to make it behave exactly as the String instances behave, since it is supposed to be a drop-in replacement. Which means the natural ordering based on the Char order, and the toLower and toUpper from the libraries. Unicode collation, graphemes, normalization, and localized sorting can be provided as separate routines as another project (it would be nice to have them work on both Strings and PackedStrings, so perhaps they could be in a class?). Certainly a newtype LocalizedPackedString = LocalizedPackedString PackedString with different instances would be a useful thing too. But this should be a separate but related project from just getting a fast string replacement (as in, it shouldn't hold up PackedString development). John -- John Meacham - ⑆repetae.net⑆john⑈
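To make the point about ASCII being a byte-for-byte subset of UTF8 concrete, here is a rough sketch (my own illustration, not code from the thread) of encoding a Char's code point as UTF8 bytes; code points below 128 come out as the single identical byte, so a string that is pure ASCII needs no recoding at all:

-- Sketch: encode one Char as UTF8 bytes. ASCII (< 0x80) is one unchanged
-- byte; higher code points take two to four bytes.
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

encodeChar :: Char -> [Word8]
encodeChar ch
  | c < 0x80    = [fromIntegral c]
  | c < 0x800   = [0xC0 .|. hi (c `shiftR` 6), cont c]
  | c < 0x10000 = [0xE0 .|. hi (c `shiftR` 12), cont (c `shiftR` 6), cont c]
  | otherwise   = [0xF0 .|. hi (c `shiftR` 18), cont (c `shiftR` 12),
                   cont (c `shiftR` 6), cont c]
  where
    c = ord ch
    hi :: Int -> Word8
    hi = fromIntegral
    cont :: Int -> Word8
    cont x = 0x80 .|. fromIntegral (x .&. 0x3F)

Decoding, validation and the multi-byte cases are where the real work lies, and that is what the disagreement about a general PackedString is about.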

On 25.04 17:26, John Meacham wrote:
On Wed, Apr 26, 2006 at 02:16:38AM +0300, Einar Karttunen wrote:
Using the Word8 API is not very pleasant, because all character constants etc are not Word8.
yeah, but using the version restricted to latin1 seems rather special case, I can't imagine (or certainly hope) it won't be used in general internally unless people are already doing low level stuff. In this day and age, I expect unicode to work pretty much everywhere.
Like in protocols where some segments may be compressed binary data? And where ASCII character-based matching is used to distinguish header fields, which may carry text data that is actually UTF8?
I am not saying we should kill the latin1 version, since there is interest in it, just that it doesn't fill the need for a general fast string replacement.
It mostly fills the "I want to use the Word8 module with a nicer API" place. But most of the time it may not be Latin1. If we implement a Latin1 module then we should implement it properly. Also, if we implement Latin1 there is a case for implementing Latin2-5 as well. Of course the people really arguing for this module are not interested in a proper Latin1 implementation but just want the agnostic ASCII superset. I think the wishes on the libraries list have been mainly:
* UTF8
* Word8 interface
* "ASCII superset"
The easiest way seems to be to have three modules - one for each. Then we get to the naming part. I would like:
* Data.ByteString.Word8
* Data.ByteString.Char8
* Data.ByteString.UTF
And select your favorite and make Data.ByteString export that one. I think that could be the Word8 or the UTF one.
I don't see why. ascii is a subset of utf8, the routines building a packedstring from an ascii string or a utf8 string can be identical, if you know your string is ascii to begin with you can use an optimized routine but the end result is the same as if you used the general utf8 version.
Actually, toUpper works differently on "ASCII plus something in the high bytes" than on ISO-8859-1. Same with all the isXXX predicates; fortunately that's not a problem for things like whitespace.
the proper thing for PackedString is to make it behave exactly as the String instances behave, since it is suposed to be a drop in replacement. Which means the natuarl ordering based on the Char order and the toLower and toUpper from the libraries.
toUpper and toLower are the correct versions in the standard, and they use the Unicode tables. The natural ordering by code point without any normalization is not very useful for text handling, but works for e.g. putting strings in a Map.
uncode collation, graphemes, normalization, and localized sorting can be provided as separate routines as another project (it would be nice to have them work on both Strings and PackedStrings, so perhaps they could be in a class?)
These are quite essential for really working with unicode characters. It didn't matter much before as Haskell didn't provide good ways to handle unicode chars with IO, but these are very important, otherwise it becomes hard to do many useful things with the parsed unicode characters. How are we supposed to process user input without normalization e.g. if we need to compare Strings for equivalence? But a simple UTF8 layer with more features added later is a good way. - Einar Karttunen

On Wed, Apr 26, 2006 at 04:48:52AM +0300, Einar Karttunen wrote:
I would like: * Data.ByteString.Word8 * Data.ByteString.Char8 * Data.ByteString.UTF
And select your favorite and make Data.ByteString export that one. I think that could be the Word8 or the UTF one.
ByteString should be the pure Word8 version; the others can be based on it. ByteString is quite a useful data type independent of anything to do with strings. I'd like to see Data.PackedString be what you are calling Data.ByteString.UTF, and PackedString _specifically_ be a drop-in replacement for String with an abstract internal representation that behaves the same as String except when it comes to time and space. I want to be able to just change a few types and routines from String to PackedString in a library and be guaranteed I am not affecting the meaning of a program (or vice versa). Though, I do much, much prefer the 'Char8' term to 'Latin1'. I think it better represents what it does, just 'Chars truncated to 8 bits', while 'Latin1' might have other unintended connotations. The fact that the standard routines will interpret them as Latin1 can be inferred from the fact that the standard routines interpret Chars as Unicode code points. In particular, if you do something wacky where you don't store Unicode values in a 'Char', it doesn't magically become 'Latin1' just because you store it in a Latin1 string; it just becomes whatever you put in, truncated to 8 bits, and hopefully you know what you are doing.
I don't see why. ascii is a subset of utf8, the routines building a packedstring from an ascii string or a utf8 string can be identical, if you know your string is ascii to begin with you can use an optimized routine but the end result is the same as if you used the general utf8 version.
Actually toUpper works differently on ascii + something in the high bytes and ISO-8859-1. Same with all the isXXX predicates, fortunately not a problem for things like whitespace.
I am not sure what you mean; the data would always be UTF8-encoded full Unicode values in a PackedString, there would just be efficient ways to pull in data you know is ASCII, since it can use a memcpy rather than recoding it from whatever format it is in. The fact that it happens to contain only values < 128 won't make a difference for subsequent handling of the string (except perhaps some routines will be faster). When I say ASCII here, I just mean a UTF8 string where all values happen to be < 128, which is happily binary compatible with ASCII.
the proper thing for PackedString is to make it behave exactly as the String instances behave, since it is suposed to be a drop in replacement. Which means the natuarl ordering based on the Char order and the toLower and toUpper from the libraries.
toUpper and toLower are the correct version in the standard and they use the unicode tables. The natural ordering by codepoint without any normalization is not very useful for text handling, but works for e.g. putting strings in a Map.
Yeah, and it is fast. I always thought we should have two Ord classes, one for human-digestible ordering and the other for fast implementation-dependent ordering for use only in things like Map and Set. But that is a different issue. In any case, the point I was trying to make is that PackedString should behave exactly like String; whether the instances for String are doing the right thing is a different matter.
uncode collation, graphemes, normalization, and localized sorting can be provided as separate routines as another project (it would be nice to have them work on both Strings and PackedStrings, so perhaps they could be in a class?)
These are quite essential for really working with unicode characters. It didn't matter much before as Haskell didn't provide good ways to handle unicode chars with IO, but these are very important, otherwise it becomes hard to do many useful things with the parsed unicode characters.
Yeah, they would be useful things to have, but there's no need to tie them specifically to PackedString (though they would most likely operate on PackedStrings). ginsu and jhc both use Unicode extensively without these routines, so saying it is hard to do useful things is somewhat strong. But they would definitely be very useful to have, and necessary for certain applications.
How are we supposed to process user input without normalization e.g. if we need to compare Strings for equivalence?
we implement normalization and provide it as a library :)
But a simple UTF8 layer with more features added later is a good way.
I don't think these features should be in PackedString proper unless they are added to String as well (as in, in the default instances). However, a 'UnicodeString' that is a newtype of PackedString would be easy enough, with just different instance declarations. The library routines for performing these transformations can of course be provided in PackedString, if that makes sense and they don't conflict with any String operations of the same name. But being able to do 'normalize a == normalize b' would be useful for PackedStrings independent of UnicodeString. John -- John Meacham - ⑆repetae.net⑆john⑈

On Wed, 2006-04-26 at 02:16 +0300, Einar Karttunen wrote:
This is very useful for many purposes and does not mean that there should not be a fancy UTF8 module. Rather than arguing about killing this, wouldn't it be more productive to create the UTF8 module?
I've been following this thread with some frowning. I can see that some people want to dish out text over the network *really fast* and thus would like the ability to emit pure ASCII without the overhead of 4 bytes per character. Still, I don't see the need for a .Latin1 module next to a .Word8 module. When it comes to UTF8, I cringe. Dealing with UTF8 is such a nightmare to get right, and it won't show up until you test some Chinese texts with it (or are there other common 4-byte characters?). Hence, UTF8 should not be a common interface for application developers. Haskell has the advantage that changing Char from 8 bits to 32 bits doesn't add to the space consumption of lists. With packed strings the situation is different, but still, I propose to
- have a library that deals with packed strings of 32-bit Haskell Char
- have a library that deals with packed Word8 sequences
This way, it will hurt if you touch the bare-metal Word8 representation, but then, using Word8 sequences is quite an optimisation that you don't use when you start developing an application. A simplistic solution like this avoids the whole discussion on whether there should be an Ord or toUpper for Latin1, or how to coerce a packed Latin1 string to a packed Word8 representation. Axel.

A.Simon:
different, but still, I propose to
- have a library that deals with packed strings of 32-bit Haskell Char - have a library that deals with packed Word8 sequences
This way, it will hurt if you touch the bare-metal Word8 representation, but then, using Word8 sequences is quite an optimisation that you don't use when you start developing an application. A simplistic solution like this avoids the whole discussion on whether there should be an Ord or toUpper for Latin1, or how to coerce a packed Latin1 string to a packed Word8 representation.
I'd like to say that all I want to do is have the Word8 "bare metal" layer, and a minimal Char8 layer on top (where all conversions are equivalent to id) to make the fast layer usable for speed-is-everything projects. If we don't add the Char8 layer, the projects will end up having to write their own anyway, since Word8 literals are unbearable. This is what's currently implemented. I'm providing the 2nd part of your plan above, with a little sugar for people like me and Einar who need it. The first part, 32-bit packed Haskell strings, is another piece of work. I'm not sure we need 5 kinds of Foo-encoding layers, and I don't plan to write them. -- Don
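As a concrete illustration of the 'Word8 literals are unbearable' point (module names as proposed in this thread; I'm assuming the Char8 functions mirror the Word8 ones):

-- Assumed layout per the thread: Data.ByteString is the Word8 layer,
-- Data.ByteString.Char8 a Char layer over the same bytes.
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as C

requestLine :: B.ByteString
requestLine = B.pack [71,69,84,32,47,32,72,84,84,80,47,49,46,48,13,10]  -- "GET / HTTP/1.0\r\n"

requestLine' :: C.ByteString
requestLine' = C.pack "GET / HTTP/1.0\r\n"   -- the same bytes, readable at the source level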

On Wed, 2006-04-26 at 18:35 +1000, Donald Bruce Stewart wrote:
A.Simon:
different, but still, I propose to
- have a library that deals with packed strings of 32-bit Haskell Char - have a library that deals with packed Word8 sequences
This way, it will hurt if you touch the bare-metal Word8 representation, but then, using Word8 sequences is quite an optimisation that you don't use when you start developing an application. A simplistic solution like this avoids the whole discussion on whether there should be an Ord or toUpper for Latin1, or how to coerce a packed Latin1 string to a packed Word8 representation.
I'd like to say that all I want to do is have the Word8 "bare metal" layer, and a minimal Char8 layer on top (where all conversions are equivalent to id) to make the fast layer usable for speed-is-everything projects. If we don't add the Char8 layer, the projects will end up having to write their own anyway, since Word8 literals are unbearable.
I don't understand the need for the Char8 layer. How are Char8 literals different from Word8 literals? You couldn't use "string" or 's' either way.
This is what's currently implemented.
I'm providing the 2nd part of your plan above, with a little sugar for people like me and Einar who need it. The first part, 32-bit packed haskell strings, is another piece of work.
Ok, fair enough. Axel.

Hello Donald, Wednesday, April 26, 2006, 12:35:16 PM, you wrote:
I'm not sure we need 5 kinds of Foo-encoding layers, and I don't plan to write them.
Let's count:
Latin1 - already written by you
UTF8 - requested by many people here, required to work with the compiler's input in ghc/jhc, and is the most compact representation for general strings
UCS4 - already implemented in Data.PackedString, the fastest way to work with general strings (I mean faster indexing and other direct-index ops)
UTF16 - used in the Windows API, so its implementation will be really useful to simplify that API's implementation and to allow application programs to work directly with such strings instead of converting them from/to some other format
Btw, what will be really useful now, imho, is the interface to Text.Regex. How about working on it as the next stage? And one more suggestion - you can significantly speed up your code by importing 6.5's ForeignPtr implementation inside your library. This type almost never appears in ByteString's external interface, so this should not be such a huge amount of work. -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

bulat.ziganshin:
Hello Donald, btw, what will be really useful now, imho, is the interface to Text.Regex. how about working on it as next stage?
This is already done actually, here: http://www.cse.unsw.edu.au/~dons/code/lambdabot/Lib/Regex.hsc http://www.cse.unsw.edu.au/~dons/code/hmp3/Regex.hsc :)
and one more suggestion - you can significantly speedup your code by importing the 6.5's ForeignPtr implementation inside your library. This type almost don't appears in ByteString external interface, so this should be not so huge work
Ah! That's a good idea. -- Don

Hello Donald, Wednesday, April 26, 2006, 2:19:34 PM, you wrote:
btw, what will be really useful now, imho, is the interface to Text.Regex. how about working on it as next stage?
This is already done actually, here: http://www.cse.unsw.edu.au/~dons/code/lambdabot/Lib/Regex.hsc http://www.cse.unsw.edu.au/~dons/code/hmp3/Regex.hsc
Please include it in your lib, it is a very useful thing imho. -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

On Wed, Apr 26, 2006 at 08:19:34PM +1000, Donald Bruce Stewart wrote:
bulat.ziganshin:
Hello Donald, btw, what will be really useful now, imho, is the interface to Text.Regex. how about working on it as next stage?
This is already done actually, here: http://www.cse.unsw.edu.au/~dons/code/lambdabot/Lib/Regex.hsc http://www.cse.unsw.edu.au/~dons/code/hmp3/Regex.hsc
I have a regex interface to PCRE and some neat typeclass tricks to give you Perl's (=~) operator, but much more powerful, here: http://repetae.net/john/computer/haskell/JRegex/ It would be nice to get a PCRE binding in the libraries if it is available. If there is interest in including this in the fptools libraries I can revisit and clean up/modernize the code. John -- John Meacham - ⑆repetae.net⑆john⑈
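The 'typeclass trick' is roughly that the result type of the match decides what the operator returns. A toy sketch of the idea (illustrative only; it is backed by plain substring search rather than PCRE and is not the actual JRegex API):

{-# LANGUAGE FlexibleInstances #-}
import Data.List (findIndex, isPrefixOf, tails)

-- The result type selects how much information the match returns.
class RegexResult r where
  fromMatch :: Maybe Int -> r          -- position of the first match, if any

instance RegexResult Bool where
  fromMatch = maybe False (const True) -- "did it match?"

instance RegexResult (Maybe Int) where
  fromMatch = id                       -- "where did it match?"

-- Toy (=~): substring search standing in for a real regex engine.
(=~) :: RegexResult r => String -> String -> r
subject =~ pat = fromMatch (findIndex (pat `isPrefixOf`) (tails subject))

-- ("haskell" =~ "kell") :: Bool      ==> True
-- ("haskell" =~ "kell") :: Maybe Int ==> Just 3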

john:
On Wed, Apr 26, 2006 at 08:19:34PM +1000, Donald Bruce Stewart wrote:
bulat.ziganshin:
Hello Donald, btw, what will be really useful now, imho, is the interface to Text.Regex. how about working on it as next stage?
This is already done actually, here: http://www.cse.unsw.edu.au/~dons/code/lambdabot/Lib/Regex.hsc http://www.cse.unsw.edu.au/~dons/code/hmp3/Regex.hsc
I have a regex interface to PCRE and some neat typeclass tricks to give you perls (=~) operator but much more powerful here.
http://repetae.net/john/computer/haskell/JRegex/
It would be nice to get a PCRE binding in the libraries if it is available.
if there is interest in including this in the fptools libraries I can revisit and clean-up/modernize the code.
We really longed for a high-performance regex lib in the standard libraries while working on the shootout earlier this year. Text.Regex is far too inefficient due to all the pack/unpackings, and even then C's regexes aren't so great. In fact, Chris K ended up writing Text.Regex.Lazy as a result of this effort. Here's a nice benchmark for your code: http://shootout.alioth.debian.org/gp4/benchmark.php?test=regexdna&lang=all I wonder if JRegex would give us a faster entry? After fast IO, regexes are the other thing we need to improve for ghc 6.6, I think. So at least the people who worked on the shootout would be interested :) -- Don

John Meacham wrote:
On Wed, Apr 26, 2006 at 08:19:34PM +1000, Donald Bruce Stewart wrote:
bulat.ziganshin:
Hello Donald, btw, what will be really useful now, imho, is the interface to Text.Regex. how about working on it as next stage?
This is already done actually, here: http://www.cse.unsw.edu.au/~dons/code/lambdabot/Lib/Regex.hsc http://www.cse.unsw.edu.au/~dons/code/hmp3/Regex.hsc
I have a regex interface to PCRE and some neat typeclass tricks to give you perls (=~) operator but much more powerful here.
http://repetae.net/john/computer/haskell/JRegex/
It would be nice to get a PCRE binding in the libraries if it is available.
if there is interest in including this in the fptools libraries I can revisit and clean-up/modernize the code.
Actually yes, I did intend to replace/extend Text.Regex with JRegex at some point. Plus we can include PCRE, since it has a BSD license - maybe it can replace the POSIX regex implementation that we have in GHC right now (which was taken from FreeBSD's libc). I imagine doing this as part of the library reorg we have planned for 6.6. http://hackage.haskell.org/trac/ghc/ticket/710 Cheers, Simon

I hope you'll forgive me for re-advertising my FPS modifications. I've started over from Don's sources (please don't use my old fps repo), refactored, and reworked my changes into that. The refactored repo (all functionality and performance identical to the original): http://www.ii.uib.no/~ketil/src/fps-wrapped Repo with added Latin1 and ASCII support: http://www.ii.uib.no/~ketil/src/fps-i18n Latin1 functions are equal to Char8, but packing chars > 255 will give an error. ASCII does the same, but stores characters > 127 out of harm's way. Adding support for new character sets requires defining four functions and three constants, and #include'ing a common file. In addition, some nice properties hold, for instance:
s1 > s2 => pack s1 > pack s2
w2c . c2w == id -- provided no error
c2w . w2c == id -- total function
Only the latter holds for Char8. Latin1 has been tested with the Char8 QC tests, and they have all been subjected to the benchmark suite, results at http://www.ii.uib.no/~ketil/src/bench.txt (This is using /usr/share/word/dict) Packing and unpacking isn't part of the benchmark, but is expected to be around 10% slower than for Char8. I have no explanation why 'map' and 'split' are faster. -k -- If I haven't seen further, it is by standing in the footprints of giants
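Those properties translate directly into QuickCheck form. A self-contained sketch (the conversion here is a local Latin1-style stand-in using the same names Ketil uses, not his actual code):

import Data.Char (chr, ord)
import Data.Word (Word8)
import Test.QuickCheck

c2w :: Char -> Word8
c2w c | ord c < 256 = fromIntegral (ord c)
      | otherwise   = error "c2w: Char out of Latin1 range"

w2c :: Word8 -> Char
w2c = chr . fromIntegral

pack :: String -> [Word8]            -- list stand-in for the packed type
pack = map c2w

-- Ordering is preserved by pack, for Chars in range.
prop_ordPreserved :: String -> String -> Property
prop_ordPreserved s1 s2 =
  all ((< 256) . ord) (s1 ++ s2) ==>
    compare s1 s2 == compare (pack s1) (pack s2)

-- c2w . w2c == id is total on Word8.
prop_bytesRoundTrip :: Word8 -> Bool
prop_bytesRoundTrip w = c2w (w2c w) == w

main :: IO ()
main = quickCheck prop_ordPreserved >> quickCheck prop_bytesRoundTrip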

On Thu, Apr 27, 2006 at 10:25:53AM +0100, Simon Marlow wrote:
Actually yes, I did intend to replace/extend Text.Regex with JRegex at some point. Plus we can include PCRE, since it has a BSD license - maybe it can replace the POSIX regex implementation that we have in GHC right now (which was taken from FreeBSD's libc).
That would actually be nice; the PCRE that comes with most systems isn't compiled with Unicode support enabled. However, you can set flags when compiling PCRE to let it handle UTF8 directly. John -- John Meacham - ⑆repetae.net⑆john⑈

dons@cse.unsw.edu.au (Donald Bruce Stewart) writes:
I'd like to say that all I want to do is have the Word8 "bare metal" layer, and a minimal Char8 layer layer on top
This is what's currently implemented.
I've now added a Latin1 module that works like Char8, but where packing a Char > 255 is an error. This means some extra checking; packing 45M characters from [Char] to ByteString slows down from (very roughly) 6.4 to 5.6 Mb/s. Many (but probably less important) operations will be faster: checking whether a c >= 256 is an `elem` of a Latin1 ByteString will be O(1) and always False (Char8 would need to scan the string for c `mod` 256). Feel free to grab, read, criticize, or benchmark: darcs get http://www.ii.uib.no/~ketil/src/fps -k -- If I haven't seen further, it is by standing in the footprints of giants

Ketil Malde
I've now added a Latin1 module, that works like Char8, but where
And now there's an ASCII module which, instead of storing bytes > 127 in the Latin1 range, puts them in the "private" area of 0xF000..0xF07F. This way, they won't be affected by other Char functions depending on case etc. Packing Chars outside of this area is still an error (i.e. no 8-bit truncation). IMHO this is the correct way to provide a Char interface to ASCII (albeit at a performance penalty), but I simply can't wait to hear what other people have to say about the matter. :-) -k -- If I haven't seen further, it is by standing in the footprints of giants
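A sketch of the kind of mapping this implies (my guess at the details, inferred from the 0xF000..0xF07F range given above; not the actual code):

import Data.Char (chr, ord)
import Data.Word (Word8)

-- Bytes 0-127 map to themselves; bytes 128-255 are parked in the Unicode
-- private use area so that no Char predicate mistakes them for letters.
w2c :: Word8 -> Char
w2c w | w < 128   = chr (fromIntegral w)
      | otherwise = chr (0xF000 + fromIntegral w - 128)

c2w :: Char -> Word8
c2w c | x < 128                    = fromIntegral x
      | x >= 0xF000 && x <= 0xF07F = fromIntegral (x - 0xF000 + 128)
      | otherwise                  = error "c2w: Char not representable"
  where x = ord c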

simonmarhaskell:
Donald Bruce Stewart wrote:
The code has been partitioned into: Data.ByteString a Word8 only layer. All functions are in terms of Word8 Data.ByteString.Char provides an ascii/byte-Char layer over the Word8 layer.
Ok, but where would we put a UTF8 version of the Char layer? I'm thinking that "Latin1" would be more correct than "Char", and leaves room for adding UTF8 and other encodings later.
Ok. Einar had some concerns that Latin1 wasn't the most accurate, as he uses the Char ops for more general purposes. But Data.ByteString.Latin1 would probably be ok for me. Or Data.ByteString.Char8 perhaps. -- Don

On Tue, Apr 25, 2006 at 09:12:55PM +1000, Donald Bruce Stewart wrote:
I'm quite happy with this now, the code is a lot cleaner, and hopefully this will appease the people who disliked the intermingling of Char and Word8 code, which I agree was unsatisfactory.
Opinions?
Well, it is all great, except it doesn't provide what we really want: a drop-in fast replacement for Haskell strings :) I'd like to see the Char and String names reserved for things that actually can represent Chars and Strings. The internal representation can be completely abstract, based on the ByteString data type (though I am partial to UTF8). Not that a 'Latin1' module couldn't be provided too. But I don't think the module should be called 'Char'. John -- John Meacham - ⑆repetae.net⑆john⑈

Ok, I've tried to incorporate the suggestions from yesterday's discussion. API: http://www.cse.unsw.edu.au/~dons/fps/new/ Src: http://www.cse.unsw.edu.au/~dons/fps.html Changes:
* Char functions live in Data.ByteString.Char8
* Improved docs
* Anything that needs Data.Char is now in Char8 (lines, words..)
* Confirmed that Char8 runs at the same speed as the Word8 layer
* isSuffix is about 100x faster.
No claims are made about being a 'Char' packed string library, and no claims about encodings: Char8.hs is just a no-op layer over the underlying Data.ByteString Word8 ops. I'm wary of claiming 'PackedString' status; as John says, it isn't a drop-in replacement, so Data.ByteString.Char8 seems fine to me. -- Don

On Wed, Apr 26, 2006 at 01:21:17PM +1000, Donald Bruce Stewart wrote:
I'm wary of claiming 'PackedString' status, as John says, it isn't a drop in replacement, so Data.ByteString.Char8 seems fine to me.
I like it a lot; perfect for what it does. Is someone working on a Data.PackedString or should I have a go at it? Should I send patches to your darcs repo? John -- John Meacham - ⑆repetae.net⑆john⑈

john:
On Wed, Apr 26, 2006 at 01:21:17PM +1000, Donald Bruce Stewart wrote:
I'm wary of claiming 'PackedString' status, as John says, it isn't a drop in replacement, so Data.ByteString.Char8 seems fine to me.
I like it a lot; perfect for what it does. Is someone working on a Data.PackedString or should I have a go at it? Should I send patches to your darcs repo?
I don't think anyone is working on it at the moment. And I'm happy for patches, or maybe it should be another repo (so I can just concentrate on getting Data.ByteString into the base libs). No matter. Also, today I checked that Data.ByteString.* runs in Hugs (it does), since it's H98 + FFI + cpp, so now I'm wondering if JHC can compile it...? -- Don

Hello Donald, Thursday, April 27, 2006, 11:09:24 AM, you wrote:
Also, today I checked that Data.ByteString.* runs in hugs (it does),
That's great for debugging Haskell programs.
since its H98 + FFI+ cpp, so now I'm wondering if JHC can compile it...?
And nhc/yhc too. After all, the base libs are a common library for Hugs, GHC and nhc. -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

Donald Bruce Stewart wrote:
Ok, here's what I've done: http://www.cse.unsw.edu.au/~dons/fps/new/
The code has been partioned into: Data.ByteString a Word8 only layer. All functions are in terms of Word8
Do the file-handling Word8 functions always work correctly, or do they do some kind of round-trip Char conversion? We've needed Word8 file access, so this would be very helpful. For instance:
writeFile "myfile" (pack [0..255])
This should always write exactly the bytes 0 to 255, with no text-related weirdness such as charset remapping or newline conversion. -- Ashley Yakeley, Seattle WA WWEWDD? http://www.cs.utexas.edu/users/EWD/

ashley:
Donald Bruce Stewart wrote:
Ok, here's what I've done: http://www.cse.unsw.edu.au/~dons/fps/new/
The code has been partioned into: Data.ByteString a Word8 only layer. All functions are in terms of Word8
Do the file-handling Word8 functions always work correctly, or do they do some kind of round-trip Char conversion? We've needed Word8 file access, so this would be very helpful. For instance:
writeFile "myfile" (pack [0..255])
This should always write exactly the bytes 0 to 255, with no text-related weirdness such as charset remapping or newline conversion.
There is no round trip at all. Nothing is converted to Char in the Word8 code:
Prelude> Data.ByteString.writeFile "myfile" (Data.ByteString.pack [0..255])
$ od -t 'd1' myfile
0000000   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
0000020  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31
0000040  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47
0000060  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63
0000100  64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79
0000120  80  81  82  83  84  85  86  87  88  89  90  91  92  93  94  95
0000140  96  97  98  99 100 101 102 103 104 105 106 107 108 109 110 111
0000160 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127
0000200 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143
0000220 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159
0000240 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175
0000260 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191
0000300 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207
0000320 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223
0000340 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239
0000360 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255
-- Don

On 24.04 16:31, John Meacham wrote:
On Sun, Apr 23, 2006 at 05:27:43PM +1000, Donald Bruce Stewart wrote:
Following discussion, I've tagged FPS 0.4, a candidate for the base library. Changes:
* Renamed to Data.ByteString(ByteString) * Improved documentation * Tweaks to build under ghc 6.6 * Added: getLine, getContents, putStr, putStrLn, zip, unzip, zipWith * Much faster: elemIndices, lineIndices, split, replicate * More automagic benchmarks and QuickCheck tests.
Can we get rid of every reference to 'Char' in the interface? a search and replace setting them to 'Word8' should do it. Casting between Word8 and Char is just very wrong. a Char based FastString can be built on top of it, but we want to be typesafe in any interface.
The Chars in the interface make it much easier to use in production code. Should we also change the type of putStrLn to putStrLn :: [Word8] -> IO ()? I think the name ByteString implies that it uses bytes, and removing the Char functions does not help much. I am mostly using them to handle UTF8 data at the moment and it works quite well; in effect this would mean sprinkling all my code with fromIntegral. The name Latin1 is particularly bad since there are many other single-byte encodings around. We could have a Data.ByteString with a Word8+Char interface and the module name telling us it is about bytes. Then we can have Data.Encode.String.{UTF8,UTF16BE,UTF16LE,UTF32BE,UTF32LE,Latin{1,2,3...}}. - Einar Karttunen

On Tue, Apr 25, 2006 at 12:08:45PM +0300, Einar Karttunen wrote:
The name Latin1 is particularly bad since there are many other single byte encodings around.
The name is quite appropriate, since that is the particular encoding of Char that is exposed by the interface. What's bad is that there's no choice. Calling it Latin1 is just being honest about that, and leaving room for modules with other encodings or an interface parameterized by encoding.

ross:
On Tue, Apr 25, 2006 at 12:08:45PM +0300, Einar Karttunen wrote:
The name Latin1 is particularly bad since there are many other single byte encodings around.
The name is quite appropriate, since that is the particular encoding of Char that is exposed by the interface. What's bad is that there's no choice. Calling it Latin1 is just being honest about that, and leaving room for modules with other encodings or an interface parameterized by encoding.
Ok. Duncan, Ketil, Ross and Simon make good points here. I'll move Data.ByteString.Char -> Data.ByteString.Latin1 -- Don

On Tue, 2006-04-25 at 22:34 +1000, Donald Bruce Stewart wrote:
ross:
On Tue, Apr 25, 2006 at 12:08:45PM +0300, Einar Karttunen wrote:
The name Latin1 is particularly bad since there are many other single byte encodings around.
The name is quite appropriate, since that is the particular encoding of Char that is exposed by the interface. What's bad is that there's no choice. Calling it Latin1 is just being honest about that, and leaving room for modules with other encodings or an interface parameterized by encoding.
Ok. Duncan, Ketil, Ross and Simon make good points here. I'll move Data.ByteString.Char -> Data.ByteString.Latin1
If you want to justify that and provide some concrete spec you can add something like the following to the Data.ByteString.Latin1 docs: Manipulate ByteStrings using Char operations. All Chars will be truncated to 8 bits. More specifically these byte strings are taken to be in the subset of Unicode covered by code points 0-255. This covers Unicode Basic Latin, Latin-1 Supplement and C0+C1 Controls. See: http://www.unicode.org/charts/ http://www.unicode.org/charts/PDF/U0000.pdf http://www.unicode.org/charts/PDF/U0080.pdf One reason to be so specific is that other definitions of character sets commonly called "Latin-1" omit the control characters and so do not cover all bytes 0-255. I think this allows us to justify reinterpreting Word8s as Chars and getting valid Unicode code points. Duncan

On Tue, 2006-04-25 at 22:34 +1000, Donald Bruce Stewart wrote:
ross:
On Tue, Apr 25, 2006 at 12:08:45PM +0300, Einar Karttunen wrote:
The name Latin1 is particularly bad since there are many other single byte encodings around.
The name is quite appropriate, since that is the particular encoding of Char that is exposed by the interface. What's bad is that there's no choice. Calling it Latin1 is just being honest about that, and leaving room for modules with other encodings or an interface parameterized by encoding.
Ok. Duncan, Ketil, Ross and Simon make good points here. I'll move Data.ByteString.Char -> Data.ByteString.Latin1
Ok, one final point from a discussion between me and Einar Karttunen... (I'm mindful of Simon's comment about sheds... :-) ) There are two different common uses of an 8-bit string library, with different assumptions and guarantees. (As it happens, they have the same implementation.) In one use case, we want to be able to guarantee that we can get Chars out of our string and guarantee that they really are Haskell Chars. That is, they are valid Unicode code points which we could pass to functions like isUpper and get valid answers. As an example consider Char 'Â' (chr 0xC2, Latin capital A with circumflex). This is not ASCII but it is clearly upper case. If we don't know that we're working with an 8-bit subset of Unicode then we can't use Unicode properties like isUpper etc. The other common use case is where we have some character string encoding which contains ASCII as a subset. That is, we don't know the encoding exactly (it may be Latin1, LatinN, UTF8, etc) but we do know that ASCII chars 0-127 are represented by those same numbers in our byte stream. Examples where this is useful are in parsing network protocols. There are several of these which use 8-bit extensions of ASCII but where the protocol only gives semantics to chars in the ASCII subset. For this case it would be very inconvenient to have to use an API based just on Word8, but on the other hand we can't give a proper guarantee on being able to turn bytes into Haskell Chars (only for bytes < 127). So what do we do about this? Einar was thinking about an API that might look like this: Data.ByteString.{Char8, Latin1, Latin2, ..., UTF8, ...}
Char8 should provide:
* little overhead
* the right translation for ASCII characters
* c2w . w2c = id
* toUpper and toLower on ASCII
* Ord with raw byte values
Latin1 should guarantee:
* correct translation for Latin1, C0 and C1 characters
* really just a subset of Unicode for character handling
* predicates like toUpper and toLower
* toUpper and toLower per the Unicode definition (there is no common Latin1 definition afaik)
* Ord per UCA (the Unicode collation algorithm)
* or use the locale for toUpper/toLower and Ord.
So basically the .Char8 module is for the ASCII-extension case and the .Latin1 is for the 8-bit Unicode-subset case. I think in fact that darcs would want the .Char8 version, but I expect that many other users will want a library that can guarantee conversions to ordinary Haskell Chars (which involves an assumption about the character encoding). Duncan
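The 'Â' example is easy to check with nothing but Data.Char, and it shows why the guarantee matters: the Unicode character predicates only give sensible answers once a byte has been correctly mapped to a code point.

import Data.Char (chr, isUpper, toLower)

main :: IO ()
main = do
  let a = chr 0xC2          -- U+00C2, Latin capital A with circumflex
  print (isUpper a)         -- True: a genuine upper-case letter
  print (toLower a)         -- '\226' (U+00E2, the lower-case form)
  -- For a byte stream in an unknown ASCII-compatible encoding, neither call
  -- is justified for bytes >= 128 - which is precisely the Char8 use case.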
participants (10)
- Ashley Yakeley
- Axel Simon
- Bulat Ziganshin
- dons@cse.unsw.edu.au
- Duncan Coutts
- Einar Karttunen
- John Meacham
- Ketil Malde
- Ross Paterson
- Simon Marlow