PROPOSAL: New efficient Unicode string library.

Dear haskell-cafe,

I would like to propose a new, ByteString-like, Unicode string library which can be used where both efficiency (currently offered by ByteString) and i18n support (currently offered by vanilla Strings) are needed. I wrote a skeleton draft today but I'm a bit tired so I didn't get all the details down. Nevertheless I think it's fleshed out enough for some initial feedback. If I can get the important parts nailed down before the Hackathon I could hack on it there.

Apologies for not getting everything we discussed on #haskell down in the first draft. It'll get in there eventually.

Bring out your Unicode kung-fu!

http://haskell.org/haskellwiki/UnicodeByteString

Cheers,
Johan Tibell

Johan Tibell wrote:
I would like to propose a new, ByteString-like, Unicode string library which can be used where both efficiency (currently offered by ByteString) and i18n support (currently offered by vanilla Strings) are needed. [...]
Have you looked at my CompactString library [1]? It essentially does exactly this, with one extension: the type is parameterized over the encoding. From the discussion on #haskell it would seem that some people consider this unforgivable, while others consider it essential. In my opinion flexibility should be more important; you can always restrict things later.

For the common case where the encoding doesn't matter there is Data.CompactString.UTF8, which provides an un-parameterized type. I called this type 'CompactString' as well, which might be a bit unfortunate. I don't like the name UnicodeString, since it suggests that the normal String somehow doesn't support Unicode. This module could be made more prominent. Maybe Data.CompactString could be the specialized type, while Data.CompactString.Parameterized supports different encodings.

A word of warning: the library is still in the alpha stage of development. I don't fully trust it myself yet :)

[1] http://twan.home.fmf.nl/compact-string/

Twan
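A rough sketch of the layering suggested above, with an encoding-specific module as a thin front end over a parameterized core (module and function names here are hypothetical, not the existing compact-string API):

    module Data.CompactString
      ( CompactString
      , pack
      , unpack
      ) where

    import qualified Data.CompactString.Parameterized as P

    -- The un-parameterized type is just the parameterized one
    -- specialized to a single blessed internal encoding.
    type CompactString = P.CompactString P.UTF8

    pack :: String -> CompactString
    pack = P.pack

    unpack :: CompactString -> String
    unpack = P.unpack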

Hi, thanks for the proposal.

Why are only questions connected with conversion considered? An i18n library should provide a number of other services, such as normalization, comparison, sorting, etc. Furthermore, it's not so easy to keep such a library up to date. Why not simply make bindings to the IBM ICU library (http://www-306.ibm.com/software/globalization/icu/index.jsp), which is an up-to-date Unicode implementation?

Vitaliy.
2007/9/25, Johan Tibell
I would like to propose a new, ByteString-like, Unicode string library which can be used where both efficiency (currently offered by ByteString) and i18n support (currently offered by vanilla Strings) are needed. [...]

I'll look over the proposal more carefully when I get time, but the most important issue is to not let the storage type leak into the interface.

From an implementation point of view, UTF-16 is the most efficient representation for processing Unicode. It's the native Unicode representation for Windows, Mac OS X, and the ICU open source i18n library. UTF-8 is not very efficient for anything except English. Its most valuable property is compatibility with software that thinks of character strings as byte arrays, and in fact that's why it was invented. UTF-32 is conceptually cleaner, but characters outside the BMP (Basic Multilingual Plane) are rare in actual text, so UTF-16 turns out to be the best combination of space and time efficiency.

Deborah

On Sep 24, 2007, at 3:52 PM, Johan Tibell wrote:
I would like to propose a new, ByteString-like, Unicode string library which can be used where both efficiency (currently offered by ByteString) and i18n support (currently offered by vanilla Strings) are needed. [...]

On 2007-09-26, Deborah Goldsmith
From an implementation point of view, UTF-16 is the most efficient representation for processing Unicode.
This depends on the characteristics of the text being processed. Space-wise, English stays at 1 byte/char in UTF-8. Most European languages go up to at most 2, and on average only a bit above 1. Greek and Cyrillic are 2 bytes/char. It's really only the Asian, African, Arabic, etc., scripts that lose out space-wise. It's true that time-wise there are definite issues in finding character boundaries. -- Aaron Denney -><-
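For concreteness, the code-unit-level boundary test in each encoding looks roughly like this (a sketch over raw Word8/Word16 code units; combining characters, raised below, are a separate question):

    import Data.Bits ((.&.))
    import Data.Word (Word8, Word16)

    -- A UTF-8 byte starts a code point unless it is a continuation
    -- byte of the form 10xxxxxx.
    utf8Starts :: Word8 -> Bool
    utf8Starts w = w .&. 0xC0 /= 0x80

    -- A UTF-16 unit starts a code point unless it is a low (trailing)
    -- surrogate in the range 0xDC00-0xDFFF.
    utf16Starts :: Word16 -> Bool
    utf16Starts w = w < 0xDC00 || w > 0xDFFF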

On Wed, 26 Sep 2007, Aaron Denney wrote:
It's true that time-wise there are definite issues in finding character boundaries.
UTF-16 has no advantage over UTF-8 in this respect, because of surrogate
pairs and combining characters. Code points, characters, and glyphs are
all different things, and it's very difficult to represent the latter two
as anything other than a string of code points.
Tony.
--
f.a.n.finch

On 2007-09-26, Tony Finch
On Wed, 26 Sep 2007, Aaron Denney wrote:
It's true that time-wise there are definite issues in finding character boundaries.
UTF-16 has no advantage over UTF-8 in this respect, because of surrogate pairs and combining characters.
Good point. -- Aaron Denney -><-

On Sep 26, 2007, at 11:06 AM, Aaron Denney wrote:
UTF-16 has no advantage over UTF-8 in this respect, because of surrogate pairs and combining characters.
Good point.
Well, not so much. As Duncan mentioned, it's a matter of what the most common case is. UTF-16 is effectively fixed-width for the majority of text in the majority of languages. Combining sequences and surrogate pairs are relatively infrequent. Speaking as someone who has done a lot of Unicode implementation, I would say UTF-16 represents the best time/space tradeoff for an internal representation. As I mentioned, it's what's used in Windows, Mac OS X, ICU, and Java. Deborah

On 2007-09-27, Deborah Goldsmith
On Sep 26, 2007, at 11:06 AM, Aaron Denney wrote:
UTF-16 has no advantage over UTF-8 in this respect, because of surrogate pairs and combining characters.
Good point.
Well, not so much. As Duncan mentioned, it's a matter of what the most common case is. UTF-16 is effectively fixed-width for the majority of text in the majority of languages. Combining sequences and surrogate pairs are relatively infrequent.
Infrequent, but they exist, which means you can't seek x/2 bytes ahead to seek x characters ahead. All such seeking must be linear for both UTF-16 *and* UTF-8. -- Aaron Denney -><-

On Thu, Sep 27, 2007 at 06:39:24AM +0000, Aaron Denney wrote:
On 2007-09-27, Deborah Goldsmith
wrote: Well, not so much. As Duncan mentioned, it's a matter of what the most common case is. UTF-16 is effectively fixed-width for the majority of text in the majority of languages. Combining sequences and surrogate pairs are relatively infrequent.
Infrequent, but they exist, which means you can't seek x/2 bytes ahead to seek x characters ahead. All such seeking must be linear for both UTF-16 *and* UTF-8.
You could get rapid seeks by ignoring the UTFs and representing strings as sequences of chunks, where each chunk is uniformly 8-bit, 16-bit or 32-bit as required to cover the characters it contains. Hardly anyone would need 32-bit chunks (and some of us would need only the 8-bit ones).
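To make the chunked idea concrete, a rough sketch (the names are made up; a real implementation would use unboxed arrays and cache chunk lengths rather than walking lists):

    import Data.Word (Word8, Word16, Word32)

    data Chunk
      = Chunk8  [Word8]    -- every code point in this chunk fits in 8 bits
      | Chunk16 [Word16]   -- every code point in this chunk fits in 16 bits
      | Chunk32 [Word32]   -- anything else

    newtype ChunkedString = ChunkedString [Chunk]

    chunkLength :: Chunk -> Int
    chunkLength (Chunk8  xs) = length xs
    chunkLength (Chunk16 xs) = length xs
    chunkLength (Chunk32 xs) = length xs

    -- Indexing skips whole chunks by their length, then reads a single
    -- fixed-width unit; no per-character scanning inside a chunk.
    index :: ChunkedString -> Int -> Maybe Char
    index (ChunkedString chunks) = go chunks
      where
        go []     _ = Nothing
        go (c:cs) i
          | i < chunkLength c = Just (at c i)
          | otherwise         = go cs (i - chunkLength c)
        at (Chunk8  xs) i = toEnum (fromIntegral (xs !! i))
        at (Chunk16 xs) i = toEnum (fromIntegral (xs !! i))
        at (Chunk32 xs) i = toEnum (fromIntegral (xs !! i))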

In message
On 2007-09-27, Deborah Goldsmith
wrote: On Sep 26, 2007, at 11:06 AM, Aaron Denney wrote:
UTF-16 has no advantage over UTF-8 in this respect, because of surrogate pairs and combining characters.
Good point.
Well, not so much. As Duncan mentioned, it's a matter of what the most common case is. UTF-16 is effectively fixed-width for the majority of text in the majority of languages. Combining sequences and surrogate pairs are relatively infrequent.
Infrequent, but they exist, which means you can't seek x/2 bytes ahead to seek x characters ahead. All such seeking must be linear for both UTF-16 *and* UTF-8.
And in [Char] for all these years, yet I don't hear people complaining. Most string processing is linear and does not need random access to characters. Duncan

2007/9/27, Duncan Coutts
Infrequent, but they exist, which means you can't seek x/2 bytes ahead to seek x characters ahead. All such seeking must be linear for both UTF-16 *and* UTF-8.
And in [Char] for all these years, yet I don't hear people complaining. Most string processing is linear and does not need random access to characters.
Well, if you never heard anyone complaining about [Char] and never had any problem with its slowness, you're probably not in a field where the efficiency of a Unicode library is really a concern, that's for sure. (I know that the _main_ problem with [Char] wasn't random access, but you must admit [Char] isn't really a good example to bring up when speaking about efficiency problems.) -- Jedaï

Well, if you never heard anyone complaining about [Char] and never had any problem with its slowness, you're probably not in a field where the efficiency of a Unicode library is really a concern, that's for sure. (I know that the _main_ problem with [Char] wasn't random access, but you must admit [Char] isn't really a good example to bring up when speaking about efficiency problems.)
I have problems with [Char] and use ByteString instead, but that forces me to keep track of the encoding myself; hence UnicodeString.

On 2007-09-27, Duncan Coutts
In message
wnoise@ofb.net writes: On 2007-09-27, Deborah Goldsmith
wrote: On Sep 26, 2007, at 11:06 AM, Aaron Denney wrote:
UTF-16 has no advantage over UTF-8 in this respect, because of surrogate pairs and combining characters.
Good point.
Well, not so much. As Duncan mentioned, it's a matter of what the most common case is. UTF-16 is effectively fixed-width for the majority of text in the majority of languages. Combining sequences and surrogate pairs are relatively infrequent.
Infrequent, but they exist, which means you can't seek x/2 bytes ahead to seek x characters ahead. All such seeking must be linear for both UTF-16 *and* UTF-8.
And in [Char] for all these years, yet I don't hear people complaining. Most string processing is linear and does not need random access to characters.
Yeah. I'm saying the differences between them are going to be in the constant factors, and that these constant factors will differ between workloads. -- Aaron Denney -><-

On 2007-09-27, Aaron Denney
On 2007-09-27, Deborah Goldsmith
wrote: On Sep 26, 2007, at 11:06 AM, Aaron Denney wrote:
UTF-16 has no advantage over UTF-8 in this respect, because of surrogate pairs and combining characters.
Good point.
Well, not so much. As Duncan mentioned, it's a matter of what the most common case is. UTF-16 is effectively fixed-width for the majority of text in the majority of languages. Combining sequences and surrogate pairs are relatively infrequent.
Infrequent, but they exist, which means you can't seek x/2 bytes ahead to seek x characters ahead. All such seeking must be linear for both UTF-16 *and* UTF-8.
Speaking as someone who has done a lot of Unicode implementation, I would say UTF-16 represents the best time/space tradeoff for an internal representation. As I mentioned, it's what's used in Windows, Mac OS X, ICU, and Java.
I guess why I'm being something of a pain-in-the-ass here is that I want to use your Unicode implementation expertise to know what these time/space tradeoffs are.

Are there any algorithmic asymptotic complexity differences, or are these all constant factors? The constant factors depend on projected workload. And are these actually tradeoffs, except between UTF-32 (which uses native word sizes on 32-bit platforms) and the other two? Smaller space means a smaller cache footprint, which can dominate.

Simplicity of algorithms is also a concern. Validating a byte sequence as UTF-8 is harder than validating a sequence of 16-bit values as UTF-16.

(I'd also like to see a reference to the Mac OS X encoding. I know that the filesystem interface is UTF-8 (decomposed a certain way). Is it just that UTF-16 is a common application choice, or is there some common framework or library that uses it?) -- Aaron Denney -><-
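To illustrate the validation point: well-formed UTF-16 is a short case analysis over surrogates (a sketch below, over a plain list of code units), while a full UTF-8 validator additionally has to reject stray continuation bytes, overlong forms, encoded surrogates and values above U+10FFFF.

    import Data.Word (Word16)

    validUTF16 :: [Word16] -> Bool
    validUTF16 []     = True
    validUTF16 (w:ws)
      | w < 0xD800 || w > 0xDFFF = validUTF16 ws   -- ordinary BMP code unit
      | w <= 0xDBFF =                              -- high surrogate: needs a partner
          case ws of
            (w2:rest) | w2 >= 0xDC00 && w2 <= 0xDFFF -> validUTF16 rest
            _                                        -> False
      | otherwise = False                           -- unpaired low surrogate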

Sorry for the long delay, work has been really busy... On Sep 27, 2007, at 12:25 PM, Aaron Denney wrote:
On 2007-09-27, Aaron Denney
wrote: Well, not so much. As Duncan mentioned, it's a matter of what the most common case is. UTF-16 is effectively fixed-width for the majority of text in the majority of languages. Combining sequences and surrogate pairs are relatively infrequent.
Infrequent, but they exist, which means you can't seek x/2 bytes ahead to seek x characters ahead. All such seeking must be linear for both UTF-16 *and* UTF-8.
Speaking as someone who has done a lot of Unicode implementation, I would say UTF-16 represents the best time/space tradeoff for an internal representation. As I mentioned, it's what's used in Windows, Mac OS X, ICU, and Java.
I guess why I'm being something of a pain-in-the-ass here, is that I want to use your Unicode implementation expertise to know what these time/space tradeoffs are.
Are there any algorithmic asymptotic complexity differences, or are these all constant factors? The constant factors depend on projected workload. And are these actually tradeoffs, except between UTF-32 (which uses native word sizes on 32-bit platforms) and the other two? Smaller space means a smaller cache footprint, which can dominate.
Yes, cache footprint is one reason to use UTF-16 rather than UTF-32. Having no surrogate pairs also doesn't save you anything because you need to handle sequences anyway, such as combining marks and clusters. The best reference for all of this is: http://www.unicode.org/faq/utf_bom.html See especially: http://www.unicode.org/faq/utf_bom.html#10 http://www.unicode.org/faq/utf_bom.html#12 Which data type is best depends on what the purpose is. If the data will primarily be ASCII with an occasional non-ASCII characters, UTF-8 may be best. If the data is general Unicode text, UTF-16 is best. I would think a Unicode string type would be intended for processing natural language text, not just ASCII data.
Simplicity of algorithms is also a concern. Validating a byte sequence as UTF-8 is harder than validating a sequence of 16-bit values as UTF-16.
(I'd also like to see a reference to the Mac OS X encoding. I know that the filesystem interface is UTF-8 (decomposed a certain way). Is it just that UTF-16 is a common application choice, or is there some common framework or library that uses it?)
UTF-16 is the native encoding used for Cocoa, Java, ICU, and Carbon, and is what appears in the APIs for all of them. UTF-16 is also what's stored in the volume catalog on Mac disks. UTF-8 is only used in BSD APIs for backward compatibility. It's also used in plain text files (or XML or HTML), again for compatibility. Deborah

Deborah Goldsmith wrote:
UTF-16 is the native encoding used for Cocoa, Java, ICU, and Carbon, and is what appears in the APIs for all of them. UTF-16 is also what's stored in the volume catalog on Mac disks. UTF-8 is only used in BSD APIs for backward compatibility. It's also used in plain text files (or XML or HTML), again for compatibility.
Deborah
On OS X, Cocoa and Carbon use Core Foundation, whose API does not have a one-true-encoding internally. Follow the rather long URL for details: http://developer.apple.com/documentation/CoreFoundation/Conceptual/CFStrings...

I would vote for an API that not just hides the internal store, but allows different internal stores to be used in a mostly compatible way.

However, there is a UniChar typedef on OS X which is the same unsigned 16-bit integer as Java's JNI would use.

-- Chris

On Oct 2, 2007, at 5:11 AM, ChrisK wrote:
Deborah Goldsmith wrote:
UTF-16 is the native encoding used for Cocoa, Java, ICU, and Carbon, and is what appears in the APIs for all of them. UTF-16 is also what's stored in the volume catalog on Mac disks. UTF-8 is only used in BSD APIs for backward compatibility. It's also used in plain text files (or XML or HTML), again for compatibility.
Deborah
On OS X, Cocoa and Carbon use Core Foundation, whose API does not have a one-true-encoding internally. Follow the rather long URL for details:
http://developer.apple.com/documentation/CoreFoundation/Conceptual/CFStrings/index.html?http://developer.apple.com/documentation/CoreFoundation/Conceptual/CFStrings/Articles/StringStorage.html#//apple_ref/doc/uid/20001179
I would vote for an API that not just hides the internal store, but allows different internal stores to be used in a mostly compatible way.
However, there is a UniChar typedef on OS X which is the same unsigned 16-bit integer as Java's JNI would use.
UTF-16 is the type used in all the APIs. Everything else is considered an encoding conversion. CoreFoundation uses UTF-16 internally except when the string fits entirely in a single-byte legacy encoding like MacRoman or MacCyrillic. If any kind of Unicode processing needs to be done to the string, it is first coerced to UTF-16. If it weren't for backwards compatibility issues, I think we'd use UTF-16 all the time as the machinery for switching encodings adds complexity. I wouldn't advise it for a new library. Deborah

On Tue, 2007-10-02 at 08:02 -0700, Deborah Goldsmith wrote:
On Oct 2, 2007, at 5:11 AM, ChrisK wrote:
Deborah Goldsmith wrote:
UTF-16 is the native encoding used for Cocoa, Java, ICU, and Carbon, and is what appears in the APIs for all of them. UTF-16 is also what's stored in the volume catalog on Mac disks. UTF-8 is only used in BSD APIs for backward compatibility. It's also used in plain text files (or XML or HTML), again for compatibility.
Deborah
On OS X, Cocoa and Carbon use Core Foundation, whose API does not have a one-true-encoding internally. Follow the rather long URL for details:
http://developer.apple.com/documentation/CoreFoundation/Conceptual/CFStrings/index.html?http://developer.apple.com/documentation/CoreFoundation/Conceptual/CFStrings/Articles/StringStorage.html#//apple_ref/doc/uid/20001179
I would vote for an API that not just hides the internal store, but allows different internal stores to be used in a mostly compatible way.
However, there is a UniChar typedef on OS X which is the same unsigned 16-bit integer as Java's JNI would use.
UTF-16 is the type used in all the APIs. Everything else is considered an encoding conversion.
CoreFoundation uses UTF-16 internally except when the string fits entirely in a single-byte legacy encoding like MacRoman or MacCyrillic. If any kind of Unicode processing needs to be done to the string, it is first coerced to UTF-16. If it weren't for backwards compatibility issues, I think we'd use UTF-16 all the time as the machinery for switching encodings adds complexity. I wouldn't advise it for a new library.
I would like to, again, strongly argue against sacrificing compatibility with Linux/BSD/etc. for the sake of compatibility with OS X or Windows. FFI bindings have to convert data formats in any case; Haskell shouldn't gratuitously break Linux support (or make life harder on Linux) just to support proprietary operating systems better. Now, if /independent of the details of MacOS X/, UTF-16 is better (objectively), it can be converted to anything by the FFI. But doing it the way Java or MacOS X or Win32 or anyone else does it, at the expense of Linux, I am strongly opposed to. jcc

On Oct 2, 2007, at 8:44 AM, Jonathan Cast wrote:
I would like to, again, strongly argue against sacrificing compatibility with Linux/BSD/etc. for the sake of compatibility with OS X or Windows. FFI bindings have to convert data formats in any case; Haskell shouldn't gratuitously break Linux support (or make life harder on Linux) just to support proprietary operating systems better.
Now, if /independent of the details of MacOS X/, UTF-16 is better (objectively), it can be converted to anything by the FFI. But doing it the way Java or MacOS X or Win32 or anyone else does it, at the expense of Linux, I am strongly opposed to.
No one is advocating that. Any Unicode support library needs to support exporting text as UTF-8 since it's so widely used. It's used on Mac OS X, too, in exactly the same contexts it would be used on Linux. However, UTF-8 is a poor choice for internal representation. On Oct 2, 2007, at 2:32 PM, Stefan O'Rear wrote:
UTF-8 supports CJK languages too. The only question is efficiency, and I believe CJK is still a relatively uncommon case compared to English and other Latin-alphabet languages. (That said, I live in a country all of whose dominant languages use the Latin alphabet)
First of all, non-Latin countries already represent a large fraction of computer usage and the computer market. It is not at all "relatively uncommon." Japan alone is a huge market. China is a huge market. Second, it's not just CJK, but anything that's not mostly ASCII. Russian, Greek, Thai, Arabic, Hebrew, etc. etc. etc. UTF-8 is intended for compatibility with existing software that expects multibyte encodings. It doesn't work well as an internal representation. Again, no one is saying a Unicode library shouldn't have full support for input and output of UTF-8 (and other encodings). If you want to process ASCII text and squeeze out every last ounce of performance, use byte strings. Unicode strings should be optimized for representing and processing human language text, a large share of which is not in the Latin alphabet. Remember, speakers of English and other Latin-alphabet languages are a minority in the world, though not in the computer-using world. Yet. Deborah

On Tue, Oct 02, 2007 at 08:02:30AM -0700, Deborah Goldsmith wrote:
UTF-16 is the type used in all the APIs. Everything else is considered an encoding conversion.
CoreFoundation uses UTF-16 internally except when the string fits entirely in a single-byte legacy encoding like MacRoman or MacCyrillic. If any kind of Unicode processing needs to be done to the string, it is first coerced to UTF-16. If it weren't for backwards compatibility issues, I think we'd use UTF-16 all the time as the machinery for switching encodings adds complexity. I wouldn't advise it for a new library.
I do not believe that anyone was seriously advocating multiple blessed encodings. The main question is *which* encoding to bless. 99+% of text I encounter is in US-ASCII, so I would favor UTF-8. Why is UTF-16 better for me? Stefan

I do not believe that anyone was seriously advocating multiple blessed encodings. The main question is *which* encoding to bless. 99+% of text I encounter is in US-ASCII, so I would favor UTF-8. Why is UTF-16 better for me?
All software I write professionally has to support 40 languages (including CJK ones), so I would prefer UTF-16 in case I could use Haskell at work some day in the future. I'm not sure that who uses which encoding the most is good grounds for picking an encoding, though. Ease of implementation and speed on some representative sample set of text may be. -- Johan

On Tue, Oct 02, 2007 at 11:05:38PM +0200, Johan Tibell wrote:
I do not believe that anyone was seriously advocating multiple blessed encodings. The main question is *which* encoding to bless. 99+% of text I encounter is in US-ASCII, so I would favor UTF-8. Why is UTF-16 better for me?
All software I write professionally has to support 40 languages (including CJK ones), so I would prefer UTF-16 in case I could use Haskell at work some day in the future. I'm not sure that who uses which encoding the most is good grounds for picking an encoding, though. Ease of implementation and speed on some representative sample set of text may be.
UTF-8 supports CJK languages too. The only question is efficiency, and I believe CJK is still a relatively uncommon case compared to English and other Latin-alphabet languages. (That said, I live in a country all of whose dominant languages use the Latin alphabet) Stefan

Stefan O'Rear wrote:
On Tue, Oct 02, 2007 at 11:05:38PM +0200, Johan Tibell wrote:
I do not believe that anyone was seriously advocating multiple blessed encodings. The main question is *which* encoding to bless. 99+% of text I encounter is in US-ASCII, so I would favor UTF-8. Why is UTF-16 better for me? All software I write professionally has to support 40 languages (including CJK ones), so I would prefer UTF-16 in case I could use Haskell at work some day in the future. I'm not sure that who uses which encoding the most is good grounds for picking an encoding, though. Ease of implementation and speed on some representative sample set of text may be.
UTF-8 supports CJK languages too. The only question is efficiency
Due to the additional complexity of handling UTF-8 -- EVEN IF the actual text processed happens all to be US-ASCII -- will UTF-8 perhaps be less efficient than UTF-16, or only as fast? Isaac

On Oct 2, 2007, at 21:12 , Isaac Dupree wrote:
Stefan O'Rear wrote:
On Tue, Oct 02, 2007 at 11:05:38PM +0200, Johan Tibell wrote:
I do not believe that anyone was seriously advocating multiple blessed encodings. The main question is *which* encoding to bless. 99+% of text I encounter is in US-ASCII, so I would favor UTF-8. Why is UTF-16 better for me? All software I write professionally has to support 40 languages (including CJK ones), so I would prefer UTF-16 in case I could use Haskell at work some day in the future. I'm not sure that who uses which encoding the most is good grounds for picking an encoding, though. Ease of implementation and speed on some representative sample set of text may be. UTF-8 supports CJK languages too. The only question is efficiency
Due to the additional complexity of handling UTF-8 -- EVEN IF the actual text processed happens all to be US-ASCII -- will UTF-8 perhaps be less efficient than UTF-16, or only as fast?
UTF8 will be very slightly faster in the all-ASCII case, but quickly blows chunks if you have *any* characters that require multibyte. Given the way UTF8 encoding works, this includes even Latin-1 non-ASCII, never mind CJK. (I think people have been missing that point. UTF8 is only cheap for 00-7f, *nothing else*.)

-- 
brandon s. allbery [solaris,freebsd,perl,pugs,haskell] allbery@kf8nh.com
system administrator [openafs,heimdal,too many hats] allbery@ece.cmu.edu
electrical and computer engineering, carnegie mellon university KF8NH
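For reference, the per-code-point byte counts behind this point (a small sketch; the ranges are the standard UTF-8 ones):

    -- Number of bytes a code point occupies in UTF-8.
    utf8Length :: Char -> Int
    utf8Length c
      | n <= 0x7F   = 1   -- ASCII, and only ASCII
      | n <= 0x7FF  = 2   -- Latin-1 beyond ASCII, Greek, Cyrillic, Arabic, ...
      | n <= 0xFFFF = 3   -- most CJK, among much else
      | otherwise   = 4   -- supplementary planes
      where n = fromEnum c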

On Tue, 2007-10-02 at 21:45 -0400, Brandon S. Allbery KF8NH wrote:
Due to the additional complexity of handling UTF-8 -- EVEN IF the actual text processed happens all to be US-ASCII -- will UTF-8 perhaps be less efficient than UTF-16, or only as fast?
UTF8 will be very slightly faster in the all-ASCII case, but quickly blows chunks if you have *any* characters that require multibyte.
What benchmarks are you basing this on? Doubling your data size is going to cost you if you are doing simple operations (searching, say), but I don't see UTF-8 being particularly expensive - somebody (forget who) implemented UTF-8 on top of ByteString, and IIRC, the benchmarks numbers didn't change all that much from the regular Char8. -k

On Tue, 2007-10-02 at 14:32 -0700, Stefan O'Rear wrote:
UTF-8 supports CJK languages too. The only question is efficiency, and I believe CJK is still a relatively uncommon case compared to English and other Latin-alphabet languages. (That said, I live in a country all of whose dominant languages use the Latin alphabet)
As for space efficiency, I guess the argument could be made that since an ideogram typically conveys a whole word, it is reasonably to spend more bits for it. Anyway, I am unsure if I should take part in this discussion, as I'm not really dealing with text as such in multiple languages. Most of my data is in ASCII, and when they are not, I'm happy to treat it ("treat" here meaning "mostly ignore") as Latin1 bytes (current ByteString) or UTF-8. The only thing I miss is the ability to use String syntactic sugar -- but IIUC, that's coming? However, increased space usage is not acceptable, and I also don't want any conversion layer which could conceivably modify my data (e.g. by normalizing or error handling). -k
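(The literal syntax alluded to here is presumably what later arrived as GHC's OverloadedStrings extension; a minimal sketch, with a toy packed type standing in for ByteString or a future UnicodeString:)

    {-# LANGUAGE OverloadedStrings #-}

    import Data.String (IsString (..))

    -- Toy packed type; a real library would wrap a byte array.
    newtype Packed = Packed [Char]
      deriving Show

    instance IsString Packed where
      fromString = Packed

    greeting :: Packed
    greeting = "hello, world"   -- desugared to: fromString "hello, world"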

Lots of people wrote:
I want a UTF-8 bikeshed! No, I want a UTF-16 bikeshed!
What the heck does it matter what encoding the library uses internally? I expect the interface to be something like (from my own CompactString library):
    fromByteString :: Encoding -> ByteString -> UnicodeString
    toByteString   :: Encoding -> UnicodeString -> ByteString

The only issue is efficiency for a particular encoding.

I would suggest that we get a working library first. Either UTF-8 or UTF-16 will do, as long as it works. Even better would be to implement both (and perhaps more encodings), and then benchmark them to get a sensible default. Then the choice can be made available to the user as well, in case someone has specific needs. But again: get it working first!

Twan
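With an interface like that, re-encoding is just composition. A sketch against the signatures above, assuming the Encoding type has constructors such as UTF8 and UTF16:

    -- Re-encode a UTF-16 byte string as UTF-8 via the abstract Unicode type.
    transcode :: ByteString -> ByteString
    transcode = toByteString UTF8 . fromByteString UTF16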

On Oct 2, 2007, at 3:01 PM, Twan van Laarhoven wrote:
Lots of people wrote:
I want a UTF-8 bikeshed! No, I want a UTF-16 bikeshed!
What the heck does it matter what encoding the library uses internally? I expect the interface to be something like (from my own CompactString library):
fromByteString :: Encoding -> ByteString -> UnicodeString
toByteString   :: Encoding -> UnicodeString -> ByteString
I agree, from an API perspective the internal encoding doesn't matter.
The only issue is efficiency for a particular encoding.
This matters a lot.
I would suggest that we get a working library first. Either UTF-8 or UTF-16 will do, as long as it works.
Even better would be to implement both (and perhaps more encodings), and then benchmark them to get a sensible default. Then the choice can be made available to the user as well, in case someone has specific needs. But again: get it working first!
The problem is that the internal encoding can have a big effect on the implementation of the library. It's better not to have to do it over again if the first choice is not optimal. I'm just trying to share the experience of the Unicode Consortium, the ICU library contributors, and Apple, with the Haskell community. They, and I personally, have many years of experience implementing support for Unicode. Anyway, I think we're starting to repeat ourselves... Deborah

On Wed, Oct 03, 2007 at 12:01:50AM +0200,
Twan van Laarhoven
Lots of people wrote:
I want a UTF-8 bikeshed! No, I want a UTF-16 bikeshed!
Personally, I want a UTF-32 bikeshed. UTF-16 is as lousy as UTF-8 (for both of them, characters have different sizes, unlike what happens in UTF-32).
What the heck does it matter what encoding the library uses internally?
+1 It can even use a non-standard encoding scheme if it wants.

What the heck does it matter what encoding the library uses internally?
+1 It can even use a non-standard encoding scheme if it wants.
Sounds good to me. I think one of my initial questions was whether the encoding should be visible in the UnicodeString type or not. My gut feeling is that having the encoding visible might make it hard to change the internal representation, but I haven't yet got a good example to prove this. -- Johan
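A small illustration of that worry (all names hypothetical): once the encoding is a visible type parameter, it shows up in user-facing signatures, whereas an abstract type keeps it private.

    import qualified Data.ByteString as B
    import Data.ByteString (ByteString)

    -- Encoding exposed as a phantom parameter:
    data UTF8
    data UTF16
    newtype EncodedString enc = EncodedString ByteString

    -- A user function written against today's default representation
    -- mentions UTF8 in its type, so switching the library's internal
    -- representation later changes this signature.
    byteCount :: EncodedString UTF8 -> Int
    byteCount (EncodedString bs) = B.length bs

    -- Encoding hidden behind an abstract type (constructor not exported):
    newtype UnicodeString = UnicodeString ByteString
    -- User code mentions only UnicodeString; the internals can change freely.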

On Wed, 2007-10-03 at 14:15 +0200, Stephane Bortzmeyer wrote:
On Wed, Oct 03, 2007 at 12:01:50AM +0200, Twan van Laarhoven
wrote a message of 24 lines which said: Lots of people wrote:
I want a UTF-8 bikeshed! No, I want a UTF-16 bikeshed!
Personally, I want a UTF-32 bikeshed. UTF-16 is as lousy as UTF-8 (for both of them, characters have different sizes, unlike what happens in UTF-32).
+1
What the heck does it matter what encoding the library uses internally?
+1 It can even use a non-standard encoding scheme if it wants.
+3 jcc

On Wed, Sep 26, 2007 at 11:25:30AM +0100, Tony Finch wrote:
On Wed, 26 Sep 2007, Aaron Denney wrote:
It's true that time-wise there are definite issues in finding character boundaries.
UTF-16 has no advantage over UTF-8 in this respect, because of surrogate pairs and combining characters.
Combining characters are not an issue here, just the surrogate pairs, because we're discussing representations of sequences of Chars (Unicode code points).

On 2007-09-27, Ross Paterson
Combining characters are not an issue here, just the surrogate pairs, because we're discussing representations of sequences of Chars (Unicode code points).
You'll never want to combine combining characters or vice-versa? Never want to figure out how much screen space a sequence will take? It _is_ an issue. -- Aaron Denney -><-

On Thu, Sep 27, 2007 at 07:26:07AM +0000, Aaron Denney wrote:
On 2007-09-27, Ross Paterson
wrote: Combining characters are not an issue here, just the surrogate pairs, because we're discussing representations of sequences of Chars (Unicode code points).
You'll never want to combine combining characters or vice-versa? Never want to figure out how much screen space a sequence will take? It _is_ an issue.
It's an issue for a higher layer, not for a compact String representation.

On 2007-09-27, Ross Paterson
On Thu, Sep 27, 2007 at 07:26:07AM +0000, Aaron Denney wrote:
On 2007-09-27, Ross Paterson
wrote: Combining characters are not an issue here, just the surrogate pairs, because we're discussing representations of sequences of Chars (Unicode code points).
You'll never want to combine combining characters or vice-versa? Never want to figure out how much screen space a sequence will take? It _is_ an issue.
It's an issue for a higher layer, not for a compact String representation.
Yes, and no. It's not something the lower layer should be doing, but enabling the higher layers to do so efficiently is a concern. -- Aaron Denney -><-

On Thu, 27 Sep 2007, Ross Paterson wrote:
Combining characters are not an issue here, just the surrogate pairs, because we're discussing representations of sequences of Chars (Unicode code points).
I dislike referring to unicode code points as "characters" because that
tends to imply a lot of invalid simplifications.
Tony.
--
f.a.n.finch

In message
On Thu, 27 Sep 2007, Ross Paterson wrote:
Combining characters are not an issue here, just the surrogate pairs, because we're discussing representations of sequences of Chars (Unicode code points).
I dislike referring to unicode code points as "characters" because that tends to imply a lot of invalid simplifications.
Just to be pedantic, Ross did say Char not character. A Char is defined in the Haskell report as a Unicode code point. As you say, that does not directly correspond to what many people think of as a character due to combining characters etc. Duncan

I have often found myself wishing for a small extension to the syntax of Haskell 'data' declarations. It goes like this:

    data <as usual>
       = <as usual>
       | ...
       | <as usual>
 +++   where type <tvar> = <type>
             type <tvar> = <type>
             ...
       deriving <as usual>

Even something like binary search trees would, to me, be clearer as

    data BST key val
       = Empty
       | Fork key val bst bst
       where type bst = BST key val

because this establishes an *essential* identity between the 3rd and 4th Fork argument types, rather than an "accidental" identity. I can't set this up using 'type' outside the 'data' declaration, because it has to refer to the type arguments of BST.

Semantically, this is just an abbreviation mechanism with no consequences of any kind outside the 'data' declaration itself. The only point would be to make reading and writing 'data' declarations easier, especially large ones.

On 9/27/07, ok
I have often found myself wishing for a small extension to the syntax of Haskell 'data' declarations. It goes like this: ['where' clause to allow locally defined names in type declarations]
Nice. Quite a few times I've found myself declaring type synonyms for this reason, but you end up polluting the global namespace. +1 vote. -- Thomas Conway drtomc@gmail.com Silence is the perfectest herald of joy: I were but little happy, if I could say how much.

Thomas Conway writes:
On 9/27/07, ok
wrote: I have often found myself wishing for a small extension to the syntax of Haskell 'data' declarations. It goes like this: ['where' clause to allow locally defined names in type declarations]
Nice.
Quite a few times I've found myself declaring type synonyms for this reason, but you end up polluting the global namespace.
+1 vote.
Data with where? You haven't heard about GADTs? http://en.wikibooks.org/wiki/Haskell/GADT http://www.haskell.org/haskellwiki/Generalised_algebraic_datatype Jerzy Karczmarczuk

On 9/27/07, jerzy.karczmarczuk@info.unicaen.fr
Thomas Conway writes:
On 9/27/07, ok
wrote: I have often found myself wishing for a small extension to the syntax of Haskell 'data' declarations. It goes like this: ['where' clause to allow locally defined names in type declarations]
Nice.
Quite a few times I've found myself declaring type synonyms for this reason, but you end up polluting the global namespace.
+1 vote.
Data with where? You haven't heard about GADTs?
I think that you haven't read the question carefully, because "where" in GADTs is simply syntactic sugar. However, this seems to be available already with GADTs and type equality constraints:

    data BST key val where
      Empty :: BST key val
      Fork  :: (bst ~ BST key val) => key -> val -> bst -> bst -> BST key val

It's a pity you can't use bst (or a type synonym) instead of the last "BST key val".

Best regards
Tomasz

Tomasz Zielonka wrote:
On 9/27/07, jerzy.karczmarczuk@info.unicaen.fr
wrote: Thomas Conway writes:
On 9/27/07, ok
wrote: I have often found myself wishing for a small extension to the syntax of Haskell 'data' declarations. It goes like this: ['where' clause to allow locally defined names in type declarations]
Nice.
Quite a few times I've found myself declaring type synonyms for this reason, but you end up polluting the global namespace.
+1 vote. Data with where? You haven't heard about GADTs?
I think that you haven't read the question carefully, because "where" in GADTs is simply a syntactic sugar. However, this seems to be available already with GADTs and type equality constraints:
data BST key val where
  Empty :: BST key val
  Fork  :: (bst ~ BST key val) => key -> val -> bst -> bst -> BST key val
It's a pity you can't use bst (or a type synonym) instead of the last "BST key val".
Indeed. GADT syntax looks like a type signature (except for strictness annotations, which presently aren't part of function syntax!), but apparently the (->)s and result type aren't treated as an ordinary type signature, because type synonyms can't be used for them. I tried (because there were several GADT constructors with slightly different signatures, so I made a type synonym with an argument to try to shorten them). It seems a pity to me too.

Isaac

jerzy.karczmarczuk@info.unicaen.fr wrote:
Data with where? You haven't heard about GADTs?
To avoid clashing with GADT's "where", I propose to rename ok's keyword to "wherein", or "wheretype", or something data B k v = E | F b b wherein type b = B k v data B k v = E | F b b wheretype b = B k v (I also propose that ok should not just take an existing unrelated thread like "Unicode string library", click "reply", and herein talk about a new topic; but rather, should take the necessary extra effort to start a new thread altogether.)

On 9/27/07, Albert Y. C. Lai
jerzy.karczmarczuk@info.unicaen.fr wrote:
Data with where? You haven't heard about GADTs?
To avoid clashing with GADT's "where", I propose to rename ok's keyword to "wherein", or "wheretype", or something
data B k v = E | F b b wherein type b = B k v
data B k v = E | F b b wheretype b = B k v
I'm not sure there is a clash.
data B k v where ...
is easily distinguished from
data B k v = ... where ...
--
Dave Menendez

On 9/28/07, David Menendez
I'm not sure there is a clash.
data B k v where ...
is easily distinguished from
data B k v = ... where ...
Indeed. Although Richard's proposal was simpler, I reckon it's worth discussing whether the where clause should allow normal type/data/newtype declarations, effectively introducing a new scope. There are obviously some type variable quantification and name resolution issues that should yield several conference papers. Here are a couple of examples:

    data Tree key val
      = Leaf key val
      | Node BST key val BST
      where type BST = Tree key val

    data RelaxedTree key val
      = Leaf Bal [(key,val)]
      | Node Bal [(key,RelaxedTree key val)]
      where data Bal = Balanced | Unbalanced

-- 
Thomas Conway
drtomc@gmail.com
Silence is the perfectest herald of joy: I were but little happy, if I could say how much.

Thomas Conway wrote:
Although Richard's proposal was simpler, I reckon it's worth discussing whether the where clause should allow normal type/data/newtype declarations, effectively introducing a new scope. There are obviously some type variable quantification and name resolution issues that should yield several conference papers.
    data RelaxedTree key val
      = Leaf Bal [(key,val)]
      | Node Bal [(key,RelaxedTree key val)]
      where data Bal = Balanced | Unbalanced

Is Bal visible outside data RelaxedTree? If so, why not put it at the top level? If not, are Balanced and Unbalanced visible? If not, then there is no way to construct a RelaxedTree. If so, then you could not give a type annotation to x = Balanced.

    data Tree key val
      = Leaf key val
      | Node BST key val BST
      where type BST = Tree key val

The type synonym example is much easier because it is effectively syntactic sugar, and although BST is not visible, Tree key val is. But is let allowed as well, if we want to restrict the visibility of BST to just the Node constructor? Is a type synonym of a type variable OK?

    data Tree key val
      = let BST = key in Leaf BST val   -- perversely called BST
      | let BST = Tree key val in Node BST key val BST

On 28 Sep 2007, at 10:01 am, Thomas Conway wrote:
    data Tree key val
      = Leaf key val
      | Node BST key val BST
      where type BST = Tree key val

    data RelaxedTree key val
      = Leaf Bal [(key,val)]
      | Node Bal [(key,RelaxedTree key val)]
      where data Bal = Balanced | Unbalanced
My proposal was deliberately rather limited. My feeling was that if there is a constructor (like Balanced, Unbalanced) then I want it to belong to a module-scope type name.

What I'm looking for is something that provides
(1) an easily understood way of abbreviating repeated types in a data, type, or newtype declaration
(2) and using them *uniformly* throughout such a declaration (which is why GADTs don't help)
(3) to reduce the incidence of errors
(4) and clarify the programmer's intent in much the same way as field names do (but as a complementary, not a rival technique)
(5) and above all, to simplify maintenance.

The thing that got me thinking about this is my continuing attempt to write a compiler (to C) for a legacy language in Haskell. I start out with a simple AST data type, adequate for testing the grammar. And then I start adding semantic information to the nodes, and suddenly I find myself adding extra fields all over the place.

Now there's a paper that was mentioned about a month ago in this mailing list which basically dealt with that by splitting each type into two: roughly speaking a bit that expresses the recursion and a bit that expresses the choice structure. My feeling about that was that while it is a much more powerful and general technique, it isn't as easy to get your head around as a single-level solution.

Here's a trivial example. Parser-only version:

    newtype Var = Var String

    data Expr
      = Variable Var
      | Constant Int
      | Unary String Expr
      | Binary String Expr Expr

Revised version:

    data Var env = Var env String

    data Expr env
      = Variable (Var env)
      | Constant Int
      | Unary String (Expr env)
      | Binary String (Expr env) (Expr env)

Now let's do Expr using my proposal:

    data Expr
      = Variable var
      | Constant Int
      | Unary String expr
      | Binary String expr expr
      where type var  = Var
            type expr = Expr

(obtained from the first parser-only version by lower-casing the type names) becoming

  * data Expr env
      = Variable var
      | Constant Int
      | Unary String expr
      | Binary String expr expr
  *   where type var  = Var env
  *         type expr = Expr env

To my mind it's clearer to see 'expr' repeated than '(Expr env)' repeated, as you don't have to keep checking that the argument is the same.

I'm not wedded to this scheme. It's the simplest thing I can think of that will do the job. But the Haskell spirit, if I may say so, seems to be to look for the simplest thing that can do the job at hand and a whole lot more in a principled way. What I'm looking for in a better counter-proposal is something that makes it this easy or easier to revise and extend a type. Perhaps a variation on GADTs would be the way to go. I don't know.

On 9/28/07, ok
Now there's a paper that was mentioned about a month ago in this mailing list which basically dealt with that by splitting each type into two: roughly speaking a bit that expresses the recursion and a bit that expresses the choice structure.
Would you like to give a link to that paper?

(The following is a bit off-topic.) In the 1995 paper [1], "Bananas in Space: Extending Fold and Unfold to Exponential Types", Erik Meijer and Graham Hutton showed an interesting technique. Your ADT:

    data Expr env
      = Variable (Var env)
      | Constant Int
      | Unary String (Expr env)
      | Binary String (Expr env) (Expr env)

can be written without recursion by using a fixpoint newtype combinator (not sure if this is the right name for it):

    newtype Rec f = In { out :: f (Rec f) }

    data Var env = Var env String

    data E env e
      = Variable (Var env)
      | Constant Int
      | Unary String e
      | Binary String e e

    type Expr env = Rec (E env)

    example = In (Binary "+" (In (Constant 1)) (In (Constant 2)))

You can see that you don't have to name the recursive 'Expr env' explicitly. However, constructing an 'Expr' is a bit verbose because of the 'In' newtype constructors.

regards,

Bas van Dijk

[1] http://citeseer.ist.psu.edu/293490.html
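As a follow-up sketch of where this representation pays off (the Functor instance and the evaluator below are illustrative additions, not from the paper): the recursion is handled once, by a generic fold over Rec.

    -- Generic fold (catamorphism) over the fixpoint type.
    cata :: Functor f => (f a -> a) -> Rec f -> a
    cata phi = phi . fmap (cata phi) . out

    instance Functor (E env) where
      fmap _ (Variable v)    = Variable v
      fmap _ (Constant i)    = Constant i
      fmap f (Unary op e)    = Unary op (f e)
      fmap f (Binary op l r) = Binary op (f l) (f r)

    -- Evaluate constant expressions; Nothing for variables and unknown operators.
    eval :: Expr env -> Maybe Int
    eval = cata step
      where
        step (Constant i)     = Just i
        step (Unary "-" e)    = negate <$> e
        step (Binary "+" l r) = (+) <$> l <*> r
        step _                = Nothing

    -- eval example == Just 3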

I'll look over the proposal more carefully when I get time, but the most important issue is to not let the storage type leak into the interface.
Agreed,
From an implementation point of view, UTF-16 is the most efficient representation for processing Unicode. It's the native Unicode representation for Windows, Mac OS X, and the ICU open source i18n library. UTF-8 is not very efficient for anything except English. Its most valuable property is compatibility with software that thinks of character strings as byte arrays, and in fact that's why it was invented.
If UTF-16 is what's used by everyone else (how about Java? Python?) I think that's a strong reason to use it. I don't know Unicode well enough to say otherwise.

On 2007-09-26, Johan Tibell
If UTF-16 is what's used by everyone else (how about Java? Python?) I think that's a strong reason to use it. I don't know Unicode well enough to say otherwise.
The internal representations don't matter except in the case of making FFI linkages. The external representations do, and UTF-8 has won on that front. -- Aaron Denney -><-

On 9/26/07, Aaron Denney
On 2007-09-26, Johan Tibell
wrote: If UTF-16 is what's used by everyone else (how about Java? Python?) I think that's a strong reason to use it. I don't know Unicode well enough to say otherwise.
The internal representations don't matter except in the case of making FFI linkages. The external representations do, and UTF-8 has won on that front.
It could matter for performance. However, you can encode your UnicodeString into any external representation you want for your I/O needs, including UTF-8.

On 2007-09-26, Johan Tibell
On 9/26/07, Aaron Denney
wrote: On 2007-09-26, Johan Tibell
wrote: If UTF-16 is what's used by everyone else (how about Java? Python?) I think that's a strong reason to use it. I don't know Unicode well enough to say otherwise.
The internal representations don't matter except in the case of making FFI linkages. The external representations do, and UTF-8 has won on that front.
It could matter for performance. However, you can encode your UnicodeString into any external representation you want for your I/O needs, including UTF-8.
Right. I was trying to say "other languages internal representations shouldn't affect the choice of those doing a Haskell implementation." -- Aaron Denney -><-

On Wed, 2007-09-26 at 09:05 +0200, Johan Tibell wrote:
I'll look over the proposal more carefully when I get time, but the most important issue is to not let the storage type leak into the interface.
Agreed,
From an implementation point of view, UTF-16 is the most efficient representation for processing Unicode. It's the native Unicode representation for Windows, Mac OS X, and the ICU open source i18n library. UTF-8 is not very efficient for anything except English. Its most valuable property is compatibility with software that thinks of character strings as byte arrays, and in fact that's why it was invented.
If UTF-16 is what's used by everyone else (how about Java? Python?) I think that's a strong reason to use it. I don't know Unicode well enough to say otherwise.
I disagree. I realize I'm a dissenter in this regard, but my position is: excellent Unix support first, portability second, excellent support for Win32/MacOS a distant third. That seems to be the opposite of every language's position. Unix absolutely needs UTF-8 for backward compatibility. jcc

In message <1190825044.9435.1.camel@jcchost> Jonathan Cast
On Wed, 2007-09-26 at 09:05 +0200, Johan Tibell wrote:
If UTF-16 is what's used by everyone else (how about Java? Python?) I think that's a strong reason to use it. I don't know Unicode well enough to say otherwise.
I disagree. I realize I'm a dissenter in this regard, but my position is: excellent Unix support first, portability second, excellent support for Win32/MacOS a distant third. That seems to be the opposite of every language's position. Unix absolutely needs UTF-8 for backward compatibility.
I think you're talking about different things, internal vs external representations.

Certainly we must support UTF-8 as an external representation. The choice of internal representation is independent of that. It could be [Char] or some memory efficient packed format in a standard encoding like UTF-8,16,32. The choice depends mostly on ease of implementation and performance. Some formats are easier/faster to process but there are also conversion costs so in some use cases there is a performance benefit to the internal representation being the same as the external representation.

So, the obvious choices of internal representation are UTF-8 and UTF-16. UTF-8 has the advantage of being the same as a common external representation so conversion is cheap (only need to validate rather than copy). UTF-8 is more compact for western languages but less compact for eastern languages compared to UTF-16. UTF-8 is a more complex encoding in the common cases than UTF-16. In the common case UTF-16 is effectively fixed width. According to the ICU implementors this has speed advantages (probably due to branch prediction and smaller code size).

One solution is to do both and benchmark them.

Duncan

On Wed, 2007-09-26 at 18:46 +0100, Duncan Coutts wrote:
In message <1190825044.9435.1.camel@jcchost> Jonathan Cast
writes: On Wed, 2007-09-26 at 09:05 +0200, Johan Tibell wrote:
If UTF-16 is what's used by everyone else (how about Java? Python?) I think that's a strong reason to use it. I don't know Unicode well enough to say otherwise.
I disagree. I realize I'm a dissenter in this regard, but my position is: excellent Unix support first, portability second, excellent support for Win32/MacOS a distant third. That seems to be the opposite of every language's position. Unix absolutely needs UTF-8 for backward compatibility.
I think you're talking about different things, internal vs external representations.
Certainly we must support UTF-8 as an external representation. The choice of internal representation is independent of that. It could be [Char] or some memory efficient packed format in a standard encoding like UTF-8,16,32. The choice depends mostly on ease of implementation and performance. Some formats are easier/faster to process but there are also conversion costs so in some use cases there is a performance benefit to the internal representation being the same as the external representation.
So, the obvious choices of internal representation are UTF-8 and UTF-16. UTF-8 has the advantage of being the same as a common external representation so conversion is cheap (only need to validate rather than copy). UTF-8 is more compact for western languages but less compact for eastern languages compared to UTF-16. UTF-8 is a more complex encoding in the common cases than UTF-16. In the common case UTF-16 is effectively fixed width. According to the ICU implementors this has speed advantages (probably due to branch prediction and smaller code size).
One solution is to do both and benchmark them.
OK, right. jcc

On 26 Sep 2007, at 7:05 pm, Johan Tibell wrote:
If UTF-16 is what's used by everyone else (how about Java? Python?) I think that's a strong reason to use it. I don't know Unicode well enough to say otherwise.
Java uses 16-bit variables to hold characters. This is SOLELY for historical reasons, not because it is a good choice. The history is a bit funny: the ISO 10646 group were working away defining a 31-bit character set, and the industry screamed blue murder about how this was going to ruin the economy, bring back the Dark Ages, &c, and promptly set up the Unicode consortium to define a 16-bit character set that could do the same job.

Early versions of Unicode had only about 30,000 characters, after heroic (and not entirely appreciated) efforts at unifying Chinese characters as used in China with those used in Japan and those used in Korea. They also lumbered themselves (so that they would have a fighting chance of getting Unicode adopted) with a "round trip conversion" policy, namely that it should be possible to take characters using ANY current encoding standard, convert them to Unicode, and then convert back to the original encoding with no loss of information. This led to failure of unification: there are two versions of Å (one for ordinary use, one for Angstroms), two versions of mu (one for Greek, one for micron), three complete copies of ASCII, &c.

However, 16 bits really is not enough. Here's a table from http://www.unicode.org/versions/Unicode5.0.0/

    Graphic        98,884
    Format            140
    Control            65
    Private Use   137,468
    Surrogate       2,048
    Noncharacter       66
    Reserved      875,441

Excluding Private Use and Reserved, I make that 101,203 currently defined codes. That's nearly 1.5 times the number that would fit in 16 bits.

Java has had to deal with this, don't think it hasn't. For example, where Java had one set of functions referring to characters in strings by position, it now has two complete sets: one to use *which 16-bit code* (which is fast) and one to use *which actual Unicode character* (which is slow). The key point is that the second set is *always* slow even when there are no characters outside the basic multilingual plane.

One Smalltalk system I sometimes use has three complete string implementations (all characters fit in a byte, all characters fit in 16 bits, some characters require more) and dynamically switches from narrow strings to wide strings behind your back. In a language with read-only strings, that makes a lot of sense; it's just a pity Smalltalk isn't one.

If you want to minimize conversion effort when talking to the operating system, files, and other programs, UTF-8 is probably the way to go. (That's on Unix. For Windows it might be different.) If you want to minimize the effort of recognising character boundaries while processing strings, 32-bit characters are the way to go. If you want to be able to index into a string efficiently, they are the *only* way to go. Solaris bit the bullet many years ago; Sun C compilers jumped straight from 8-bit wchar_t to 32-bit without ever stopping at 16.

16-bit characters *used* to be a reasonable compromise, but aren't any longer. Unicode keeps on growing. There were 1,349 new characters from Unicode 4.1 to Unicode 5.0 (IIRC). There are lots more scripts in the pipeline. (What the heck _is_ Tangut, anyway?)
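A rough sketch of the indexing cost (assuming a plain list of UTF-16 code units): the surrogate-pair case is what forces the linear walk that the slower, code-point-based set of functions has to do.

    import Data.Word (Word16)

    -- Index the n-th code point in a sequence of UTF-16 code units.
    -- Because surrogate pairs occupy two units, we cannot jump straight
    -- to position n; the prefix has to be walked.
    codePointAt :: Int -> [Word16] -> Maybe Char
    codePointAt _ [] = Nothing
    codePointAt n (w:ws)
      | isHigh w  = case ws of
          (w2:rest) | isLow w2 ->
            if n == 0 then Just (combine w w2) else codePointAt (n - 1) rest
          _ -> Nothing                   -- ill-formed: unpaired high surrogate
      | isLow w   = Nothing              -- ill-formed: unpaired low surrogate
      | n == 0    = Just (toEnum (fromIntegral w))
      | otherwise = codePointAt (n - 1) ws
      where
        isHigh u = u >= 0xD800 && u <= 0xDBFF
        isLow  u = u >= 0xDC00 && u <= 0xDFFF
        combine hi lo =
          toEnum (0x10000 + (fromIntegral hi - 0xD800) * 0x400
                          + (fromIntegral lo - 0xDC00))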

On 9/27/07, ok
(What the heck _is_ Tangut, anyway?)
participants (27)

- Aaron Denney
- Albert Y. C. Lai
- Bas van Dijk
- Brandon S. Allbery KF8NH
- Chaddaï Fouché
- ChrisK
- Dan Weston
- David Menendez
- Deborah Goldsmith
- Duncan Coutts
- Isaac Dupree
- jerzy.karczmarczuk@info.unicaen.fr
- Johan Tibell
- Jonathan Cast
- Jonathan Cast
- Juanma Barranquero
- Ketil Malde
- Miguel Mitrofanov
- ok
- Ross Paterson
- Stefan O'Rear
- Stephane Bortzmeyer
- Thomas Conway
- Tomasz Zielonka
- Tony Finch
- Twan van Laarhoven
- Vitaliy Akimov