On Feb 12, 2018 10:57 AM, "Joachim Durchholz" <jo@durchholz.org> wrote:

On 11.02.2018 at 12:29, Merijn Verstraaten wrote:

On 11 Feb 2018, at 10:39, Alan & Kim Zimmerman <alan.zimm@gmail.com> wrote:

What is the current and future status of UTF-8 vs UTF-16 in the Haskell world?

I understand that Text currently uses UTF-16, and that it is generally used because of compatibility requirements in the Microsoft ecosystem, but that there are movements afoot to move to a UTF-8-only environment at some unspecified future point.

As far as I know, there was a UTF-8 fork of Text made as part of the Summer of Code a year or so ago, but it got ditched because it turned out to be slower than the UTF-16 version in practice.

Mmm... correctness is another relevant point here.
Does Text handle characters beyond the Basic Multilingual Plane (U+0000 to U+FFFF) properly, or does one have to deal with "surrogate pairs" there?

I'm curious because I am seeing this kind of trouble in the Java world. The standard libraries there have pretty weak support for characters beyond U+FFFF, so most Java programmers pretend that these don't exist. I'm pretty sure Chinese users hate Java for that reason...

IIRC, the public Text interface works with code points, not 16-bit code units. Length and indexing are O(n) for this reason.

So there should be no issues from a correctness point of view.
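
For illustration, here is a minimal sketch of what that looks like in practice, assuming the text and bytestring packages are available. The name `sample` is made up for this example, and U+1D11E (MUSICAL SYMBOL G CLEF) is just one convenient character outside the BMP.

```haskell
module Main where

import qualified Data.ByteString as BS
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Hypothetical sample value: 'a', then U+1D11E (outside the BMP), then 'z'.
sample :: T.Text
sample = T.pack ['a', '\x1D11E', 'z']

main :: IO ()
main = do
  -- Length is reported in characters, not in 16-bit code units.
  print (T.length sample)
  -- Indexing returns the whole character, never half of a surrogate pair.
  print (T.index sample 1 == '\x1D11E')
  -- Encoded as UTF-16 the same text takes 2 + 4 + 2 bytes,
  -- i.e. four 16-bit code units for three characters.
  print (BS.length (TE.encodeUtf16LE sample))
```

The clef counts as one character for `T.length` and `T.index` even though UTF-16 needs a surrogate pair for it, which fits with both operations being O(n).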

Chris

Regards,
Jo

_______________________________________________
Haskell-Cafe mailing list
To (un)subscribe, modify options or view archives go to:
http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
Only members subscribed via the mailman list are allowed to post.