Haskell future and UTF8 vs UTF-16

Hi all,
What is the current and future status of UTF8 vs UTF-16 in the haskell world?
I understand that currently Text uses UTF-16, and it is used generally because of compatibility requirements in the Microsoft ecosystem, but that there are movements afoot to move to a UTF8 only environment at some unspecified future point.
The question arises as I ponder a pull request on haskell-lsp to switch to a UTF-16 based library [1].
Alan
[1] https://github.com/alanz/haskell-lsp/pull/70

On 11 Feb 2018, at 10:39, Alan & Kim Zimmerman wrote:
What is the current and future status of UTF8 vs UTF-16 in the haskell world?
I understand that currently Text uses UTF-16, and it is used generally because of compatibility requirements in the Microsoft ecosystem, but that there are movements afoot to move to a UTF8 only environment at some unspecified future point.
As far as I know there was a UTF-8 fork of Text made as part of the Summer of Code a year or so ago, but it got ditched because it turned out to be slower than the UTF-16 version in practice.
So as far as I know, there's no real plan to move to UTF-8, especially since the internal encoding used by Text is pretty much irrelevant to most users of Text.
Cheers, Merijn

On 11.02.2018 at 12:29, Merijn Verstraaten wrote:
As far as I know there was a UTF-8 fork of Text made as part of the Summer of Code a year or so ago, but it got ditched because it turned out to be slower than the UTF-16 version in practice.
Mmm... correctness is another relevant point here. Does Text handle characters beyond the Basic Multilingual Plane (U+0000 to U+FFFF) properly, or does one have to deal with "surrogate pairs" there?
I'm curious because I am seeing this kind of trouble in the Java world. The standard libraries there have pretty weak support for characters beyond U+FFFF, so most Java programmers pretend that these don't exist. I'm pretty sure Chinese users hate Java for that reason...
Regards, Jo

On Feb 12, 2018 10:57 AM, "Joachim Durchholz" wrote:
Mmm... correctness is another relevant point here. Does Text handle characters beyond the Basic Multilingual Plane (U+0000 to U+FFFF) properly, or does one have to deal with "surrogate pairs" there?
I'm curious because I am seeing this kind of trouble in the Java world. The standard libraries there have pretty weak support for characters beyond U+FFFF, so most Java programmers pretend that these don't exist. I'm pretty sure Chinese users hate Java for that reason...
IIRC, the public Text interface works with code points, not 16-bit units. Length and indexing are O(n) for this reason. So there should be no issues from a correctness point of view.
Chris
_______________________________________________
Haskell-Cafe mailing list
To (un)subscribe, modify options or view archives go to: http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
Only members subscribed via the mailman list are allowed to post.
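The point about code points versus 16-bit units can be checked with a small program. This is only a sketch, assuming the `text` and `bytestring` packages are available; the names `sample`, `utf16Units` and `utf8Bytes` are made up for illustration and are not part of any library. It deliberately goes through the public API only, so the result should not depend on which encoding Text uses internally.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the Basic Multilingual
-- Plane, so UTF-16 encodes it as a surrogate pair (two 16-bit units).
sample :: T.Text
sample = T.pack ['a', '\x1D11E', 'z']

-- Number of 16-bit units in the UTF-16 encoding (hypothetical helper).
utf16Units :: T.Text -> Int
utf16Units t = B.length (TE.encodeUtf16LE t) `div` 2

-- Number of bytes in the UTF-8 encoding (hypothetical helper).
utf8Bytes :: T.Text -> Int
utf8Bytes t = B.length (TE.encodeUtf8 t)

main :: IO ()
main = do
  print (T.length sample)                -- 3: length counts code points
  print (utf16Units sample)              -- 4: 1 + 2 (surrogate pair) + 1
  print (utf8Bytes sample)               -- 6: 1 + 4 + 1
  print (T.index sample 1 == '\x1D11E')  -- True: indexing yields the whole Char
  print (T.unpack (T.take 2 sample) == ['a', '\x1D11E'])  -- True: take never splits a pair
```

The surrogate pair shows up only in the encoded byte strings; `T.length`, `T.index` and `T.take` all work in terms of `Char`, which is why they have to walk the text.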

I'd actually been thinking about whether it'd be worth it to include a
fingertree of character lengths in order to make length O(1) and
indexing, take, and drop O(log n). However, a Text is currently three
unpacked values, and putting something that can't be unboxed in there
may not be such a good idea.
On Sun, Feb 11, 2018 at 5:51 PM, Chris Wong wrote:
IIRC, the public Text interface works with code points, not 16-bit units. Length and indexing are O(n) for this reason.
So there should be no issues from a correctness point of view.
Chris

On 13/02/2018, Zemyla wrote:
I'd actually been thinking about whether it'd be worth it to include a fingertree of character lengths in order to make length O(1) and indexing, take, and drop O(log n). However, a Text is currently three unpacked values, and putting something that can't be unboxed in there may not be such a good idea.
Yeah, whoever needs these operations would likely do better to use `Vector Char` or similar, or to define a wrapper type that includes the character length information, lest we penalize all users for it.
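A wrapper of the kind suggested here might look roughly like the following. This is a minimal sketch, not an existing library type: `SizedText`, `fromText` and `append` are hypothetical names, and only the `text` package is assumed.

```haskell
import qualified Data.Text as T

-- Hypothetical wrapper: a Text paired with its length in code points.
data SizedText = SizedText
  { sizedLength :: !Int     -- cached number of Chars, read in O(1)
  , sizedText   :: !T.Text  -- the underlying text, unchanged
  }

-- Pays the O(n) length computation once, at construction time.
fromText :: T.Text -> SizedText
fromText t = SizedText (T.length t) t

-- Lengths add up under concatenation, so no rescan is needed.
append :: SizedText -> SizedText -> SizedText
append (SizedText n a) (SizedText m b) = SizedText (n + m) (T.append a b)

main :: IO ()
main = do
  let s = fromText (T.pack "hello") `append` fromText (T.pack ['\x1D11E'])
  print (sizedLength s)                            -- 6
  print (sizedLength s == T.length (sizedText s))  -- True
```

This only makes the length query O(1); indexing, take and drop stay O(n) on the wrapped Text, so it covers just part of what the fingertree idea was after. The cost is carried only by code that opts into the wrapper.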

Hi Alan,
On 02/11/2018 10:39 AM, Alan & Kim Zimmerman wrote:
What is the current and future status of UTF8 vs UTF-16 in the haskell world?
The only somewhat active effort to move towards UTF-8 in `text` that I’m aware of is https://github.com/text-utf8. I’m not personally involved with that project so I can’t tell you much more but you might want to contact the authors.
Cheers, Moritz

There is also Foundation.String, which I heard people speak enthusiastically about: https://hackage.haskell.org/package/foundation-0.0.19/docs/Foundation-String...
Cheers, Adam
On Sun, 11 Feb 2018 at 16:52 Moritz Kiefer <moritz.kiefer@purelyfunctional.org> wrote:
Hi Alan,
On 02/11/2018 10:39 AM, Alan & Kim Zimmerman wrote:
What is the current and future status of UTF8 vs UTF-16 in the haskell world?
The only somewhat active effort to move towards UTF-8 in `text` that I’m aware of is https://github.com/text-utf8. I’m not personally involved with that project so I can’t tell you much more but you might want to contact the authors.
Cheers, Moritz
participants (8)
- Adam Bergmark
- Alan & Kim Zimmerman
- Chris Wong
- Joachim Durchholz
- M Farkas-Dyck
- Merijn Verstraaten
- Moritz Kiefer
- Zemyla