
On Nov 26, 2007, at 19:23 , Maurício wrote:
Are 'String's in GHC 6.6.1 UTF-8?
No. type String = [Char] and Char stores Unicode codepoints. However, the IO system truncates them to 8 bits. I think there are UTF8 marshaling libraries on hackage these days, though.
-- brandon s. allbery [solaris,freebsd,perl,pugs,haskell] allbery@kf8nh.com
system administrator [openafs,heimdal,too many hats] allbery@ece.cmu.edu
electrical and computer engineering, carnegie mellon university KF8NH

allbery:
On Nov 26, 2007, at 19:23 , Maurício wrote:
Are 'String's in GHC 6.6.1 UTF-8?
No.
type String = [Char]
and Char stores Unicode codepoints. However, the IO system truncates them to 8 bits. I think there are UTF8 marshaling libraries on hackage these days, though.
Yep, utf8-string, in particular. -- Don
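To see concretely that Char is a full Unicode code point rather than a byte, here is a small sketch using only the standard Data.Char functions (ord and chr), independent of any marshaling library:

```haskell
import Data.Char (chr, ord)

-- A Char holds a Unicode code point, which may not fit in 8 bits.
main :: IO ()
main = do
  print (ord 'a')          -- 97: plain ASCII
  print (ord '€')          -- 8364: the euro sign, well beyond 8 bits
  print (chr 8364 == '€')  -- True: chr/ord round-trip
```

It is exactly the code points above 255 that the H98 IO layer silently truncates on output.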

Brandon S. Allbery KF8NH wrote:
However, the IO system truncates [characters] to 8 bits.
Should this be considered a bug? I presume that it's because [it] was defined in the days of ASCII-only strings, and the functions in System.IO are defined in terms of [it]. But does this need to be the case in the future? Unfortunately I don't know enough about Unicode IO to judge. Paul.

On Tue, 2007-11-27 at 18:38 +0000, Paul Johnson wrote:
Brandon S. Allbery KF8NH wrote:
However, the IO system truncates [characters] to 8 bits.
Should this be considered a bug?
A design problem.
I presume that it's because [it] was defined in the days of ASCII-only strings, and the functions in System.IO are defined in terms of [it]. But does this need to be the case in the future?
When it's phrased as "truncates to 8 bits" it sounds so simple; surely all we need to do is not truncate to 8 bits, right? The problem is, what encoding should it pick? UTF-8, 16, 32, EBCDIC? How would people specify that they really want to use a binary file? Whatever we change, it'll break programs that use the existing meanings.

One sensible suggestion many people have made is that H98 file IO should use the locale encoding and do Unicode/String <-> locale conversion. So that'd all be text files. Then openBinaryFile would be used for binary files. Of course then we'd need control over setting the encoding and what to do on encountering encoding errors.

IMHO, someone should make a full proposal by implementing an alternative System.IO library that deals with all these encoding issues and implements H98 IO in terms of that. It doesn't have to be fast initially; it just has to get the API right, and not design the API so as to exclude the possibility of a fast implementation later.

Duncan

Duncan Coutts wrote:
On Tue, 2007-11-27 at 18:38 +0000, Paul Johnson wrote:
Brandon S. Allbery KF8NH wrote:
However, the IO system truncates [characters] to 8 bits.
Should this be considered a bug?
A design problem.
I presume that it's because [it] was defined in the days of ASCII-only strings, and the functions in System.IO are defined in terms of [it]. But does this need to be the case in the future?
When it's phrased as "truncates to 8 bits" it sounds so simple; surely all we need to do is not truncate to 8 bits, right?
The problem is, what encoding should it pick? UTF-8, 16, 32, EBCDIC? How would people specify that they really want to use a binary file? Whatever we change, it'll break programs that use the existing meanings.
One sensible suggestion many people have made is that H98 file IO should use the locale encoding and do Unicode/String <-> locale conversion. So that'd all be text files. Then openBinaryFile would be used for binary files. Of course then we'd need control over setting the encoding and what to do on encountering encoding errors.
Wouldn't it be sensible not to use the H98 file I/O operations at all anymore with binary files? A Char represents a Unicode code point value and is not the right data type to use to represent a byte from a binary stream. Whoever wants binary I/O would have to use Data.ByteString.* and Data.Binary.

So you would use System.IO.hPutStr to write a text string, and Data.ByteString.hPutStr to write a sequence of bytes. Probably, a good implementation of the former could be made in terms of the latter.

Reinier
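The text/binary split Reinier describes can be sketched with the bytestring package (an assumption; it was already on Hackage at the time): [Char] counts code points, ByteString counts raw bytes, and the UTF-8 form of the same text has a different length:

```haskell
import qualified Data.ByteString as B
import System.IO (stdout)

-- The text "héllo" as a String: five Unicode code points.
text :: String
text = "h\233llo"

-- The same text encoded as UTF-8: six bytes, because 'é' (U+00E9)
-- takes two bytes in UTF-8 (0xC3 0xA9).
bytes :: B.ByteString
bytes = B.pack [0x68, 0xC3, 0xA9, 0x6C, 0x6C, 0x6F]

main :: IO ()
main = do
  print (length text)     -- 5 code points
  print (B.length bytes)  -- 6 bytes
  B.hPut stdout bytes     -- writes raw bytes, no encoding applied
```

The point of the two types is that the length mismatch is visible in the types rather than discovered at runtime.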

(...) When it's phrased as "truncates to 8 bits" it sounds so simple, surely all we need to do is not truncate to 8 bits right?
The problem is, what encoding should it pick? UTF-8, 16, 32, EBCDIC? (...)
One sensible suggestion many people have made is that H98 file IO should use the locale encoding and do Unicode/String <-> locale conversion. (...)
I'm really afraid of solutions where the behavior of your program changes with an environment variable that not everybody has configured properly, or even knows exists.
Wouldn't it be sensible not to use the H98 file I/O operations at all anymore with binary files? A Char represents a Unicode code point value and is not the right data type to use to represent a byte from a binary stream.
That seems nice; we would not have to create a "wide char" type just for Unicode.

This topic made me search the net for that nice quote: "Explanations exist: they have existed for all times, for there is always an easy solution to every problem — neat, plausible and wrong." (See: en.wikiquote.org/wiki/H._L._Mencken; that guy has many quotes worth reading.) Strings as char lists are a very good example of that. It's simple and clean, but strings are not char lists in any reasonable sense.

Best,
Maurício

On Wed, 2007-11-28 at 17:38 -0200, Maurício wrote:
(...) When it's phrased as "truncates to 8 bits" it sounds so simple, surely all we need to do is not truncate to 8 bits right?
The problem is, what encoding should it pick? UTF-8, 16, 32, EBCDIC? (...)
One sensible suggestion many people have made is that H98 file IO should use the locale encoding and do Unicode/String <-> locale conversion. (...)
I'm really afraid of solutions where the behavior of your program changes with an environment variable that not everybody has configured properly, or even knows exists.
Be afraid of all your standard Unix utils in that case. They are all locale dependent, not just for encoding but also for sorting order and the language of messages. Using the locale is standard Unix behaviour (and these days the locale usually specifies UTF-8 encoding).

On OS X the default should be UTF-8. On Windows it's a bit less clear: supposedly text files should use UTF-16, but nobody actually does that as far as I can see.

Duncan

Duncan Coutts wrote:
On Wed, 2007-11-28 at 17:38 -0200, Maurício wrote:
(...) When it's phrased as "truncates to 8 bits" it sounds so simple, surely all we need to do is not truncate to 8 bits right?
The problem is, what encoding should it pick? UTF-8, 16, 32, EBCDIC? (...)
One sensible suggestion many people have made is that H98 file IO should use the locale encoding and do Unicode/String <-> locale conversion. (...)
I'm really afraid of solutions where the behavior of your program changes with an environment variable that not everybody has configured properly, or even knows exists.
Be afraid of all your standard Unix utils in that case. They are all locale dependent, not just for encoding but also for sorting order and the language of messages.
Language of messages is quite different from the language of a file you read. Suppose I am English, and I have a Russian friend, Vlad. My default locale is, say, Latin-1, and his is something Cyrillic. I might well open files including my own files, and his files. The locale of the current user is simply no guide to the correct encoding to read a file in, and not a particularly reliable guide to writing a file out.

Locale makes perfect sense for messages (you are communicating with the user; his locale tells you what language he speaks). It makes much less sense for file IO.

Jules

On Thu, 2007-11-29 at 13:05 +0000, Jules Bean wrote:
Language of messages is quite different from language of a file you read.
Suppose I am English, and I have a russian friend, Vlad.
My default locale is, say, latin-1, and his is something cyrillic.
I might well open files including my own files, and his files. The locale of the current user is simply no guide to the correct encoding to read a file in, and not a particularly reliable guide to writing a file out.
Locale makes perfect sense for messages (you are communicating with the user, his locale tells you what language he speaks). It makes much less sense for file IO.
Yes, it's a fundamental limitation of the Unix locale system and multi-user systems. However, it's no less wrong than just picking UTF-8 all the time. Obviously one needs a text file API that allows one to specify the encoding for the cases where you happen to know it, but for the H98 file API, where there is no way of specifying an encoding, what's better than using the Unix default method? (At least on Unix.)

Duncan

Language of messages is quite different from language of a file you read. (...)
Yes, it's a fundamental limitation of the unix locale system and multi-user systems. However it's no less wrong than just picking UTF8 all the time. (...)
Am I wrong to think that UTF-8 should be THE standard? I believe it can encode anything encoded by other encodings. Can't we consider non-UTF-8 text as "legacy"? I don't like that word, but I do think it is the right way to go for text. If you know your text has a different encoding, just use 'iconv' to convert it, or a special Haskell library for conversion. That will make life difficult for a few, but make life a lot easier for programmers and users.

Maurício

Am I wrong to think that UTF8 should be THE standard? I believe it can encode anything encoded by other encodings.
All the UTF-* encodings can encode the same code points. There are different trade-offs, though.
Can't we consider non-UTF-8 text as "legacy"? I don't like that word, but I do think it is the right way to go for text. If you know your text has a different encoding, just use 'iconv' to convert it, or a special Haskell library for conversion.
The important thing (I think) is to have an abstract concept that encompasses all the necessary characters (i.e. Unicode) and then a few well-specified encodings with different trade-offs. A Unicode Haskell library should handle at least a few of them (and, more importantly, keep track of the encoding). -- Johan
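To make the trade-offs concrete, here is a minimal, illustration-only UTF-8 encoder (the function name encodeChar is hypothetical, and it covers only code points up to U+FFFF): ASCII stays one byte, while higher code points grow to two or three bytes:

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- Encode a single code point to UTF-8 bytes (BMP only, for brevity).
encodeChar :: Char -> [Word8]
encodeChar c
  | n < 0x80  = [fromIntegral n]                            -- 1 byte (ASCII)
  | n < 0x800 = [ 0xC0 .|. fromIntegral (n `shiftR` 6)      -- 2 bytes
                , 0x80 .|. fromIntegral (n .&. 0x3F) ]
  | otherwise = [ 0xE0 .|. fromIntegral (n `shiftR` 12)     -- 3 bytes
                , 0x80 .|. fromIntegral ((n `shiftR` 6) .&. 0x3F)
                , 0x80 .|. fromIntegral (n .&. 0x3F) ]
  where n = ord c

main :: IO ()
main = do
  print (encodeChar 'a')  -- [97]
  print (encodeChar 'é')  -- [195,169]      (0xC3 0xA9)
  print (encodeChar '€')  -- [226,130,172]  (0xE2 0x82 0xAC)
```

This is the trade-off Johan alludes to: UTF-8 is compact for ASCII-heavy text but variable width, whereas UTF-32 is fixed width but four bytes per character.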

A translation of
http://www.ahinea.com/en/tech/perl-unicode-struggle.html
from Perl to Haskell would be a very useful piece of documentation, I think. That explanation really helped me get to grips with the encoding stuff, in a Perl context.
thomas.
Duncan Coutts
(...) When it's phrased as "truncates to 8 bits" it sounds so simple, surely all we need to do is not truncate to 8 bits right?
The problem is, what encoding should it pick? UTF-8, 16, 32, EBCDIC? (...)
One sensible suggestion many people have made is that H98 file IO should use the locale encoding and do Unicode/String <-> locale conversion. (...)
I'm really afraid of solutions where the behavior of your program changes with an environment variable that not everybody has configured properly, or even knows exists.
Be afraid of all your standard Unix utils in that case. They are all locale dependent, not just for encoding but also for sorting order and the language of messages. Using the locale is standard Unix behaviour (and these days the locale usually specifies UTF8 encoding). On OSX the default should be UTF8. On Windows it's a bit less clear, supposedly text files should use UTF16 but nobody actually does that as far as I can see. Duncan

Thomas Hartman wrote:
A translation of
http://www.ahinea.com/en/tech/perl-unicode-struggle.html
from perl to haskell would be a very useful piece of documentation, I think.
Perl encodes both Unicode and binary data as the same (dynamic) data type. Haskell - at least in theory - has two different types for them, namely [Char] for characters and [Word8] or ByteString for sequences of bytes. I think the Haskell approach is better, because the programmer in most cases knows whether he wants to treat his data as characters or as bytes. Perl does it the Perlish "we guess at what the coder means" way, which leads to a lot of frustration when Perl guesses wrong.

The problems of the Haskeller trying to use Unicode, I think, will be different from those of the Perl hacker trying to use Unicode: the Haskeller will have to search for third-party modules to do what he wants, and finding those modules is the problem. The Perl hacker has all the Unicode support built in, but has to fight Perl occasionally to keep it from doing byte operations on his Unicode data. I had a colleague here go all but insane last week trying to use 'split' on a Unicode string in Perl on Windows: split would break the string in the middle of a UTF-8 wide character, crashing UTF-8 processing later on.

Reinier

Duncan Coutts wrote:
When it's phrased as "truncates to 8 bits" it sounds so simple, surely all we need to do is not truncate to 8 bits right?
The problem is, what encoding should it pick? UTF-8, 16, 32, EBCDIC? How would people specify that they really want to use a binary file? Whatever we change, it'll break programs that use the existing meanings.
One sensible suggestion many people have made is that H98 file IO should use the locale encoding and do Unicode/String <-> locale conversion. So that'd all be text files. Then openBinaryFile would be used for binary files. Of course then we'd need control over setting the encoding and what to do on encountering encoding errors.
IMHO, someone should make a full proposal by implementing an alternative System.IO library that deals with all these encoding issues and implements H98 IO in terms of that.
It doesn't have to be fast initially, it just has to get the API right and not design the API so as to exclude the possibility of a fast implementation later.
In my humble opinion, what should happen is this: we need two separate interfaces, one for text-mode I/O and one for raw binary I/O. ByteString provides some of the latter. [Can you use that on network sockets?] I guess what's needed is a good binary library to go with it. [I know there's been quite a few people who've had a go at this part...]

When doing text-mode I/O, the programmer needs to be able to explicitly specify exactly which character encoding is required. (Presumably defaulting to the current 8-bit truncation encoding?) That way the programmer can decide exactly how to choose an encoding, rather than the library designer trying to guess what The Right Thing is for all possible application programs. And it needs to be possible to cleanly add new encodings too.

I'd have a go at implementing all this myself, but I wouldn't know where to begin...
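For the record, GHC eventually (in 6.12) grew essentially this design in System.IO, with per-handle TextEncoding values; a sketch of how explicitly chosen encodings look there:

```haskell
import System.IO

main :: IO ()
main = do
  -- Write text with an explicitly chosen encoding.
  h <- openFile "out.txt" WriteMode
  hSetEncoding h utf8                    -- per-handle encoding, no guessing
  hPutStrLn h "h\233llo, w\246rld"
  hClose h
  -- Read it back, again stating the encoding rather than relying on the locale.
  h' <- openFile "out.txt" ReadMode
  hSetEncoding h' utf8
  line <- hGetLine h'
  hClose h'
  print (line == "h\233llo, w\246rld")
```

System.IO also exports latin1 and localeEncoding, so a program can choose the locale behaviour deliberately instead of getting it by default.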

Hello Andrew, Thursday, November 29, 2007, 1:11:38 AM, you wrote:
IMHO, someone should make a full proposal by implementing an alternative System.IO library that deals with all these encoding issues and implements H98 IO in terms of that.
We need two separate interfaces. One for text-mode I/O, one for raw binary I/O.
When doing text-mode I/O, the programmer needs to be able to explicitly specify exactly which character encoding is required. (Presumably default to the current 8-bit truncation encoding?)
http://haskell.org/haskellwiki/Library/Streams already exists -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

Bulat Ziganshin wrote:
Hello Andrew,
Thursday, November 29, 2007, 1:11:38 AM, you wrote:
IMHO, someone should make a full proposal by implementing an alternative System.IO library that deals with all these encoding issues and implements H98 IO in terms of that.
We need two separate interfaces. One for text-mode I/O, one for raw binary I/O.
When doing text-mode I/O, the programmer needs to be able to explicitly specify exactly which character encoding is required. (Presumably default to the current 8-bit truncation encoding?)
http://haskell.org/haskellwiki/Library/Streams already exists
Which would mean that we have streams to do character I/O, ByteString to do binary I/O, and System.IO to do, eh, something in between. That seems rather unfortunate to me.

While the "truncate to 8 bits" semantics may be nice to keep old code working, it really isn't all that intuitive. When I do 'putStr "u\776"', I want a u with an umlaut to appear, not to get it printed as if it were "u\8". The strange thing is that Hugs at the moment _does_ print a u-umlaut, while GHCi prints "u\8", which is a u followed by a backspace, so I see nothing.

Reinier
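The "u\776" example is easy to check: '\776' is U+0308, COMBINING DIAERESIS, so the string holds two code points that a correct text layer should render as a single ü glyph:

```haskell
import Data.Char (ord)

main :: IO ()
main = do
  let s = "u\776"    -- 'u' followed by U+0308 COMBINING DIAERESIS
  print (map ord s)  -- [117,776]: two code points, one grapheme
  print (length s)   -- 2
  putStrLn s         -- should render as ü on a UTF-8-aware terminal
```

Truncating 776 to 8 bits yields 8, the backspace character, which is exactly the "u\8" behaviour Reinier observes.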

Hello Reinier, Thursday, November 29, 2007, 1:13:24 PM, you wrote:
IMHO, someone should make a full proposal by implementing an alternative System.IO library that deals with all these encoding issues and implements H98 IO in terms of that.
http://haskell.org/haskellwiki/Library/Streams already exists
Which would mean that we have streams to do character I/O, ByteString to do binary I/O, and System.IO to do, eh, something in between.
This means only that such a proposal exists. I've worked on adding bytestream support too, but haven't finished the work. At least it's possible. I hope that the new I/O library will have a modular design like this, so it will be easy to add new features as 3rd-party libs. -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com
participants (11)
- Andrew Coppin
- Brandon S. Allbery KF8NH
- Bulat Ziganshin
- Don Stewart
- Duncan Coutts
- Johan Tibell
- Jules Bean
- Maurício
- Paul Johnson
- Reinier Lamers
- Thomas Hartman