
From the H98 report:
All I/O functions defined here are character oriented. [...] These functions cannot be used portably for binary I/O. In the following, recall that String is a synonym for [Char] (Section 6.1.2).

So ordinary text Handles are for text, not binary. Char is of course a Unicode code point.

The crucial question of course is what encoding of text to use. For the H98 IO functions we cannot set it as a parameter, we have to pick a sensible default. Currently different implementations disagree on that default. Hugs has for some time used the current locale on posix systems (and I'm guessing the current code page on windows). GHC has always used the Latin-1 encoding.

These days, most operating systems use a locale/codepage encoding that covers the full Unicode range. So on hugs we get the benefit of that but on GHC we do not.

This is endlessly surprising for beginners. They do putStrLn "αβγδεζηθικλ" and it comes out on their terminal as junk. It also causes problems for serious programs, see for example the recent hand-wringing on cabal-devel.

So here is a concrete proposal:

* Haskell98 file IO should always use UTF-8.
* Haskell98 IO to terminals should use the current locale encoding.

The main controversial point I think is whether to always use UTF-8, always use the current locale, or some split as I've suggested. C chose to always go with the current locale. Some people think that was a mistake because the interpretation changes from user to user. For terminals it is more clear cut that the locale is the right choice because that is what the terminal is capable of displaying. Using anything else will produce junk.

We can detect if a handle is a terminal when we open it, using hIsTerminalDevice. This should be done automatically (and ghc would get it for free because it already does that check to determine default buffering modes). Sockets and pipes would be treated the same as files when opened in the default text mode. The only special case is terminals.

The major problem is with code that assumes GHC's Handles are essentially Word8 and layers its own UTF-8 or other decoding over the top. The utf8-string package has this problem for example. Such code should be using openBinaryFile because it is reading/writing binary data, not String text. Note that many programs that really need to work with binary files already use openBinaryFile; those that do not are already broken on Windows, which does cr/lf conversion on text files and breaks many binary formats (though not UTF-8).

So we have to decide which is more painful: keeping a limited text IO system in GHC, or breaking some existing programs which assume GHC's current behaviour. Opinions?

Please can we keep this discussion to the interpretation of the H98 IO functions and not get into the separate discussion of how we could extend or redesign the whole IO system. This is a question of what are the right defaults. Duncan
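A minimal sketch of the dispatch this proposal implies. Only hIsTerminalDevice is an existing System.IO function; the Encoding type and chooseEncoding are illustrative names, not part of any library.

    import System.IO (Handle, hIsTerminalDevice)

    -- Illustrative encoding tag; not an existing type.
    data Encoding = UTF8 | CurrentLocale
      deriving Show

    -- Sketch of the proposed default: terminals get the locale encoding,
    -- everything else opened in text mode (files, pipes, sockets) gets UTF-8.
    chooseEncoding :: Handle -> IO Encoding
    chooseEncoding h = do
      isTerm <- hIsTerminalDevice h
      return (if isTerm then CurrentLocale else UTF8)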

I came to the same conclusions. I think using either the current locale encoding or UTF-8 is a perfectly reasonable interpretation of the standard. Jhc used to use the current locale always, but now it uses UTF-8 always, as that was easier to make portable to other operating systems (though current locale support will likely be added back at some point). I think this is a-okay as far as Haskell 98 goes. Assuming Latin-1 without doing an 'openBinaryFile' is certainly not okay in my book. John -- John Meacham - ⑆repetae.net⑆john⑈

On Tue, Feb 26, 2008 at 11:47:49AM +0000, Duncan Coutts wrote:
The major problem is with code that assumes GHC's Handles are essentially Word8 and layers its own UTF-8 or other decoding over the top. The utf8-string package has this problem for example. Such code should be using openBinaryFile because it is reading/writing binary data, not String text.
As I was saying on cabal-devel, I think this distinction ought to be in the types, i.e. we need, in base, a type distinct from Handle that offers a Word8 interface to binary I/O, as a foundation for various experiments with encodings (which need not all be in base).

Ross Paterson wrote:
On Tue, Feb 26, 2008 at 11:47:49AM +0000, Duncan Coutts wrote:
The major problem is with code that assumes GHC's Handles are essentially Word8 and layers its own UTF-8 or other decoding over the top. The utf8-string package has this problem for example. Such code should be using openBinaryFile because it is reading/writing binary data, not String text.
As I was saying on cabal-devel, I think this distinction ought to be in the types, i.e. we need, in base, a type distinct from Handle that offers a Word8 interface to binary I/O, as a foundation for various experiments with encodings (which need not all be in base).
I agree with a separate handle type, but Duncan's proposal is all about "fixing" the fact that the H98 library doesn't implement the H98 spec. I think that proposal should pass independently. I then think that a System.IO.Binary library, which provides a newtyped Handle for Word8 IO, would be an excellent thing to propose next! Jules
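A minimal sketch of what such a newtyped binary handle could look like. BinaryHandle, openBinary, getWord8 and putWord8 are illustrative names, not an existing API; the byte-level behaviour relies on a binary-mode Handle passing characters 0-255 through untranslated, which is what current implementations do.

    import System.IO (Handle, IOMode, openBinaryFile, hGetChar, hPutChar)
    import Data.Word (Word8)
    import Data.Char (chr, ord)

    -- A handle that is only ever used for bytes, never for [Char].
    newtype BinaryHandle = BinaryHandle Handle

    openBinary :: FilePath -> IOMode -> IO BinaryHandle
    openBinary path mode = fmap BinaryHandle (openBinaryFile path mode)

    -- hGetChar/hPutChar on a binary-mode Handle stand in for a byte interface.
    getWord8 :: BinaryHandle -> IO Word8
    getWord8 (BinaryHandle h) = fmap (fromIntegral . ord) (hGetChar h)

    putWord8 :: BinaryHandle -> Word8 -> IO ()
    putWord8 (BinaryHandle h) = hPutChar h . chr . fromIntegral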

On Tue, 2008-02-26 at 12:44 +0000, Ross Paterson wrote:
On Tue, Feb 26, 2008 at 11:47:49AM +0000, Duncan Coutts wrote:
The major problem is with code that assumes GHC's Handles are essentially Word8 and layers its own UTF-8 or other decoding over the top. The utf8-string package has this problem for example. Such code should be using openBinaryFile because it is reading/writing binary data, not String text.
As I was saying on cabal-devel, I think this distinction ought to be in the types, i.e. we need, in base, a type distinct from Handle that offers a Word8 interface to binary I/O, as a foundation for various experiments with encodings (which need not all be in base).
I agree. If we can come to a consensus on the interpretation of the H98 text Handles then the next step is to start a discussion on a standard binary IO system (and I'd certainly support using a different type of Handle so we never mix up binary data and [Char]). The main point of difference so far seems to be whether we pick a fixed UTF-8 encoding, the current locale encoding, or some mixture depending on the kind of IO object. I think that's where we should focus the discussion initially. It'd be nice if there was agreement between the different implementations. It seems we're not far from agreement between at least hugs, ghc and jhc. Ross, perhaps you can put the argument for what hugs currently does - always using the locale for all terminal and text file IO rather than picking a fixed encoding. Duncan

On Tue, Feb 26, 2008 at 01:07:50PM +0000, Duncan Coutts wrote:
It'd be nice if there was agreement between the different implementations. It seems we're not far from agreement between at least hugs, ghc and jhc.
Ross, perhaps you can put the argument for what hugs currently does - always using the locale for all terminal and text file IO rather than picking a fixed encoding.
I'm not going to claim it's ideal, but the situation created by Haskell 98 is that Handles are supposed to deal in Chars, but their relationship to external encodings is undefined. Given that, implementations have to make a somewhat arbitrary choice. I suppose the argument for the locale is that UTF-8 has not yet taken over the world. I agree that the argument is weaker for files and sockets, since they are shared between different systems. I'm not worried about breaking broken programs. I'll just note in passing that similar issues arise with system calls, notably file operations, program arguments and the environment. But I/O is probably sufficient trouble for today.

Duncan Coutts wrote:
From the H98 report:
All I/O functions defined here are character oriented. [...] These functions cannot be used portably for binary I/O.
In the following, recall that String is a synonym for [Char] (Section 6.1.2).
So ordinary text Handles are for text, not binary. Char is of course a Unicode code point.
The crucial question of course is what encoding of text to use. For the H98 IO functions we cannot set it as a parameter, we have to pick a sensible default. Currently different implementations disagree on that default. Hugs has for some time used the current locale on posix systems (and I'm guessing the current code page on windows). GHC has always used the Latin-1 encoding.
These days, most operating systems use a locale/codepage encoding that covers the full Unicode range. So on hugs we get the benefit of that but on GHC we do not.
This is endlessly surprising for beginners. They do putStrLn "αβγδεζηθικλ" and it comes out on their terminal as junk.
It also causes problems for serious programs, see for example the recent hand-wringing on cabal-devel.
So here is a concrete proposal:
* Haskell98 file IO should always use UTF-8. * Haskell98 IO to terminals should use the current locale encoding.
While I support Duncan's proposal (we discussed it on IRC), I thought I should point out some of the ramifications of this, and the alternatives.

If everything that is not a terminal uses UTF-8 by default, then shell commands may behave in an unexpected way: e.g. for a Haskell program "prog", prog | cat will output in UTF-8, and if your locale encoding is something other than UTF-8 you'll see junk. Similarly, prog >file; cat file will give the same (wrong) result. So some alternatives that fix this are

1. all text I/O is in the locale encoding (what C and Hugs do)
2. stdin/stdout/stderr and terminals are always in the locale encoding, everything else is UTF-8
3. everything is UTF-8

(1) has the advantage of being easy to understand, but causes problems when you want to move a file created on one system to another system, or share files between users. The programmer in this case has to anticipate the problem and set an encoding (and we're not proposing to provide a way to specify encodings yet, so openBinaryFile and a separate UTF-8 step would be required).

(2) has a sort of "do what I want" feel, and will almost certainly cause confusion in some cases, simply because it's an arbitrary choice.

(3) is easy to understand, but does the wrong thing for people who have a locale encoding other than UTF-8.

Duncan's proposal occupies a useful point: text that we know to be ephemeral, because it is being sent to a terminal, is definitely sent in the user's default encoding. Text that might be persistent or might be crossing a locale boundary is always written in UTF-8, which is good for interchange and portability; the catch is that sometimes we identify a Handle as persistent when it is really ephemeral.

Note that sensible people who set their locale to UTF-8 are not affected by any of this - and that includes most new installations of Linux these days, I believe. Cheers, Simon
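For reference, the program behind the shell examples above can be as small as the putStrLn one-liner from the original message; under proposal (0) its output encoding depends on whether stdout is a terminal.

    -- prog.hs: a stand-in for "prog" in the shell examples above.
    -- Under proposal (0) this is written in the locale encoding when stdout
    -- is a terminal, and in UTF-8 when stdout is a file or a pipe.
    main :: IO ()
    main = putStrLn "αβγδεζηθικλ"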

On Tue, 2008-02-26 at 13:22 +0000, Simon Marlow wrote:
So some alternatives that fix this are
1. all text I/O is in the locale encoding (what C and Hugs do)
2. stdin/stdout/stderr and terminals are always in the locale encoding, everything else is UTF-8
I was initially confused about how this one was different from what I first proposed. The difference is that I was suggesting stdin/stdout/stderr be in the locale *only* if they're connected to a terminal, rather than always.
3. everything is UTF-8
Personally I'm not really fussed about which compromise we pick. I think the more important point is that all the Haskell implementations pick the same compromise so that we can effectively standardise the behaviour. Duncan

On Tue, Feb 26, 2008 at 01:34:54PM +0000, Duncan Coutts wrote:
Personally I'm not really fussed about which compromise we pick. I think the more important point is that all the Haskell implementations pick the same compromise so that we can effectively standardise the behaviour.
Wait, are you talking about changing what ghc does or trying to change the Haskell standard? I always thought ghc should do something more sane with character IO; non-Unicode-aware programs are a blight.

I don't think choosing something arbitrary to standardize on is a good idea. It is not always clear what the best choice is. For instance, until recently jhc used the locale encoding on Linux, due to glibc's strong charset support and guaranteed use of Unicode wchar_t's, but always UTF-8 on BSD variants, where the wchar_t situation was less clear cut. On embedded systems, only supporting ASCII IO is certainly a valid choice. For a .NET backend, we will want to use .NET's native character IO routines.

The important thing is standardizing how _binary_ handles work across compilers. As long as everyone has a compatible openBinaryFile then we can layer whatever we want on it with compatible libraries.

I think the current behavior of GHC is poor and should be fixed. I believe the intent of the Haskell 98 standard is that character IO be performed in a suitable system-specific way, which always truncating to 8 bits does not meet, IMHO. But there is no need to prescribe something arbitrary language-wide for a particular issue with ghc.

John -- John Meacham - ⑆repetae.net⑆john⑈
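A hedged sketch of the layering John describes: open the file in binary mode with the existing openBinaryFile and do the encoding entirely in library code. encodeCharUTF8 and writeFileUTF8 are illustrative names, not an existing API.

    import System.IO (IOMode(WriteMode), openBinaryFile, hPutStr, hClose)
    import Data.Char (ord, chr)
    import Data.Bits (shiftR, (.&.), (.|.))

    -- Encode one Char as its UTF-8 byte sequence, each byte carried in a
    -- Char so it can be written through a binary-mode Handle unchanged.
    encodeCharUTF8 :: Char -> String
    encodeCharUTF8 c
      | n < 0x80    = [chr n]
      | n < 0x800   = [chr (0xC0 .|. shiftR n 6), cont n]
      | n < 0x10000 = [chr (0xE0 .|. shiftR n 12), cont (shiftR n 6), cont n]
      | otherwise   = [chr (0xF0 .|. shiftR n 18), cont (shiftR n 12),
                       cont (shiftR n 6), cont n]
      where
        n      = ord c
        cont m = chr (0x80 .|. (m .&. 0x3F))

    -- Write a String as UTF-8 on top of a binary handle: the encoding is
    -- layered in library code, independent of any text-handle default.
    writeFileUTF8 :: FilePath -> String -> IO ()
    writeFileUTF8 path s = do
      h <- openBinaryFile path WriteMode
      hPutStr h (concatMap encodeCharUTF8 s)
      hClose h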

On Tue, 2008-02-26 at 07:28 -0800, John Meacham wrote:
On Tue, Feb 26, 2008 at 01:34:54PM +0000, Duncan Coutts wrote:
Personally I'm not really fussed about which compromise we pick. I think the more important point is that all the Haskell implementations pick the same compromise so that we can effectively standardise the behaviour.
Wait, are you talking about changing what ghc does or trying to change the Haskell standard? I always thought ghc should do something more sane with character IO; non-Unicode-aware programs are a blight.
I don't think choosing something arbitrary to standardize on is a good idea. It is not always clear what the best choice is. For instance, until recently jhc used the locale encoding on Linux, due to glibc's strong charset support and guaranteed use of Unicode wchar_t's, but always UTF-8 on BSD variants, where the wchar_t situation was less clear cut. On embedded systems, only supporting ASCII IO is certainly a valid choice. For a .NET backend, we will want to use .NET's native character IO routines.
Oh I wasn't trying to pin it down that much. If you want to use EBCDIC on some embedded platform by default, I don't care. I really mean that it'd be nice if hugs, ghc, jhc, nhc98 etc could agree for each of the major platforms: Linux/Unix, OS X and Windows. And I don't mean necessarily that they should do the same thing across platforms (e.g. as I understand it OS X would always use UTF-8, not a variable locale), just that they should do the same on the same platform. So not a change of the H98 spec, just a common consensus on the major platforms. Duncan

I really mean that it'd be nice if hugs, ghc, jhc, nhc98 etc could agree for each of the major platforms: Linux/Unix, OS X and Windows. And I don't mean necessarily that they should do the same thing across platforms (e.g. as I understand it OS X would always use UTF-8, not a variable locale), just that they should do the same on the same platform.
That's exactly what I (an employee developing commercially used Haskell applications) would like to see. Java does the same thing and it always works as expected, and that's always best. Java has a platform default encoding which is not fixed (on Linux it depends on the current locale as set by the LC_CTYPE, LC_ALL or LANG environment variables) but is determined in a way consistent with the platform. The platform default encoding is only used if no other encoding is explicitly given. In general, when considering industrial adoption it's probably always a good idea to have a look at Java. We've never had (real) problems with Java programs, but lots of problems with Python, Haskell and OCaml.

If I write a simple program just printing a non-ASCII string to the terminal or to a file I'd expect that I can read it on the screen or using my favorite text editor without having to change anything -- neither in my terminal nor in my program. When I run the program on my platform I don't mind if somebody else might get differently encoded output from the same program as long as I get what I expect. If I as a programmer really want to make sure that everybody gets the same output I can make sure a specific encoding is used.

Cheers, David -- David Leuschner Meisenweg 7 79211 Denzlingen Tel. 07666/912466

David Leuschner wrote:
If I write a simple program just printing a non-ASCII string to the terminal or to a file I'd expect that I can read it on the screen or using my favorite text editor without having to change anything -- neither in my terminal nor in my program. When I run the program on my platform I don't mind if somebody else might get differently encoded output from the same program as long as I get what I expect.
I think this is sensible advice. Furthermore I want to be able to choose an encoding (e.g. by setting my environment), even for file/socket IO, instead of being locked into UTF-8. Thus, for Linux, I'd say always use the environment encoding. More advanced stuff like sending in a certain encoding e.g. over the network or to a foreign file system should not use H98 IO but newer libraries that allow the encoding to be set programmatically, or encode yourself and read/write Word8. Cheers Ben

Simon Marlow wrote:
Duncan Coutts wrote:
Let's call this one proposal 0:
* Haskell98 file IO should always use UTF-8. * Haskell98 IO to terminals should use the current locale encoding.
and the others:
1. all text I/O is in the locale encoding (what C and Hugs do)
2. stdin/stdout/stderr and terminals are always in the locale encoding, everything else is UTF-8
3. everything is UTF-8
Some other points that came up on IRC:
- there's a long precedent for behaving differently when connected to
a terminal. For example, 'ls' formats output in columns when
connected to a terminal, or displays output in colour. This is
a point in favour of (0).
- we might expect that "prog file" behaves the same as "prog < file"; under (2) they would use different encodings, which is a point against (2).

Hello Simon, Tuesday, February 26, 2008, 5:18:17 PM, you wrote:
Let's call this one proposal 0: and the others:
My program now uses 4 different encodings with manual re-encoding between them. It seems that any proposal will break it - now it assumes that the getChar function just reads an 8-bit value and dresses it into a Char.

I don't care about other Haskell implementations, but if they need to be unified, it may be better to compare the amount of code written with GHC in mind against that written for all other implementations, and force all the other *hc to use the same (broken) semantics as ghc.

Really, I wonder. AFAIU, the phrase "avoid success at any cost" has the same double meaning in English as in Russian, and now, when Haskell has a real chance to become widely adopted by industry, it seems that we are trying to avoid it... at any cost.

-- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

Bulat Ziganshin wrote:
Hello Simon,
Tuesday, February 26, 2008, 5:18:17 PM, you wrote:
Let's call this one proposal 0: and the others:
My program now uses 4 different encodings with manual re-encoding between them. It seems that any proposal will break it - now it assumes that the getChar function just reads an 8-bit value and dresses it into a Char.
Use openBinaryFile (or hSetBinaryMode) and your program will work with all compilers, both before and after this change.
Really, I wonder. AFAIU, the phrase "avoid success at any cost" has the same double meaning in English as in Russian, and now, when Haskell has a real chance to become widely adopted by industry, it seems that we are trying to avoid it... at any cost.
I don't get it - aren't we improving things by providing support for non-Latin-1 encodings and unifying the compilers? Cheers, Simon
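A minimal sketch of the openBinaryFile / hSetBinaryMode workaround Simon suggests above; readFileBytes and useStdinAsBytes are illustrative names.

    import System.IO (IOMode(ReadMode), openBinaryFile, hSetBinaryMode,
                      hGetContents, stdin)

    -- Open a file so that reads return raw bytes (0-255) dressed as Chars,
    -- on every implementation, before or after the proposed change.
    readFileBytes :: FilePath -> IO String
    readFileBytes path = openBinaryFile path ReadMode >>= hGetContents

    -- The same for an already-open handle such as stdin.
    useStdinAsBytes :: IO String
    useStdinAsBytes = hSetBinaryMode stdin True >> hGetContents stdin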

Hello Simon, Tuesday, February 26, 2008, 7:34:50 PM, you wrote:
Use openBinaryFile (or hSetBinaryMode) and your program will work with all compilers, both before and after this change.
That is what I mean - we have just found one more way to break compatibility with existing Haskell code. It is an important part of our efforts to keep the Haskell community as small as possible; ideally it should include just the authors of the Haskell tools themselves.
Really, I wonder. AFAIU, the phrase "avoid success at any cost" has the same double meaning in English as in Russian, and now, when Haskell has a real chance to become widely adopted by industry, it seems that we are trying to avoid it... at any cost.
I don't get it - aren't we improving things by providing support for non-Latin-1 encodings and unifying the compilers?
This will improve things only for Haskell novices; real men (including yourself) have already adopted various ways to live with it. I again highlight that if we want to put GHC into industrial use, it should keep compatibility with old versions. -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

On Tue, Feb 26, 2008 at 6:51 PM, Bulat Ziganshin
Hello Simon,
Tuesday, February 26, 2008, 7:34:50 PM, you wrote:
Use openBinaryFile (or hSetBinaryMode) and your program will work with all compilers, both before and after this change.
That is what I mean - we have just found one more way to break compatibility with existing Haskell code. It is an important part of our efforts to keep the Haskell community as small as possible; ideally it should include just the authors of the Haskell tools themselves.
Python 3000 is backwards incompatible with many current Python programs. I don't expect lots of Pythonistas to jump ship because of it. In the process of thinking about what should go into Python 3000 they recognized that breaking backwards compatibility was necessary to fix some problems (e.g. the same Unicode problems as we have in the Haskell community). Agreed, you don't want to continuously break people's programs, but I think it's OK to sometimes do it if it's really needed for the long-term usefulness of the language. We should look at the Python community and try to provide as smooth an upgrade path as possible. One possibility is to keep old compilers around for a time. -- Johan

[dropped g-h-u as this is really a libraries discussion, and cross-posted threads are a pain] On Tue, Feb 26, 2008 at 02:18:17PM +0000, Simon Marlow wrote:
Simon Marlow wrote:
Duncan Coutts wrote:
Let's call this one proposal 0:
* Haskell98 file IO should always use UTF-8. * Haskell98 IO to terminals should use the current locale encoding.
1. all text I/O is in the locale encoding (what C and Hugs do)
2. stdin/stdout/stderr and terminals are always in the locale encoding, everything else is UTF-8
3. everything is UTF-8
3 is my favourite. It means that if I run ./foo and I want to see exactly what the output was, I can run ./foo | hexdump -C and get something consistent with the first run.

0 breaks the above.

1 means that if you and I both generate a file we don't necessarily get the same file.

2 I don't like for your "prog in" vs "prog < in" reason, and likewise "prog -o out" vs "prog > out".

I think it's important that we have some way of sending/getting binary stuff to/from std*, though (and we need to make sure it plays nicely with buffering). Thanks Ian

Hello Ian, Tuesday, February 26, 2008, 8:16:57 PM, you wrote:
3. everything is UTF-8
3 is my favourite. It means that if I run
Just to let you know - the default encoding for Windows console windows is the OEM code page. To make things funnier, getArgs returns strings in the ANSI code page and openFile uses ANSI strings too. -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

On Tue, 2008-02-26 at 14:18 +0000, Simon Marlow wrote:
Simon Marlow wrote:
Duncan Coutts wrote:
Let's call this one proposal 0:
* Haskell98 file IO should always use UTF-8. * Haskell98 IO to terminals should use the current locale encoding.
and the others:
1. all text I/O is in the locale encoding (what C and Hugs do)
2. stdin/stdout/stderr and terminals are always in the locale encoding, everything else is UTF-8
3. everything is UTF-8
So it's clear that all these solutions have some downsides. We have to decide what is more important. Let me try and summarise: basically we can be consistent with the OS environment, or consistent with other Haskell systems in other environments, or try to get some mixture of the two. It is pretty clear however that trying to get a mixture still leads to some inconsistency with the OS environment.

* "status quo" (what ghc/hugs do now)
This gives consistency with the OS environment with hugs and jhc but not ghc, nhc or yhc. It gives consistency between Haskell programs (using the same Haskell implementation) on different platforms for ghc and nhc but not for hugs or jhc. There is no consistency between Haskell implementations.

* "always locale" (solution 1 above)
This gives us consistency with the OS environment. All of the shell snippets people have posted work with this. The main disadvantage is that files moved between systems may be interpreted differently.

* "always utf8" (solution 3 above)
This gives consistency between Haskell programs across platforms. The main disadvantage is that it is very unhelpful if the locale is not UTF-8. It fails the "putStr" test of printing string literals to the terminal.

* "mixture A" (solution 0 above)
The input/output format changes depending on the device. prog | cat prints junk in non-UTF-8 locales.

* "mixture B" (solution 2 above)
The output format changes depending on the device. prog in behaves differently to prog < in.

And some examples people have noted:

* putStr "αβγδεζηθικλ"
That is just printing a string literal to the console/terminal. Now that major implementations support Unicode .hs source files it's kind of nice if this works. This works with "always locale", "mixture A" and "mixture B" above. It fails for "status quo" with ghc (but works with hugs) and fails for "always utf8" unless the locale happens to be UTF-8.

* ./prog vs ./prog | cat
That is, piping the output of a Haskell program through cat and printing the result to a terminal produces the same output as displaying the program output directly. This works with "always locale" and "mixture B" and fails with "mixture A". With "always utf8" and with "status quo" it has the property that it consistently produces the same junk on the terminal, which some people see as a bonus (when not in a UTF-8 or Latin-1 locale respectively).

* ./prog vs ./prog >file; cat file
This is another variation on the above and it has the same failures.

* ./prog in vs ./prog < in
That is, reading a file given as a command line arg via readFile gives the same result as reading stdin that has been redirected from the same file. This works with "always locale" and "mixture A" and fails with "mixture B". This is the dual of the previous two examples. It fails with "always utf8" and with "status quo" when the file was produced by another text processing program from the same environment (e.g. a generic text editor).

* ./foo vs ./foo | hexdump -C
The output bytes we get sent to the terminal are exactly the same as what we see when piping to a program that examines those bytes. This fails for "mixture A" and works for all the others. It works in the strict sense that the bytes are the same, not in the sense that the text output is readable.

So the problem with the mixture approaches is that the terminal and files and pipes are all really interchangeable, so we can find surprising inconsistencies within the same OS environment. The problem with "always utf8" is that it's never right unless the locale is set to UTF-8.
As a data point, Java and Python use "always locale" as the default if you don't specify an encoding when opening a text stream. I think personally I'm coming round to the "always locale" point of view. We already have no cross-platform consistency for text files because of the lf vs cr/lf issue and we have no cross-implementation consistency. Duncan

On Wed, Feb 27, 2008 at 12:06:59AM +0000, Duncan Coutts wrote:
I think personally I'm coming round to the "always locale" point of view. We already have no cross-platform consistency for text files because of the lf vs cr/lf issue and we have no cross-implementation consistency.
I'm leaning in the same direction. -- David Roundy Department of Physics Oregon State University

On Tue, Feb 26, 2008 at 07:30:08PM -0500, David Roundy wrote:
On Wed, Feb 27, 2008 at 12:06:59AM +0000, Duncan Coutts wrote:
I think personally I'm coming round to the "always locale" point of view. We already have no cross-platform consistency for text files because of the lf vs cr/lf issue and we have no cross-implementation consistency.
I'm leaning in the same direction.
This has swung me too. Thanks Ian

Let me try and summarise:
Thanks for the great summary! And thanks to Emacs' table mode, here are the results displayed as a table:

+--------------------------------+-----+--------+------+-------+-------+
|                                | now | locale | utf8 | mix-A | mix-B |
+--------------------------------+-----+--------+------+-------+-------+
| putStrLn "..."                 | -   | ok     | -    | ok    | ok    |
+--------------------------------+-----+--------+------+-------+-------+
| ./prog vs ./prog | cat         | ok  | ok     | ok   | -     | ok    |
+--------------------------------+-----+--------+------+-------+-------+
| ./prog in vs ./prog < in       | -   | ok     | -    | ok    | -     |
+--------------------------------+-----+--------+------+-------+-------+
| ./prog vs ./prog | hexdump -C  | ok  | ok     | ok   | -     | ok    |
+--------------------------------+-----+--------+------+-------+-------+

The mixtures are good ideas but can give inconsistent and surprising results (especially when debugging encoding issues). And if our CEO had known that ... putStrLn <his-name> ... doesn't work, he'd probably have ruled out Haskell right from the start. Even "utf8" gives surprising results: I'd be very surprised if my Mac-written Haskell program outputs junk on Windows or Linux even if the byte sequence is exactly the same UTF-8 text.

Personally I think consistency on a single platform is more important than trying to achieve cross-platform consistency which involves a lot more than just encoding. If you've reached that point with your program you're probably anyway using "advanced functions" to exactly specify what will be output. Following "the principle of least surprise" is also a good idea.

Cheers, David -- David Leuschner Meisenweg 7 79211 Denzlingen Tel. 07666/912466

Small correction: I think the "./prog in vs ./prog < in" row under "utf8" should be "ok". (And I thought this thread had been switched to Glasgow-haskell-users@haskell.org.) David Leuschner wrote:
Let me try and summarise:
Thanks for the great summary! And thanks to Emacs' table mode here're the results displayed as a table:
+--------------------------------+-----+--------+------+-------+-------+
|                                | now | locale | utf8 | mix-A | mix-B |
+--------------------------------+-----+--------+------+-------+-------+
| putStrLn "..."                 | -   | ok     | -    | ok    | ok    |
+--------------------------------+-----+--------+------+-------+-------+
| ./prog vs ./prog | cat         | ok  | ok     | ok   | -     | ok    |
+--------------------------------+-----+--------+------+-------+-------+
| ./prog in vs ./prog < in       | -   | ok     | -    | ok    | -     |
+--------------------------------+-----+--------+------+-------+-------+
| ./prog vs ./prog | hexdump -C  | ok  | ok     | ok   | -     | ok    |
+--------------------------------+-----+--------+------+-------+-------+
The mixtures are good ideas but can give inconsistent and surprising results (especially when debugging encoding issues). And if our CEO had known that ... putStrLn <his-name> ... doesn't work, he'd probably have ruled out Haskell right from the start. Even "utf8" gives surprising results: I'd be very surprised if my Mac-written Haskell program outputs junk on Windows or Linux even if the byte sequence is exactly the same UTF-8 text.
Personally I think consistency on a single platform is more important than trying to achieve cross-platform consistency which involves a lot more than just encoding. If you've reached that point with your program you're probably anyway using "advanced functions" to exactly specify what will be output. Following "the principle of least surprise" is also a good idea.
Cheers,
David

On Wed, Feb 27, 2008 at 1:06 AM, Duncan Coutts
As a data point, Java and Python use "always locale" as the default if you don't specify an encoding when opening a text stream.
I think personally I'm coming round to the "always locale" point of view. We already have no cross-platform consistency for text files because of the lf vs cr/lf issue and we have no cross-implementation consistency.
I think following Java and Python in this matter is a good idea and leads to fewer surprises for developers. If you want files created on one machine to work on another you have to be explicit about encoding. -- Johan

Duncan Coutts
Let me try and summarise:
basically we can be consistent with the OS environment or consistent with other Haskell systems in other environments or try to get some mixture of the two. It is pretty clear however that trying to get a mixture still leads to some inconsistency with the OS environment.
I would vote for "always locale". If one starts up an editor, types some text and saves it, it is probably in the locale's encoding. A user will be surprised if a Haskell program fails to read the resulting file. Also, being consistent with C and Java means both users and developers are likely to be familiar with the behavior. Regards, Takano Akio

Duncan Coutts wrote:
So here is a concrete proposal:
* Haskell98 file IO should always use UTF-8. * Haskell98 IO to terminals should use the current locale encoding.
Personally, I'd find this deeply surprising. I don't care that much what locale gets used for I/O (if it matters, you have to deal with it explicitly anyway) as long as it is consistent. I'm probably mistaken, but doesn't this proposal mean that I can't implement cat in H98 using text I/O? That would be a bit disturbing. Also, would this affect the encoding used for file names? If so, how? Roman

Ross Paterson wrote:
On Wed, Feb 27, 2008 at 12:31:43AM +1100, Roman Leshchinskiy wrote:
I'm probably mistaken, but doesn't this proposal mean that I can't implement cat in H98 using text I/O? That would be a bit disturbing.
That is already the case, at least for binary files.
Ok, cat is a bad example. Any kind of simple text processing, really, that assumes that reading a character from stdin and writing it to stdout is really just copying. Roman

On Wed, 2008-02-27 at 00:50 +1100, Roman Leshchinskiy wrote:
Ross Paterson wrote:
On Wed, Feb 27, 2008 at 12:31:43AM +1100, Roman Leshchinskiy wrote:
I'm probably mistaken, but doesn't this proposal mean that I can't implement cat in H98 using text I/O? That would be a bit disturbing.
That is already the case, at least for binary files.
Ok, cat is a bad example. Any kind of simple text processing, really, that assumes that reading a character from stdin and writing it to stdout is really just copying.
Well it would copy valid text files, possibly with some normalisation in encoding. Duncan

On Wed, 2008-02-27 at 00:31 +1100, Roman Leshchinskiy wrote:
Duncan Coutts wrote:
So here is a concrete proposal:
* Haskell98 file IO should always use UTF-8. * Haskell98 IO to terminals should use the current locale encoding.
Personally, I'd find this deeply surprising. I don't care that much what locale gets used for I/O (if it matters, you have to deal with it explicitly anyway) as long as it is consistent. I'm probably mistaken, but doesn't this proposal mean that I can't implement cat in H98 using text I/O? That would be a bit disturbing.
You've never been able to do that with the guarantees provided by H98. The current base lib provides System.IO.openBinaryFile which does make it possible to implement cat on binary files.
Also, would this affect the encoding used for file names? If so, how?
No, that's a separate issue. Duncan

Duncan Coutts wrote:
On Wed, 2008-02-27 at 00:31 +1100, Roman Leshchinskiy wrote:
Duncan Coutts wrote:
So here is a concrete proposal:
* Haskell98 file IO should always use UTF-8. * Haskell98 IO to terminals should use the current locale encoding.

Personally, I'd find this deeply surprising. I don't care that much what locale gets used for I/O (if it matters, you have to deal with it explicitly anyway) as long as it is consistent. I'm probably mistaken, but doesn't this proposal mean that I can't implement cat in H98 using text I/O? That would be a bit disturbing.
You've never been able to do that with the guarantees provided by H98.
As a matter of fact, 21.10.2 from the Haskell Report suggests that at least copying text files should be possible. Unless I'm mistaken, your proposal would invalidate that example somewhat. This raises another question: what exactly does "current locale" mean, given that we have lazy I/O and the locale can be changed on the fly?
Also, would this affect the encoding used for file names? If so, how?
No, that's a separate issue.
Hmm, so how do I reliably read a list of file names from a file? Roman

Roman Leshchinskiy wrote:
Duncan Coutts wrote:
On Wed, 2008-02-27 at 00:31 +1100, Roman Leshchinskiy wrote:
Also, would this affect the encoding used for file names? If so, how?
No, that's a separate issue.
Hmm, so how do I reliably read a list of file names from a file?
You didn't say what format the file takes, so there are a couple of options. If you get to choose the format, then using read/show is easiest. If you're stuck with a predefined format, say one filename per line, then it depends what system you're on: - on Windows, filenames are Unicode, so the file must be in some encoding: decode it appropriately. - on Unix, filenames are binary, so use openBinaryFile and hGetLine. Yes, this is all broken (in particular FilePath == [Char] is wrong), but at least it's possible to do what you want, and it's not getting any worse with the proposed change. Filenames are something else that need an overhaul, but one thing at a time. Cheers, Simon
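A minimal sketch of the Unix case Simon describes, reading one filename per line through a binary-mode handle so that no text decoding is applied; readFileNames is an illustrative name.

    import System.IO (IOMode(ReadMode), openBinaryFile, hGetLine, hIsEOF, hClose)

    -- On Unix a filename is just bytes; a binary-mode Handle hands those
    -- bytes back unchanged (as Chars in the 0-255 range).
    readFileNames :: FilePath -> IO [FilePath]
    readFileNames path = do
      h <- openBinaryFile path ReadMode
      let go acc = do
            eof <- hIsEOF h
            if eof
              then hClose h >> return (reverse acc)
              else do
                l <- hGetLine h
                go (l : acc)
      go []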

On Wed, 2008-02-27 at 01:14 +1100, Roman Leshchinskiy wrote:
Duncan Coutts wrote:
On Wed, 2008-02-27 at 00:31 +1100, Roman Leshchinskiy wrote:
I'm probably mistaken, but doesn't this proposal mean that I can't implement cat in H98 using text I/O? That would be a bit disturbing.
You've never been able to do that with the guarantees provided by H98.
As a matter of fact, 21.10.2 from the Haskell Report suggests that at least copying text files should be possible. Unless I'm mistaken, your proposal would invalidate that example somewhat.
This begs another question. What exactly does "current locale" mean, given that we have lazy I/O and the locale can be changed on the fly?
The current locale is a Posix concept. There are posix functions for changing it. I'd suggest that a Handle inherits the current locale as its encoding at the point of creation of the Handle. Further changes to the posix locale would not change any existing open Handles. If we were to provide an action to change the encoding of an open Handle then it is clear that it cannot act on semi-closed handles. That'd make lazy IO ok.

The H98 spec has the inside half of the story nailed down: Char is Unicode, and Handles are text I/O that deal in [Char]. The outside half of the story is the binary encoding of the [Char], which was unspecified and left to the implementation.

The implementation dependence allows GHC to create a "setHandleEncoding" (or "withHandleEncoding") operation. [I do not want to get bogged down in syntax]. This is something that, like all details of encoding, is not in the H98 spec. In addition, there may be some command line parameters to GHC.

Imagine that GHC 6.10.1 is released with encoding support. If the user runs ghc with no options or setup changes, then the new defaults will apply. The goal is that more complicated situations are reflected in more complicated "ghc" or "main" invocations. The least complicated usage defaults to being identical cross-platform and regardless of terminal I/O. I think the best default would be UTF-8 for all text handles. This can be easily documented, it can be easily understood, and it will produce the fewest surprises.

I imagine that in this proposed ghc-6.10.1:

* GHC's handles now carry an encoding parameter.
** There is a way to create a new handle from an old one that differs only in the encoding (perhaps 'hNew <- cloneHandleWithEncoding "Latin1" hOld').
* GHC has mutable global variables that control the encoding parameter of new handles.
** Unless influenced by command-line switches, these default to UTF-8.
** There are IO commands to read & write these global variables.
** There are different defaults for new terminal I/O handles and other I/O handles, so they could be given different encodings.

If you want to use the "local" or native encoding, then compile with "ghc --local-encoding" or start the program with something like "main = handlesUseLocalEncoding >> do ...". If you want to use "Latin1" then use either "ghc --encoding Latin1" or "main = handlesUseEncoding "Latin1" >> do ...".

To compile older programs one could use "ghc --compat 6.8" or "ghc --encoding Latin1" to access the old defaults. One might even add "+RTS --encoding Latin1 -RTS" runtime options to set the initial encoding, though I think this is unlikely to be useful in practice.

I think that having terminal I/O be special is great for command line applications. But the nice behaviour of applications like "ls" must not determine what the GHC runtime does by default.
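A hedged sketch of the mutable-default idea in the message above, using a global IORef to stand in for the proposed machinery. Encoding, defaultEncoding, handlesUseEncoding and handlesUseLocalEncoding are illustrative stand-ins for the proposed (not existing) API; a real version would live inside the IO library rather than in user code, and the message passes the encoding as a string rather than a data type.

    import Data.IORef (IORef, newIORef, writeIORef)
    import System.IO.Unsafe (unsafePerformIO)

    -- Illustrative encoding tag; not an existing type.
    data Encoding = UTF8 | Latin1 | LocalEncoding
      deriving Show

    -- The proposed mutable global that new Handles would consult at
    -- creation time (sketch only).
    {-# NOINLINE defaultEncoding #-}
    defaultEncoding :: IORef Encoding
    defaultEncoding = unsafePerformIO (newIORef UTF8)

    -- Stand-ins for the proposed handlesUseEncoding / handlesUseLocalEncoding.
    handlesUseEncoding :: Encoding -> IO ()
    handlesUseEncoding = writeIORef defaultEncoding

    handlesUseLocalEncoding :: IO ()
    handlesUseLocalEncoding = writeIORef defaultEncoding LocalEncoding

    -- A program wanting the old GHC behaviour would then start with:
    main :: IO ()
    main = do
      handlesUseEncoding Latin1
      putStrLn "new handles would now default to Latin-1 (sketch only)"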
participants (14)
- Ben Franksen
- Bulat Ziganshin
- Chris Kuklewicz
- David Leuschner
- David Roundy
- Duncan Coutts
- Ian Lynagh
- Johan Tibell
- John Meacham
- Jules Bean
- Roman Leshchinskiy
- Ross Paterson
- Simon Marlow
- Takano Akio