RE: [Haskell-cafe] Hugs vs GHC (again) was: Re: Some random newbie questions

Here's a summary of the state of Unicode support in GHC and other compilers. There are several aspects:
- Can the Char type hold the full range of Unicode characters? This has been true in GHC for some time, and is now true in Hugs. I don't think it's true in nhc98 (please correct me if I'm wrong).
- Do the character class functions (isUpper, isAlpha etc.) work correctly on the full range of Unicode characters? This is true in Hugs. It's true with GHC on some systems (basically we were lazy and used the underlying C library's support here, which is patchy).
- Can you use (some encoding of) Unicode for your Haskell source files? I don't think this is true in any Haskell compiler right now.
- Can you do String I/O in some encoding of Unicode? No Haskell compiler has support for this yet, and there are design decisions to be made. Some progress has been made on an experimental prototype (see recent discussion on this list).
- What about Unicode FilePaths? This was discussed a few months ago on the haskell(-cafe) list, no support yet in any compiler.
Cheers, Simon

On 07 January 2005 00:52, Dimitry Golubovsky wrote:
Hi,
Looks like Hugs and GHC are being compared again ;)
I am just interested to know what the current status of Unicode support in GHC is. Hugs has had it for about a year (or more, in CVS), at least at the level of recognizing character categories and simple case conversions based on the Unicode database files. Also UTF-8 or locale-based I/O encoding conversion to internal Unicode is available. Does GHC have similar support?
Some time ago (about 1.5 years) I tried to play with Unicode I/O in GHC, and it looked like it did not have much Unicode support back then (at least at the I/O level). Has anything progressed in this regard since then?
Most of this list's subscribers seem to be GHC users, so can anybody answer?
BTW, when answering the original post (brief quote below), different aspects were mentioned, but not internationalization. Is it really not that important?
Dimitry Golubovsky Middletown, CT
Benjamin Pierce wrote:
* What are the relative advantages of Hugs and GHC, beyond the obvious (Hugs is smaller and easier for people not named Simon to modify, while GHC is a real compiler and has the most up-to-date hacks to the type checker)? Do people generally use one or the other for everything, or are they similar enough to use Hugs at some moments and GHC at others?

Simon Marlow wrote:
Here's a summary of the state of Unicode support in GHC and other compilers. There are several aspects:
- Can the Char type hold the full range of Unicode characters? This has been true in GHC for some time, and is now true in Hugs. I don't think it's true in nhc98 (please correct me if I'm wrong).
- Do the character class functions (isUpper, isAlpha etc.) work correctly on the full range of Unicode characters? This is true in Hugs. It's true with GHC on some systems (basically we were lazy and used the underlying C library's support here, which is patchy).
- Can you use (some encoding of) Unicode for your Haskell source files? I don't think this is true in any Haskell compiler right now.
Well, even if hbc is mostly dead I must point out that it has supported this since Unicode was first added to Haskell. As well as the point above, of course. If the GHC implementors feel lazy they can always borrow the Unicode (plane 0) description table from HBC. It is a 64k file. -- Lennart

Hi,
Lennart Augustsson wrote:
Simon Marlow wrote:
Here's a summary of the state of Unicode support in GHC and other compilers. There are several aspects:
- Can the Char type hold the full range of Unicode characters? This has been true in GHC for some time, and is now true in Hugs. I don't think it's true in nhc98 (please correct me if I'm wrong).
I remember it was in GHC. But any attempt to output Unicode characters using standard I/O functions always ended up outputting only the low 8 bits. Has anything changed since then?
- Do the character class functions (isUpper, isAlpha etc.) work correctly on the full range of Unicode characters? This is true in Hugs. It's true with GHC on some systems (basically we were lazy and used the underlying C library's support here, which is patchy).
Which basically means that anyone on an older or under-configured system, where they lack the permissions or technical means to configure the C library's locales properly, is out of luck...
- Can you use (some encoding of) Unicode for your Haskell source files? I don't think this is true in any Haskell compiler right now.
Well, Hugs from CVS accepts source code in UTF-8 (I am not sure about locale-based conversion) - at least on my computer. Another thing: string literals may be in UTF-8 encoding, but Hugs would not accept function/type identifiers in Unicode (i.e. one could not name a type or a function in Russian, for instance - their names must be ASCII). I put an example of such a file in UTF-8 on my web-server: http://www.golubovsky.org/software/hugs-patch/testutf.hs
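For illustration, a file of that kind might look roughly like this (hypothetical contents; the real testutf.hs at the URL above may differ, and whether it loads depends on the Hugs CVS patch described here):

    module Main where

    -- The string literal below contains non-ASCII (Cyrillic) characters,
    -- while the identifiers themselves stay ASCII, matching the
    -- limitation described above.
    privet :: String
    privet = "Привет, мир!"

    main :: IO ()
    main = putStrLn privet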
Well, even if hbc is mostly dead I must point out that it has supported this since Unicode was first added to Haskell. As well as the point above, of course. If the GHC implementors feel lazy they can always borrow the Unicode (plane 0) description table from HBC. It is a 64k file.
Or, in Hugs there is a shell script (awk really, just wrapped in a shell script) which parses the Unicode data file and produces a C file (also about 64k), plus a compact set of primitive functions independent of the C library - src/unix/mkunitable and part of src/char.c in the Hugs source tree, respectively. The reason I asked this question was: I am trying to understand where internationalization of Haskell compilers sits on their developers' list of priorities, and also how high the demand from users is for at least basic internationalization. Dimitry Golubovsky Middletown, CT
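As a rough sketch of what such a generator has to do (here in Haskell rather than awk; the function names are made up, and the special <...First>/<...Last> range entries in UnicodeData.txt are ignored):

    import Data.Char (chr)
    import Numeric   (readHex)

    -- UnicodeData.txt is semicolon-separated: field 0 is the code point
    -- in hex, field 2 the general category (Lu, Ll, Nd, ...).
    parseLine :: String -> Maybe (Char, String)
    parseLine l =
      case splitOn ';' l of
        (cp : _name : cat : _) ->
          case readHex cp of
            [(n, "")] -> Just (chr n, cat)
            _         -> Nothing
        _ -> Nothing

    -- Split a string on a separator character.
    splitOn :: Char -> String -> [String]
    splitOn sep = foldr step [[]]
      where
        step c (cur : rest)
          | c == sep  = [] : cur : rest
          | otherwise = (c : cur) : rest
        step _ []     = [[]]

A generator would then group the (Char, category) pairs into contiguous ranges and print them as a C array.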

"Simon Marlow"
Here's a summary of the state of Unicode support in GHC and other compilers. There are several aspects:
- Can the Char type hold the full range of Unicode characters? This has been true in GHC for some time, and is now true in Hugs. I don't think it's true in nhc98 (please correct me if I'm wrong).
You're wrong :-). nhc98 has always had 32-bit characters internally.
- Do the character class functions (isUpper, isAlpha etc.) work correctly on the full range of Unicode characters? This is true in Hugs. It's true with GHC on some systems (basically we were lazy and used the underlying C library's support here, which is patchy).
In nhc98, currently the character class functions work only on the 8-bit Latin-1 range.
- Can you use (some encoding of) Unicode for your Haskell source files? I don't think this is true in any Haskell compiler right now.
Many years ago, hbc claimed to be the only compiler with support for this.
- Can you do String I/O in some encoding of Unicode? No Haskell compiler has support for this yet, and there are design decisions to be made. Some progress has been made on an experimental prototype (see recent discussion on this list).
Apparently some Haskell/XML toolkits already do I/O conversions in a selection of the encodings permitted by the XML standard, namely ASCII, Latin-1, UTF-8, and UTF-16 (either byte ordering), but not yet UCS-4 (four possible byte orderings), or EBCDIC. See for example: http://www.ninebynine.org/Software/HaskellUtils/HaXml-1.12/src/Text/XML/HaXm...
- What about Unicode FilePaths? This was discussed a few months ago on the haskell(-cafe) list, no support yet in any compiler.
Indeed, AFAIK. Regards, Malcolm

On 2005-01-07, Simon Marlow wrote:
- Can you use (some encoding of) Unicode for your Haskell source files? I don't think this is true in any Haskell compiler right now.
I assume this won't be done until the next one is done...
- Can you do String I/O in some encoding of Unicode? No Haskell compiler has support for this yet, and there are design decisions to be made. Some progress has been made on an experimental prototype (see recent discussion on this list).
Many of the easy ways to do this that I've heard proposed make the current hacks for binary IO fail. IMHO, we really, really need a standard, supported way to do binary IO. If I can read in and output octets, then I can implement Unicode handling on top of that. In fact it would let a bunch of the proposed ideas for Unicode support be implemented in pure Haskell and have the API details hashed out and polished. For Unix, there are a couple of different tacks one could take. The locale system is standard, and does work, but is ugly and a pain to work with. In particular, it's another (set of) global variables. And what do you do with a character not expressible in the current locale? I'd like the possibility of different character sets for different files, for example. I suppose I wouldn't be too upset at using the locale information, but defaulting to UTF-8 rather than ASCII when no character set information is set. For win32, I really don't know the options.
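To make the layering point concrete, here is a minimal sketch of a UTF-8 decoder written over plain octets (not an API of any existing compiler or library; it substitutes U+FFFD for malformed input and does not reject overlong forms or surrogates, as a serious decoder would):

    import Data.Bits (shiftL, (.&.), (.|.))
    import Data.Char (chr)
    import Data.Word (Word8)

    -- Decode a list of octets as UTF-8, replacing anything malformed
    -- with U+FFFD.
    decodeUtf8 :: [Word8] -> String
    decodeUtf8 [] = []
    decodeUtf8 (b:bs)
      | b < 0x80  = chr (fromIntegral b) : decodeUtf8 bs
      | b < 0xC0  = '\xFFFD' : decodeUtf8 bs          -- stray continuation byte
      | b < 0xE0  = cont 1 (fromIntegral b .&. 0x1F) bs
      | b < 0xF0  = cont 2 (fromIntegral b .&. 0x0F) bs
      | b < 0xF8  = cont 3 (fromIntegral b .&. 0x07) bs
      | otherwise = '\xFFFD' : decodeUtf8 bs
      where
        cont :: Int -> Int -> [Word8] -> String
        cont 0 v rest
          | v <= 0x10FFFF = chr v : decodeUtf8 rest
          | otherwise     = '\xFFFD' : decodeUtf8 rest
        cont n v (c:rest)
          | c .&. 0xC0 == 0x80 =
              cont (n - 1) ((v `shiftL` 6) .|. fromIntegral (c .&. 0x3F)) rest
        cont _ _ rest = '\xFFFD' : decodeUtf8 rest

Given a way to read a [Word8] from a Handle, String input in UTF-8 is then just composition.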
- What about Unicode FilePaths? This was discussed a few months ago on the haskell(-cafe) list, no support yet in any compiler.
This is tricky, because most systems don't have anything terribly standard for it. For win32, it is standardized and should be wrappable fairly easily, but I don't know that I'd want to base my model on that. For Unix, again, there is the locale system, with, again, the problem of unrepresentable characters. Traditionally systems have essentially said "file names are zero-terminated strings of bytes that may not contain character 47, which is used to separate directory names", and their interpretation as _names_ and _characters_ was entirely up to the terminals (or graphical programs, eventually) for display and to programs for manipulation. -- Aaron Denney -><-

On Sat, Jan 08, 2005 at 08:08:38AM +0000, Aaron Denney wrote:
I suppose I wouldn't be too upset at using the locale information, but defaulting to UTF-8, rather than ASCII for unset character set information.
But if we default to a UTF-8 encoding, then there could be decoding failures when attempting to read a file. You have to consider both the possibility that characters aren't expressible in a given encoding and the possibility that files aren't expressible in a given encoding. ASCII (or iso-whatever) has the advantage that at least every file is readable. But as you say, really what we need is a binary IO system first, and then the character-based IO can do whatever it likes without breaking things (since it'll only be used by programs that actually want Unicode encoding). -- David Roundy http://www.darcs.net
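The encoding half is equally small; again this is only a sketch of how character-based IO could sit on top of octet IO (surrogate code points are encoded naively here):

    import Data.Bits (shiftR, (.&.), (.|.))
    import Data.Char (ord)
    import Data.Word (Word8)

    -- Turn a String into UTF-8 octets, ready for whatever binary IO
    -- primitives the compiler provides.
    encodeUtf8 :: String -> [Word8]
    encodeUtf8 = concatMap enc
      where
        enc c
          | n < 0x80    = [byte n]
          | n < 0x800   = [byte (0xC0 .|. shiftR n 6), trail 0]
          | n < 0x10000 = [byte (0xE0 .|. shiftR n 12), trail 6, trail 0]
          | otherwise   = [byte (0xF0 .|. shiftR n 18), trail 12, trail 6, trail 0]
          where
            n       = ord c
            byte    = fromIntegral :: Int -> Word8
            trail s = byte (0x80 .|. (shiftR n s .&. 0x3F))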

"Simon Marlow"
- Do the character class functions (isUpper, isAlpha etc.) work correctly on the full range of Unicode characters? This is true in Hugs. It's true with GHC on some systems (basically we were lazy and used the underlying C library's support here, which is patchy).
It's not obvious what the predicates should really mean, e.g. should isDigit and isHexDigit include non-ASCII digits or should isSpace include non-breaking space characters. The Haskell 98 report gives some guidelines which don't necessarily coincide with C practice nor with the expectations of Unicode people. Once this is agreed, it would be easy to make scripts which generate C code from Unicode's UnicodeData.txt tables. I think table-driven predicates and toUpper/toLower are better implemented in C; Haskell is not good at static constant tables of numbers. Another issue is that the set of predicates provided by the Haskell 98 library report is not enough to implement e.g. a Haskell 98 lexer, which needs "any Unicode symbol or punctuation". Case mapping would be better done string -> string rather than character -> character; this breaks a long-established Haskell interface. Case mapping is locale-sensitive (in very minor ways). Haskell doesn't provide algorithms like normalization or collation. In general the Haskell 98 interface is not enough for complex Unicode processing.
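The string -> string point can be illustrated with one well-known case from Unicode's SpecialCasing.txt; the function below is only a sketch (a real implementation would cover the whole file):

    import Data.Char (toUpper)

    -- U+00DF (LATIN SMALL LETTER SHARP S) upper-cases to the
    -- two-character string "SS", which a Char -> Char toUpper
    -- cannot express.
    toUpperStr :: String -> String
    toUpperStr = concatMap upper1
      where
        upper1 '\xDF' = "SS"
        upper1 c      = [toUpper c]

So toUpperStr "straße" gives "STRASSE", something no per-character mapping can produce.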
- Can you do String I/O in some encoding of Unicode? No Haskell compiler has support for this yet, and there are design decisions to be made.
The problem with designing an API for recoders is that, depending on whether the recoder is implemented in Haskell or interfaced from C, it needs a different data representation. Pure Haskell recoders prefer lazy lists of characters or bytes (except that the desire to detect source errors or characters unavailable in the target encoding breaks this), while high-performance C prefers pointers to buffers with chunks of text.
Transparent recoding makes some behavior hard to express. Imagine parsing HTTP headers followed by "\r\n\r\n" and a binary file. If you read the headers line by line and decoding is performed in blocks, then once you determine where the headers end it's too late to find the start of the binary file: a part of it has already been decoded into text. You have to determine the end of the headers while working with bytes, not characters, and only convert the first part. Not performing the recoding in blocks is tricky if the decoder is implemented in C: giving 1-byte buffers to lots of iconv() calls is not nice.
Or imagine parsing an HTML file with the encoding specified inside it in a <meta> element. Switching the encoding in the middle is incompatible with buffering. Maybe the best option is to parse the beginning in ISO-8859-1 just to determine the encoding, and then reparse everything once the encoding is known.
If characters are recoded automatically on I/O, one is tempted to extend the framework to other conversions like compression, line-ending conventions, HTML character escaping etc.
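One possible shape for a recoder interface that at least keeps the byte position explicit is sketched below (names invented here, not an existing API): the decoder is fed a chunk of bytes and hands back the characters it could decode, the bytes it has not consumed (e.g. a trailing partial multi-byte sequence), and its continuation, so the caller stays in control of which bytes get converted.

    import Data.Word (Word8)

    -- An incremental decoder: feed it bytes, get back decoded text,
    -- the unconsumed bytes, and the decoder to use for the next chunk.
    newtype Decoder = Decoder { feed :: [Word8] -> (String, [Word8], Decoder) }

    -- The trivial ISO-8859-1 decoder never holds bytes back.
    latin1 :: Decoder
    latin1 = Decoder (\bs -> (map (toEnum . fromIntegral) bs, [], latin1))

For the HTTP case the caller would still scan for "\r\n\r\n" at the byte level and feed only the header bytes to the decoder, as described above.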
- What about Unicode FilePaths? This was discussed a few months ago on the haskell(-cafe) list, no support yet in any compiler.
Nobody knows what the semantics should be. I once wrote elsewhere a short report about handling filename encodings in various languages and environments which use Unicode as their string representation. Here it is (I was later corrected that Unicode non-characters are valid in UTF-x):

I describe here languages which exclusively use Unicode strings. Some languages have both byte strings and Unicode strings (e.g. Python); there, byte strings are generally used for strings exchanged with the OS, and the programmer is responsible for the conversion if he wishes to use Unicode. I consider situations when the encoding is implicit. For I/O of file contents it's always possible to set the encoding explicitly somehow. Corrections are welcome. This is mostly based on experimentation.

Java (Sun)
----------
Strings are UTF-16. Filenames are assumed to be in the locale encoding.
a) Interpreting. Bytes which cannot be converted are replaced by U+FFFD.
b) Creating. Characters which cannot be converted are replaced by "?".
Command line arguments and standard I/O are treated in the same way.

Java (GNU)
----------
Strings are UTF-16. Filenames are assumed to be in Java-modified UTF-8.
a) Interpreting. If a filename cannot be converted, a directory listing contains a null instead of a string object.
b) Creating. All Java characters are representable in Java-modified UTF-8. Obviously not all potential filenames can be represented.
Command line arguments are interpreted according to the locale. Bytes which cannot be converted are skipped.
Standard I/O works in ISO-8859-1 by default. Obviously all input is accepted. On output, characters above U+00FF are replaced by "?".

C# (mono)
---------
Strings are UTF-16. Filenames use the list of encodings from the MONO_EXTERNAL_ENCODINGS environment variable, with UTF-8 implicitly added at the end. These encodings are tried in order.
a) Interpreting. If a filename cannot be converted, it's skipped in a directory listing. The documentation says that if a filename, a command line argument etc. looks like valid UTF-8, it is treated as such first, and MONO_EXTERNAL_ENCODINGS is consulted only in remaining cases. The reality seems to not match this (mono-1.0.5).
b) Creating. If UTF-8 is used, non-characters are converted to pseudo-UTF-8, U+0000 throws an exception (System.ArgumentException: Path contains invalid chars), paired surrogates are treated correctly, and an isolated surrogate causes an internal error:
** ERROR **: file strenc.c: line 161 (mono_unicode_to_external): assertion failed: (utf8!=NULL) aborting...
Command line arguments are treated in the same way, except that if an argument cannot be converted, the program dies at start:
[Invalid UTF-8] Cannot determine the text encoding for argument 1 (xxx\xb1\xe6\xea). Please add the correct encoding to MONO_EXTERNAL_ENCODINGS and try again.
Console.WriteLine emits UTF-8. Paired surrogates are treated correctly; non-characters and unpaired surrogates are converted to pseudo-UTF-8.
Console.ReadLine interprets text as UTF-8. Bytes which cannot be converted are skipped.

Perl
----
Depending on the convention used by a particular function and on imported packages, a Perl string is treated either as Perl-modified Unicode (with character values up to 32 bits or 64 bits depending on the architecture) or as an unspecified locale encoding. It has two internal representations: ISO-8859-1 and Perl-modified UTF-8 (with an extended range).
If every Perl string is assumed to be a Unicode string, then filenames are effectively ISO-8859-1.
a) Interpreting. Characters up to 0xFF are used.
b) Creating. If the filename has no characters above 0xFF, it is converted to ISO-8859-1. Otherwise it is converted to Perl-modified UTF-8 (all characters, not just those above 0xFF).
Command line arguments and standard I/O are treated in the same way, i.e. ISO-8859-1 on input and a mixture of ISO-8859-1 and UTF-8 on output, depending on the contents. This behavior is modifiable by importing various packages and using interpreter invocation flags. When Perl is told that command line arguments are UTF-8, the behavior for strings which cannot be converted is inconsistent: sometimes they are treated as ISO-8859-1, sometimes an error is signalled.

Haskell
-------
Haskell nominally uses Unicode. There is no conversion framework standardized or implemented yet, though. Implementations which support more than 256 characters currently assume ISO-8859-1 for filenames, command line arguments and all I/O, taking the lowest 8 bits of a character code on output.

Common Lisp: Clisp
------------------
The Common Lisp standard doesn't say anything about string encoding. In Clisp strings are UTF-32 (internally optimized as UCS-2 and ISO-8859-1 when possible). Any character code up to U+10FFFF is allowed, including non-characters and isolated surrogates. Filenames are assumed to be in the locale encoding.
a) Interpreting. If a byte cannot be converted, an exception is thrown.
b) Creating. If a character cannot be converted, an exception is thrown.

Kogut (my language; this is the current state - may be changed)
-----
Strings are UTF-32 (internally optimized as ISO-8859-1 when possible). Currently any character code up to U+10FFFF is allowed, including non-characters and isolated surrogates.
Filenames are assumed to be in the locale encoding. I plan to add an environment variable which can override this default. A program can itself set the encoding to something else, perhaps locally during execution of some code. It can use a conversion which puts U+FFFD / "?" instead of throwing an exception on error, or which does something else.
a) Interpreting. If a byte cannot be converted, an exception is thrown.
b) Creating. If a character cannot be converted, an exception is thrown. U+0000 terminates the filename (this should be fixed).
Command line arguments and standard I/O are treated in the same way.

GNOME
-----
GNOME uses UTF-8 internally, or sometimes byte strings in other encodings. I guess filenames are passed as byte strings. AFAIK filenames are sometimes expressed as URLs, even internally when it's invisible to the user, and then various unsafe bytes are escaped as two hex digits preceded by the percent sign.
From the programmer's point of view the original byte strings are generally used. Filename encoding matters for the display, though, so here I describe the user's point of view.
If the environment variable G_FILENAME_ENCODING is present, it specifies the encoding of filenames, unless it is @locale which means the encoding of the locale. If it's not present but G_BROKEN_FILENAMES is present, filenames are assumed to be in the locale encoding. If neither variable is present, filenames are assumed to be in UTF-8.
a) Interpreting. If a filename cannot be converted from the selected encoding, all non-ASCII bytes are shown as octal numbers preceded by the backslash, as hex numbers preceded by the percent sign, or as question marks, depending on the situation (I can observe all three cases in gedit). What is physically stored is the byte string, and the file is opened successfully.
b) Creating. If a character cannot be represented, the application refuses to save the file until a good filename is entered.

Mozilla
-------
I don't know how it handles filenames internally. From the user's point of view it matters how it presents a local directory listing. Filenames are assumed to be in the locale encoding. If a filename cannot be converted, it's skipped. If it can be converted but contains characters like 0x80-0x9F in ISO-8859-2, they are displayed as question marks and the file is inaccessible.

-- __("< Marcin Kowalczyk \__/ qrczak@knm.org.pl ^^ http://qrnik.knm.org.pl/~qrczak/

On 9 Jan 05, at 20:03, Marcin 'Qrczak' Kowalczyk wrote:
Once this is agreed, it would be easy to make scripts which generate C code from Unicode's UnicodeData.txt tables. I think table-driven predicates and toUpper/toLower are better implemented in C; Haskell is not good at static constant tables of numbers.
Sebastien Carlier already wrote this for hOp, see: http://etudiants.insia.org/~jbobbio/hOp/Gen_wctype.hs Cheers, Jérémy.

Jérémy Bobbio wrote:
Once this is agreed, it would be easy to make scripts which generate C code from Unicode's UnicodeData.txt tables. I think table-driven predicates and toUpper/toLower are better implemented in C; Haskell is not good at static constant tables of numbers.
Sebastien Carlier already wrote this for hOp, see: http://etudiants.insia.org/~jbobbio/hOp/Gen_wctype.hs
And I've done a similar thing for my language Kogut some time ago:
http://cvs.sourceforge.net/viewcvs.py/kokogut/kokogut/runtime/make-char-tabl...
http://cvs.sourceforge.net/viewcvs.py/kokogut/kokogut/lib/Core/Kokogut/Chara...
Let's see how these separately developed interpretations of predicates differ (mine also have different names and there are a few more):

       |Sebastien's| mine
-------+-----------+----------
alnum  | L* N*     | L* N*
alpha  | L*        | L*
cntrl  | Cc        | Cc Zl Zp
digit  | N*        | Nd
lower  | Ll        | Ll
punct  | P*        | P*
upper  | Lu        | Lt Lu
blank  | Z* \t\n\r | Z* (except U+00A0 U+2007 U+202F) \t\n\v\f\r U+0085

Note that the interpretation of "digit" differs from both C and Haskell 98 which specify it to be ASCII-only. Actually I have ASCII-only variants of IsDigit parametrized by the number base.
-- __("< Marcin Kowalczyk \__/ qrczak@knm.org.pl ^^ http://qrnik.knm.org.pl/~qrczak/
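Predicates of exactly this table-driven kind can be phrased against general categories; as a sketch, using the generalCategory function that current GHC versions provide in Data.Char (it was not in the standard libraries at the time of this thread), the two readings of "upper" above would be:

    import Data.Char (GeneralCategory(..), generalCategory)

    -- "upper" as Lu only (Sebastien's column).
    isUpperLu :: Char -> Bool
    isUpperLu c = generalCategory c == UppercaseLetter

    -- "upper" as Lt or Lu (the other column).
    isUpperLtLu :: Char -> Bool
    isUpperLtLu c = generalCategory c `elem` [TitlecaseLetter, UppercaseLetter]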

Marcin 'Qrczak' Kowalczyk wrote:
- Do the character class functions (isUpper, isAlpha etc.) work correctly on the full range of Unicode characters?
It's not obvious what the predicates should really mean, e.g. should isDigit and isHexDigit include non-ASCII digits or should isSpace include non-breaking space characters.
I think perhaps the answer is all of the above. The functions could be defined in multiple modules, so that 'ASCII.isSpace' would match the "normal" space character only, while 'Unicode.isSpace' could match all the weird and wonderful stuff in the standard. I also have the feeling that 'String' and/or 'Char' should be classes rather than data types (perhaps with 'String' built on top of a more general 'Sequence' type?) Ideally, you could treat an array as well as a list as a string. JM$0.02 -kzm -- If I haven't seen further, it is by standing in the footprints of giants
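A minimal sketch of the "String as a class" idea (class and method names invented here) might look like the following; an array-backed instance would then be added alongside the list one:

    {-# LANGUAGE FlexibleInstances #-}

    import Data.Char (toUpper)

    -- Anything convertible to and from a list of Chars can be used
    -- as a string.
    class StringLike s where
      toCharList   :: s -> [Char]
      fromCharList :: [Char] -> s

    instance StringLike [Char] where
      toCharList   = id
      fromCharList = id

    -- Functions written against the class work for every representation.
    upcaseAll :: StringLike s => s -> s
    upcaseAll = fromCharList . map toUpper . toCharList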
participants (9)
- Aaron Denney
- David Roundy
- Dimitry Golubovsky
- Jérémy Bobbio
- Ketil Malde
- Lennart Augustsson
- Malcolm Wallace
- Marcin 'Qrczak' Kowalczyk
- Simon Marlow