UTF-8 decoding error

Hi,
with ghc-6.5.20060201 I get a "UTF-8 decoding error" for latin1 characters in my string literals. Do I have to change my sources or can I set a certain environment variable? I have LANG=de_DE@euro and LC_CTYPE not set (which is ok for hugs).
Cheers Christian

Christian Maeder wrote:
with ghc-6.5.20060201 I get a "UTF-8 decoding error" for latin1 characters in my string literals.
Do I have to change my sources or can I set a certain environment variable?
I have LANG=de_DE@euro and LC_CTYPE not set (which is ok for hugs)
GHC now expects source files to be UTF-8 only. I really did this as an experiment to see if anyone complained, because it will be more work to implement other encodings. You're the second person to notice this. So - do you need Latin-1, or could you use UTF-8?
If you're using emacs, it's pretty easy to default to UTF-8 for haskell source files, BTW. Just add this to your .emacs:
  (modify-coding-system-alist 'file "\\.l?hs\\'" 'utf-8)
Cheers, Simon

Simon Marlow wrote:
So - do you need Latin-1, or could you use UTF-8?
I'm not amused to change the encoding of many haskell source files (particularly those that are not mine). These files can then no longer be compiled by earlier ghcs (though I don't understand how ghc-6.4.1 recognises the lexical error). I'm tempted to replace "ä" with "\228" in literals. What does haddock do with utf-8 in comments? Will DrIFT -- using read- and writeFile -- still work correctly? Cheers Christian

Christian Maeder wrote:
So - do you need Latin-1, or could you use UTF-8?
I'm not amused to change the encoding of many haskell source files (particularly those that are not mine).
Fair enough, but there will have to be some way to specify the encoding, either via a pragma, command-line option, or the locale. I'm really not sure what is the best choice here. Perhaps all three, with locale being the default, overridden by pragmas and command-line options. The easiest way for us to handle encodings other than UTF-8 is for it to be a new preprocessing step, running 'iconv'. (But what do we do on Windows? Bundle iconv? Ew.) John - what do you plan to do here?
These files can then no longer be compiled by earlier ghcs (though I don't understand how ghc-6.4.1 recognises the lexical error).
I'm tempted to replace "ä" with "\228" in literals. What does haddock do with utf-8 in comments? Will DrIFT -- using read- and writeFile -- still work correctly?
Haddock needs to be updated too. But if GHC implements recoding via iconv, you can use GHC as a preprocessor to recode back to Latin-1; since you have to use GHC as a preprocessor with Haddock anyway, this shouldn't be much harder (of course, if you use non-Latin-1 characters this fails). Eventually, when Haddock runs on top of GHC, the issue will go away :) I don't know about DrIFT. Cheers, Simon
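For the Latin-1 case specifically, the recoding step discussed above does not strictly need iconv, because every Latin-1 byte maps directly to the Unicode code point with the same number. The following is only a minimal sketch under that assumption, not what GHC does; the names latin1ToUtf8 and recodeFile are hypothetical, and the source and destination paths are assumed to be different files.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (chr, ord)
import System.IO

-- Encode one Latin-1 character (code point 0..255) as UTF-8.
-- The resulting bytes are represented as Chars in the range 0..255.
latin1CharToUtf8 :: Char -> String
latin1CharToUtf8 c
  | n < 0x80  = [c]                                  -- ASCII: one byte, unchanged
  | otherwise = [ chr (0xC0 .|. (n `shiftR` 6))      -- lead byte (0xC2 or 0xC3)
                , chr (0x80 .|. (n .&. 0x3F)) ]      -- continuation byte
  where n = ord c

latin1ToUtf8 :: String -> String
latin1ToUtf8 = concatMap latin1CharToUtf8

-- Hypothetical helper: read src as raw bytes (Latin-1), write dst as UTF-8.
-- Both handles are binary, so each Char stands for exactly one byte.
recodeFile :: FilePath -> FilePath -> IO ()
recodeFile src dst = do
  hIn <- openBinaryFile src ReadMode
  contents <- hGetContents hIn
  withBinaryFile dst WriteMode (\hOut -> hPutStr hOut (latin1ToUtf8 contents))
  hClose hIn

main :: IO ()
main = print (map ord (latin1ToUtf8 "\228"))  -- prints [195,164]
```

Going the other way (UTF-8 back to Latin-1) only works while every character is below 256, which is the limitation mentioned above.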

Simon Marlow wrote:
Christian Maeder wrote:
I'm tempted to replace "ä" with "\228" in literals. What does haddock do with utf-8 in comments? Will DrIFT -- using read- and writeFile -- still work correctly?
The problem I fear is that writeFile does not produce a utf-8 encoded file:
  writeFile "t.hs" "main = putStrLn \"äöüßÄÖÜ\""
Using "\228\246\252\223\196\214\220" instead of "äöüßÄÖÜ" only avoids conversion to utf-8 of the initial file l1.hs (attached), but the generated file t.hs is a latin-1 file in both cases.
Cheers Christian
  *Main> :l l1.hs
  Compiling Main ( l1.hs, interpreted )
  Ok, modules loaded: Main.
  *Main> main
  *Main> :l t.hs
  Compiling Main ( t.hs, interpreted )
  Ok, modules loaded: Main.
  *Main> main
  äöüßÄÖÜ

Christian Maeder wrote:
Simon Marlow wrote:
Christian Maeder wrote:
I'm tempted to replace "ä" with "\228" in literals. What does haddock do with utf-8 in comments? Will DrIFT -- using read- and writeFile -- still work correctly?
The problem I fear is that writeFile does not produce a utf-8 encoded file:
writeFile "t.hs" "main = putStrLn \"äöüßÄÖÜ\""
Using "\228\246\252\223\196\214\220" instead of "äöüßÄÖÜ" only avoids conversion to utf-8 of the initial file l1.hs (attached), but the generated file t.hs is a latin-1 file in both cases.
Cheers Christian
  *Main> :l l1.hs
  Compiling Main ( l1.hs, interpreted )
  Ok, modules loaded: Main.
  *Main> main
  *Main> :l t.hs
  Compiling Main ( t.hs, interpreted )
  Ok, modules loaded: Main.
  *Main> main
  äöüßÄÖÜ
I'm not sure I see the problem - the I/O library doesn't do unicode encoding/decoding, it always just takes the low 8 bits of each character, hence truncating Unicode to Latin-1. If you restrict yourself to Latin-1 characters in string literals, then I/O will work as expected (i.e. Latin-1 only). If you need to do I/O in a different encoding, I'm afraid you'll have to code it up yourself right now, or use some other library (there are packed string libraries around that can do I/O in UTF-8, for example, and Bulat's new I/O library does char encodings). Cheers, Simon
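The "low 8 bits" behaviour described above can be modelled with a small pure function. This is only an illustrative sketch of the truncation as described in this message, not the actual I/O library code; the name truncateToLow8 is hypothetical.

```haskell
import Data.Bits ((.&.))
import Data.Char (chr, ord)

-- Model of the described output behaviour: keep only the low 8 bits of
-- each character's code point, so anything above U+00FF is mangled.
truncateToLow8 :: String -> String
truncateToLow8 = map (chr . (.&. 0xFF) . ord)

main :: IO ()
main = do
  -- Latin-1 characters (here ä, ö, ü) pass through unchanged.
  print (map ord (truncateToLow8 "\228\246\252"))  -- prints [228,246,252]
  -- The euro sign U+20AC loses its high byte and becomes 0xAC.
  print (map ord (truncateToLow8 "\8364"))         -- prints [172]
```

Under this model a string literal restricted to Latin-1 is written byte for byte, which is why the generated t.hs ends up as a Latin-1 file.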

Simon Marlow wrote:
I'm not sure I see the problem - the I/O library doesn't do unicode encoding/decoding, it always just takes the low 8 bits of each character, hence truncating Unicode to Latin-1. If you restrict yourself to Latin-1 characters in string literals, then I/O will work as expected (i.e. Latin-1 only).
But if ghc-6.5 expects utf-8 encoded source files, all other haskell applications that read or write haskell files must be adapted as well, or am I wrong? C.

Christian Maeder wrote:
Simon Marlow wrote:
I'm not sure I see the problem - the I/O library doesn't do unicode encoding/decoding, it always just takes the low 8 bits of each character, hence truncating Unicode to Latin-1. If you restrict yourself to Latin-1 characters in string literals, then I/O will work as expected (i.e. Latin-1 only).
But if ghc-6.5 expects utf-8 encoded source files, all other haskell applications that read or write haskell files must be adapted as well, or am I wrong?
That's true. I guess what you're saying is that this is a problem for you, and your life would be easier if we supported Latin-1 as an encoding for source files again. That's fine - as I mentioned, I only restricted it to UTF-8 initially because (a) it was easier and (b) I wanted to see if anyone would be adversely affected. I've now added a ticket for this: http://cvs.haskell.org/trac/ghc/ticket/690 Thanks for the feedback! Cheers, Simon

On Fri, Feb 10, 2006 at 12:50:57PM +0000, Simon Marlow wrote:
That's true. I guess what you're saying is that this is a problem for you, and your life would be easier if we supported Latin-1 as an encoding for source files again. That's fine - as I mentioned, I only restricted it to UTF-8 initially because (a) it was easier and (b) I wanted to see if anyone would be adversely affected.
Another possibility is a quasi-utf8 encoding, where any invalid utf8 sequences are passed through as latin1 characters. In practice, this works very well for interpreting both latin1 and utf8 transparently, but it is more than somewhat hacky. Perhaps we can have quasi-utf8 be the default, with strict utf8 and latin1 being switches?
John
--
John Meacham - ⑆repetae.net⑆john⑈
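A quasi-utf8 decoder along these lines could be sketched as follows. This is an assumption about how such a scheme might work, not code from GHC or from any existing library; the names quasiUtf8 and multiByte are hypothetical. It decodes well-formed UTF-8 sequences and falls back to treating a byte as a Latin-1 character whenever the sequence starting at that byte is invalid.

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Interpret a single byte as the Latin-1 character with the same number.
latin1 :: Word8 -> Char
latin1 b = chr (fromIntegral b)

-- Try to decode a multi-byte UTF-8 sequence: n continuation bytes,
-- the payload bits of the lead byte, and the smallest code point that
-- legitimately needs this length (rejects overlong encodings).
multiByte :: Int -> Word8 -> Int -> [Word8] -> Maybe (Char, [Word8])
multiByte n lead minCode bs
  | length conts == n
  , all (\x -> x .&. 0xC0 == 0x80) conts
  , code >= minCode, code <= 0x10FFFF
  , code < 0xD800 || code > 0xDFFF        -- surrogates are not valid UTF-8
  = Just (chr code, rest)
  | otherwise = Nothing
  where
    (conts, rest) = splitAt n bs
    code = foldl (\acc x -> (acc `shiftL` 6) .|. fromIntegral (x .&. 0x3F))
                 (fromIntegral lead) conts

-- Decode valid UTF-8; pass any byte that starts an invalid sequence
-- through as a Latin-1 character and continue with the next byte.
quasiUtf8 :: [Word8] -> String
quasiUtf8 [] = []
quasiUtf8 (b : bs)
  | b < 0x80 = latin1 b : quasiUtf8 bs
  | b .&. 0xE0 == 0xC0, Just (c, r) <- multiByte 1 (b .&. 0x1F) 0x80 bs    = c : quasiUtf8 r
  | b .&. 0xF0 == 0xE0, Just (c, r) <- multiByte 2 (b .&. 0x0F) 0x800 bs   = c : quasiUtf8 r
  | b .&. 0xF8 == 0xF0, Just (c, r) <- multiByte 3 (b .&. 0x07) 0x10000 bs = c : quasiUtf8 r
  | otherwise = latin1 b : quasiUtf8 bs

main :: IO ()
main = do
  print (map ord (quasiUtf8 [0xC3, 0xA4]))  -- UTF-8 "ä": prints [228]
  print (map ord (quasiUtf8 [0xE4, 0x62]))  -- Latin-1 "äb": prints [228,98]
```

Note that the first input is ambiguous: read as Latin-1, the bytes 0xC3 0xA4 are the two characters "Ã¤", but this decoder always prefers the UTF-8 reading. Any Latin-1 text that happens to contain a well-formed UTF-8 sequence is therefore decoded differently from what its author wrote.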

John Meacham writes:
Another possibility is a quasi-utf8 encoding, where any invalid utf8 sequences are passed through as latin1 characters. In practice, this works very well for interpreting both latin1 and utf8 transparently, but it is more than somewhat hacky.
It would not be reliable. I'm strongly against that: it gives an illusion that Latin1 works, but it breaks in very rare cases.
--
   __("<         Marcin Kowalczyk
   \__/       qrczak@knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

Marcin 'Qrczak' Kowalczyk wrote:
John Meacham writes:
Another possibility is a quasi-utf8 encoding, where any invalid utf8 sequences are passed through as latin1 characters. In practice, this works very well for interpreting both latin1 and utf8 transparently, but it is more than somewhat hacky.
It would not be reliable. I'm strongly against that: it gives an illusion that Latin1 works, but it breaks in very rare cases.
I tend to agree with Marcin here - that doesn't sound like a good solution. Incidentally, we do ignore encoding errors in comments (more by accident than by design, though :-). Cheers, Simon

participants (4)
- Christian Maeder
- John Meacham
- Marcin 'Qrczak' Kowalczyk
- Simon Marlow