
Please advise how to write a Unicode string so that this example works:
main = do putStrLn "Les signes orthographiques inclus les accents (aigus, grâve, circonflexe), le tréma, l'apostrophe, la cédille, le trait d'union et la majuscule."
I get the following error:
hello.hs:4:68: lexical error in string/character literal (UTF-8 decoding error)
Failed, modules loaded: none.
Prelude>
Also, how do I read Unicode characters from standard input?
Thanks!
--
Dmitri O. Kondratiev
dokondr@gmail.com
http://www.geocities.com/dkondr

2008/11/22 Dmitri O.Kondratiev
Please advise how to write a Unicode string so that this example works:
main = do putStrLn "Les signes orthographiques inclus les accents (aigus, grâve, circonflexe), le tréma, l'apostrophe, la cédille, le trait d'union et la majuscule."
That really ought to work. Is the file encoded in UTF-8 (rather than, e.g., Latin-1)?
Luke

That really ought to work. Is the file encoded in UTF-8 (rather than, e.g., Latin-1)?
It only appears to work. The plain print functions garble Unicode characters. For example:
putStrLn "Ну и где этот ваш хвалёный уникод?"
prints the following output:
C 8 345 MB>B 20H E20;Q=K9 C=8:>4?
Not pretty? Dmitri's variant seems to work fine, though.
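
The garbled output is consistent with each character being cut down to the low 8 bits of its code point before it reaches the terminal. The following is a minimal sketch of that truncation, not a quote from GHC's I/O code; the helper name truncate8 is made up for illustration, and the truncation behaviour itself is an assumption inferred from the output shown in this thread.

```haskell
import Data.Char (chr, ord)

-- Keep only the low 8 bits of each code point.
-- Assumption: this mimics what the plain putStrLn of the time did;
-- it is inferred from the garbled output, not taken from GHC's source.
truncate8 :: String -> String
truncate8 = map (chr . (`mod` 256) . ord)

main :: IO ()
main = do
  let garbled = truncate8 "Ну и где этот ваш хвалёный уникод?"
  -- Drop control characters (the first letter truncates to 0x1D)
  -- so that only the printable survivors are shown.
  putStrLn (filter (>= ' ') garbled)
  -- prints: C 8 345 MB>B 20H E20;Q=K9 C=8:>4?
```

Latin-1 text such as Dmitri's example survives this treatment because all of its code points are already below 256, which would explain why his variant seems fine.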

Alexey Khudyakov wrote:
putStrLn "Ну и где этот ваш хвалёный уникод?"
:-)
--
Dr. Janis Voigtlaender
http://wwwtcs.inf.tu-dresden.de/~voigt/
mailto:voigt@tcs.inf.tu-dresden.de

alexey.skladnoy:
That really ought to work. Is the file encoded in UTF-8 (rather than, e.g., Latin-1)?
It only appears to work. The plain print functions garble Unicode characters. For example:
putStrLn "Ну и где этот ваш хвалёный уникод?"
prints the following output:
C 8 345 MB>B 20H E20;Q=K9 C=8:>4?
Not pretty? Dmitri's variant seems to work fine, though.
Use the UTF8 printing functions,
import qualified System.IO.UTF8 as U
main = U.putStrLn "Ну и где этот ваш хвалёный уникод?"
Running this,
*Main> main
Ну и где этот ваш хвалёный уникод?
-- Don

On Sat, 2008-11-22 at 10:02 -0800, Don Stewart wrote:
Use the UTF8 printing functions,
import qualified System.IO.UTF8 as U
main = U.putStrLn "Ну и где этот ваш хвалёный уникод?"
Running this,
*Main> main
Ну и где этот ваш хвалёный уникод?
This upsets me. We need to get on with doing this properly. The System.IO.UTF8 module is a useful interim workaround but we're not using it properly most of the time.
It is right when you're working with a text file that you know to be in the UTF-8 format. For example .cabal files are UTF-8, irrespective of the platform or the system locale.
It is not right when working with the terminal. The encoding of the terminal is given by the locale. We cannot statically declare that it is UTF-8.
The right thing to do is to make Prelude.putStrLn do the right thing. We had a long discussion on how to fix the H98 IO functions to do this better. We just need to get on with it, or we'll end up with too many cases of people using System.IO.UTF8 inappropriately.
For the case where System.IO.UTF8 is right we probably still want a more general solution, like a handle setting for the encoding.
Duncan
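
A handle setting of the kind asked for here is available in later versions of GHC's base library, where System.IO exports hSetEncoding, hGetEncoding and utf8. Note that this postdates the thread. The sketch below shows the split described above under that assumption: a file known to be UTF-8 gets an explicit encoding, while stdout is left at whatever the locale provides. The file name and the helper names writeUtf8 and readUtf8 are hypothetical.

```haskell
import System.IO

-- Write a file that is UTF-8 by definition, whatever the locale says.
writeUtf8 :: FilePath -> String -> IO ()
writeUtf8 path s = withFile path WriteMode $ \h -> do
  hSetEncoding h utf8
  hPutStr h s

-- Read it back with the same explicit encoding.
readUtf8 :: FilePath -> IO String
readUtf8 path = withFile path ReadMode $ \h -> do
  hSetEncoding h utf8
  s <- hGetContents h
  -- Force the lazy read before withFile closes the handle.
  length s `seq` return s

main :: IO ()
main = do
  let path = "utf8-demo.txt"  -- hypothetical file name
  writeUtf8 path "la cédille, le tréma\n"
  s <- readUtf8 path
  print (s == "la cédille, le tréma\n")
  -- stdout is not touched: it keeps the encoding taken from the locale.
  hGetEncoding stdout >>= print
```

Keeping the encoding on the handle rather than in the function name means the same putStrLn and getLine work for both cases, and only code that knows the file format has to say so.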

This upsets me. We need to get on with doing this properly. The System.IO.UTF8 module is a useful interim workaround but we're not using it properly most of the time.
... skipped ...
The right thing to do is to make Prelude.putStrLn do the right thing. We had a long discussion on how to fix the H98 IO functions to do this better. We just need to get on with it, or we'll end up with too many cases of people using System.IO.UTF8 inappropriately.
But this raises the question of what "the right thing" is. If the locale is UTF-8, or the system supports Unicode in some other way, there is no problem: just encode the string properly. The problem is how to deal with untranslatable characters. Skip them? Replace them with question marks? Something else? We should probably look at how this is solved (or not solved) in other languages.
And this problem is not limited to IO. It arises whenever strings cross the border between the Haskell world and the outside world: opening files with Unicode names, exec-ing, etc.
For example:
Prelude> readFile "файл"
*** Exception: D09;: openFile: does not exist (No such file or directory)
Prelude> executeFile "echo" True ["Сейчас сломается"] Nothing
!59G0A A;><05BAO
Although it's possible to work around this using encodeString/decodeString from Codec.Binary.UTF8.String, that won't work on non-UTF-8 systems. And it's not only Neanderthal systems with one-byte locales: Windows, AFAIK, uses another Unicode encoding.

alexey.skladnoy:
This upsets me. We need to get on with doing this properly. The System.IO.UTF8 module is a useful interim workaround but we're not using it properly most of the time.
... skipped ...
The right thing to do is to make Prelude.putStrLn do the right thing. We had a long discussion on how to fix the H98 IO functions to do this better. We just need to get on with it, or we'll end up with too many cases of people using System.IO.UTF8 inappropriately.
But this raises the question of what "the right thing" is. If the locale is UTF-8, or the system supports Unicode in some other way, there is no problem: just encode the string properly. The problem is how to deal with untranslatable characters. Skip them? Replace them with question marks? Something else? We should probably look at how this is solved (or not solved) in other languages.
And this problem is not limited to IO. It arises whenever strings cross the border between the Haskell world and the outside world: opening files with Unicode names, exec-ing, etc.
For example:
Prelude> readFile "файл"
*** Exception: D09;: openFile: does not exist (No such file or directory)
Prelude> executeFile "echo" True ["Сейчас сломается"] Nothing
!59G0A A;><05BAO
Although it's possible to work around this using encodeString/decodeString from Codec.Binary.UTF8.String, that won't work on non-UTF-8 systems. And it's not only Neanderthal systems with one-byte locales: Windows, AFAIK, uses another Unicode encoding.
For just decoding / encoding in other locales, there are codec libraries. Hunt around on hackage.
http://hackage.haskell.org/cgi-bin/hackage-scripts/package/encoding
http://hackage.haskell.org/cgi-bin/hackage-scripts/package/Encode
-- Don

Hello Alexey, Sunday, November 23, 2008, 10:20:47 AM, you wrote:
And this problem is not limited to IO. It arises whenever strings cross the border between the Haskell world and the outside world: opening files with Unicode names, exec-ing, etc.
This depends entirely on the libraries, and the I/O libs bundled with GHC don't support Unicode filenames. The FreeArc project contains its own simple I/O library that doesn't have this problem (and it also supports files >4 GB on Windows). Unfortunately, this library doesn't include any buffering.
--
Best regards,
Bulat
mailto:Bulat.Ziganshin@gmail.com

Alexey Khudyakov wrote:
But this raises the question of what "the right thing" is. If the locale is UTF-8, or the system supports Unicode in some other way, there is no problem: just encode the string properly. The problem is how to deal with untranslatable characters. Skip them? Replace them with question marks? Something else? We should probably look at how this is solved (or not solved) in other languages.
Regarding untranslatable characters, I think the only correct thing to do is consider it exceptional behavior and have the conversion function accept a handler function which takes the character as input and produces a string for it. That way programs can define their own behavior, since this is something that doesn't have a "right" way to recover in all cases. Canonical handlers which skip, replace with question marks (or another arbitrary character), throw actual exceptions, etc. could be provided for convenience.
For stream-based "strings" à la ByteString, dealing with this sort of a handler in an efficient manner is fairly straightforward (though some CPS tricks may be needed to get rid of the Maybe in the result of the basic converter). For [Char] strings efficiency is harder, but the implementation should still be easy (given the basic converter).
Most extant languages I've seen tend to pick a single solution for all cases, but I don't think we should follow along that path.
--
Live well,
~wren
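
The handler idea can be sketched for [Char] strings as follows. This is only an illustration under stated assumptions: Latin-1 stands in for an arbitrary target encoding, and the names Handler, encodeLatin1With, skip, replaceWith and failOn are all invented here rather than taken from any library.

```haskell
import Data.Char (ord)
import Data.Word (Word8)

-- A handler decides which bytes to emit for a character that the
-- target encoding cannot represent.
type Handler = Char -> [Word8]

-- Encode to Latin-1 (ISO-8859-1), deferring to the handler for any
-- code point above U+00FF.
encodeLatin1With :: Handler -> String -> [Word8]
encodeLatin1With onBad = concatMap enc
  where
    enc c
      | ord c <= 0xFF = [fromIntegral (ord c)]
      | otherwise     = onBad c

-- Canonical handlers, provided for convenience.
skip :: Handler
skip _ = []

replaceWith :: Word8 -> Handler
replaceWith b _ = [b]

failOn :: Handler
failOn c = error ("unencodable character: " ++ show c)

main :: IO ()
main = do
  print (encodeLatin1With skip "a ы")              -- the Cyrillic letter is dropped
  print (encodeLatin1With (replaceWith 63) "a ы")  -- 63 is the byte for '?'
```

Because the handler returns a list of bytes, a program can also substitute a multi-byte escape sequence or a transliteration instead of a single replacement character.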

Excerpts from Dmitri O.Kondratiev's message of Sat Nov 22 05:40:41 -0600 2008:
Please advise how to write a Unicode string so that this example works:
main = do putStrLn "Les signes orthographiques inclus les accents (aigus, grâve, circonflexe), le tréma, l'apostrophe, la cédille, le trait d'union et la majuscule."
I get the following error:
hello.hs:4:68: lexical error in string/character literal (UTF-8 decoding error)
Failed, modules loaded: none.
Prelude>
Also, how do I read Unicode characters from standard input?
Thanks!
Hi,
Check out the utf8-string package on hackage:
http://hackage.haskell.org/cgi-bin/hackage-scripts/package/utf8-string
In particular, you probably want the System.IO.UTF8 functions, which are identical to their non-UTF-8 counterparts in System.IO except, well, they handle Unicode properly.
More specifically, you will probably want to look mainly at Codec.Binary.UTF8.String.encodeString and decodeString (in fact, most of the System.IO.UTF8 functions are defined in terms of these, e.g. 'putStrLn x = IO.putStrLn (encodeString x)' and 'getLine = liftM decodeString IO.getLine').
Austin
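
To make the encodeString step concrete, here is a hand-written sketch of UTF-8 encoding in which every output byte is represented as a Char below 256, which is the shape a definition like 'putStrLn x = IO.putStrLn (encodeString x)' relies on. It is an illustration only, not the utf8-string package's implementation; the names utf8Char and encodeUtf8 are invented, and surrogate code points are not treated specially.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (chr, ord)

-- Encode one code point as 1 to 4 UTF-8 bytes, each held in a Char.
utf8Char :: Char -> String
utf8Char c
  | n < 0x80    = [chr n]
  | n < 0x800   = map chr [0xC0 .|. (n `shiftR` 6), cont n]
  | n < 0x10000 = map chr [0xE0 .|. (n `shiftR` 12), cont (n `shiftR` 6), cont n]
  | otherwise   = map chr [ 0xF0 .|. (n `shiftR` 18), cont (n `shiftR` 12)
                          , cont (n `shiftR` 6), cont n ]
  where
    n = ord c
    -- Continuation byte: 10xxxxxx carrying the low six bits.
    cont x = 0x80 .|. (x .&. 0x3F)

-- Encode a whole string; the result contains only Chars below 256.
encodeUtf8 :: String -> String
encodeUtf8 = concatMap utf8Char

main :: IO ()
main = print (map ord (encodeUtf8 "é"))  -- prints [195,169]
```

Since every resulting Char is below 256, an output function that only keeps the low 8 bits of each character would pass these bytes through unchanged.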

Please advise how to write a Unicode string so that this example works:
main = do putStrLn "Les signes orthographiques inclus les accents (aigus, grâve, circonflexe), le tréma, l'apostrophe, la cédille, le trait d'union et la majuscule." (...)
Besides the Haskell stuff, you probably want to check whether your terminal outputs UTF-8. I use a nice X terminal named 'mlterm'. Its main goal is to support Unicode. But I don't know enough to tell you how to check your terminal, or even whether just changing to mlterm will always work.
Sometimes I wonder why distributions don't just agree on considering support for anything but UTF-8 a bug (except in 'iconv', of course). Well, there's probably someone out there who would have problems with that, and I don't want problems for anyone. But I hope their problems would be worse than mine in trying to deal with different encodings.
Best,
Maurício
participants (10)
- Alexey Khudyakov
- Austin Seipp
- Bulat Ziganshin
- Dmitri O.Kondratiev
- Don Stewart
- Duncan Coutts
- Janis Voigtlaender
- Luke Palmer
- Mauricio
- wren ng thornton