How to input Unicode string in Haskell program?

Imagine we have this simple program:

module Main(main) where

main = do
    x <- getLine
    putStrLn x

Now I want to run it somehow, enter "résumé 履歴書 резюме" and see this string printed back as "résumé 履歴書 резюме".

The first problem is that my computer runs Windows, which means that I can't use ghci ":main" or the result of "ghc main.hs" to enter such an outrageous string: the Windows console is locked to one specific local code page, and no code page contains Latin-1, Cyrillic and kanji characters at the same time.

But there is also WinGHCi. So I do ":main" and copy-paste this string into the window (it works, because Windows has supported Unicode for 20 years now), but the output is all messed up. In a rather curious way, actually: the input string is converted to a UTF-8 byte string, and its bytes are then treated as characters from my local code page.

So it appears that I have no way to enter Unicode strings into my Haskell programs by hand; I have to read them from files. That's sad, and I refuse to think I am the first one with such a problem, so I assume there is a solution or workaround. Would someone please tell me this solution? Apart from "Just stick to the 127 letters of ASCII", of course.
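The garbling described in the question (UTF-8 bytes reinterpreted as single-byte characters) can be reproduced in a few lines. This is only a sketch: the poster's actual code page is not stated, so Latin-1 is used here as a stand-in, and the helper name `misdecode` is made up for illustration.

```haskell
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Encode text as UTF-8, then (wrongly) decode every byte as one
-- Latin-1 character. Latin-1 is an assumed stand-in for the local
-- code page; the real one in the question is unknown.
misdecode :: T.Text -> T.Text
misdecode = TE.decodeLatin1 . TE.encodeUtf8

main :: IO ()
main = do
  let original = T.pack "résumé"
      garbled  = misdecode original
  -- Each "é" is two bytes in UTF-8, so 6 characters turn into 8.
  print (T.length original, T.length garbled)
```

Only the lengths are printed, so the demonstration itself does not depend on the console being able to display non-ASCII output.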

Have you tried running ghci inside Emacs?
(In reply to Semyon Kholodnov's message of 21.02.2013, 13:58.)
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

The problem is that Prelude.getLine uses the current locale to decode characters. For example, if you have a UTF-8 locale, everything works out of the box:

$ runhaskell 1.hs
résumé 履歴書 резюме
résumé 履歴書 резюме

But if you change the locale, you get an error:

$ LANG="C" runhaskell 1.hs
résumé 履歴書 резюме
1.hs: <stdin>: hGetLine: invalid argument (invalid byte sequence)

To force Haskell to use UTF-8, you can read the input as a byte sequence and decode it as UTF-8, for example:

import qualified Data.ByteString.Char8 as S
import qualified Data.Text.Encoding as T

main = do
    -- read raw bytes (no locale decoding), then decode them as UTF-8
    x <- fmap T.decodeUtf8 S.getLine
    -- encode back to UTF-8 bytes so the output bypasses the locale too
    S.putStrLn (T.encodeUtf8 x)

Now the code works even with a different locale: the input from the shell is always read as UTF-8, regardless of the user's locale settings.

-- Alexander
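To see which encoding the runtime picked up from the locale, one can ask for it directly. The following is a minimal sketch using `localeEncoding` and `hGetEncoding` from System.IO; the helper name `describeEncodings` is made up for illustration.

```haskell
import System.IO (hGetEncoding, localeEncoding, stdin)

-- Report the encoding GHC derived from the locale and the one
-- currently attached to stdin (Nothing would mean binary mode).
describeEncodings :: IO [String]
describeEncodings = do
  inEnc <- hGetEncoding stdin
  return
    [ "locale encoding: " ++ show localeEncoding
    , "stdin encoding:  " ++ show inEnc
    ]

main :: IO ()
main = describeEncodings >>= mapM_ putStrLn
```

Running it once normally and once with LANG="C" should show the difference behind the two transcripts above; the exact encoding names printed depend on the platform.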

You can also set the encoding of a handle (e.g. System.IO.stdin) from
code, overriding the locale default, using `System.IO.hSetEncoding` [0].
Erik
[0] http://hackage.haskell.org/packages/archive/base/latest/doc/html/System-IO.h...
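As a rough sketch of that suggestion applied to the original echo program: `hSetEncoding` and `utf8` come from System.IO, while the helper name `useUtf8` is made up here. Whether this alone is enough on a Windows console is an open assumption, since the console code page still determines which bytes the program receives.

```haskell
import System.IO

-- Override the locale-derived encoding on stdin and stdout.
useUtf8 :: IO ()
useUtf8 = do
  hSetEncoding stdin utf8
  hSetEncoding stdout utf8

main :: IO ()
main = do
  useUtf8
  x <- getLine   -- now decoded as UTF-8 whatever the locale says
  putStrLn x     -- and encoded as UTF-8 on the way out
```

Compared with reading a ByteString and decoding it by hand, this keeps the program working with ordinary String I/O.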
(Erik's message replies to Alexander V Vershilov's post of Thu, Feb 21, 2013 at 12:07 PM.)

Alexander V Vershilov writes:
The problem is that Prelude.getLine uses the current locale to decode characters. For example, if you have a UTF-8 locale, everything works out of the box:

$ runhaskell 1.hs
résumé 履歴書 резюме
résumé 履歴書 резюме

But if you change the locale, you get an error:

$ LANG="C" runhaskell 1.hs
résumé 履歴書 резюме
1.hs: <stdin>: hGetLine: invalid argument (invalid byte sequence)

That seems to be correct behaviour: the only way to know the meaning of the bits input by a user is what encoding the user says they are in.

But in general this issue is an instance of inheriting sins from the OS: the meaning of the bit pattern in a file should be part of the file, but we are stuck with OSs that use a global variable (which should be anathema to Haskell). So if user A has the locale set one way, creates a file, and sends the filename to user B on the same system, user B might well see something completely different from what A sees when looking at the file.

Alexander V Vershilov writes:
To force Haskell to use UTF-8, you can read the input as a byte sequence and decode it as UTF-8 characters

but of course, the programmer can only hope that UTF-8 will work here. If the user is typing in KOI8-R, reading it as UTF-8 is going to be wrong.

-- Jón Fairbairn Jon.Fairbairn@cl.cam.ac.uk
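The KOI8-R caveat can be made concrete. In this sketch the byte values for "резюме" in KOI8-R are written out by hand, and `decodeUtf8'` from Data.Text.Encoding is used because it reports invalid input as a `Left` value instead of throwing; the name `koi8Bytes` is made up for illustration.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text.Encoding as TE

-- The word "резюме" encoded in KOI8-R (byte values written by hand).
koi8Bytes :: B.ByteString
koi8Bytes = B.pack [0xD2, 0xC5, 0xDA, 0xC0, 0xCD, 0xC5]

main :: IO ()
main =
  case TE.decodeUtf8' koi8Bytes of
    -- These bytes are not a valid UTF-8 sequence, so this branch runs.
    Left err -> putStrLn ("not valid UTF-8: " ++ show err)
    Right t  -> print t
```

Here the mismatch is at least detected. Other single-byte input could happen to form valid UTF-8 and decode silently to the wrong characters, which is the harder case the message warns about.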

I would like to point out again that I am talking about Windows. I don't care about Linux; I'm sure you already threw away all those stupid legacy single-byte and multibyte code pages and migrated to UTF-8 completely, but that's not quite the current state of Windows. The console still doesn't cope with Unicode very well.

Anyway, the problem is partially solved: I patched my WinGHCi so it no longer chokes on Unicode input, and as for the compiled .exe... I'll see.
(In reply to Jon Fairbairn's message of 2013/2/22.)

On 13-02-21 04:58 AM, Semyon Kholodnov wrote:
Windows console is locked to one specific local code page, and no codepage contains Latin-1, Cyrillic and Kanji symbols at the same time.

The Windows console is not locked to an anti-international code page; it only defaults to one. Use CHCP 65001 to switch to the UTF-8 code page.

Unfortunately, code page and encoding are only half of the battle; the other half is fonts. Most Windows fonts are incomplete; all Windows fixed-width fonts are incomplete. (Silver lining: Arial Unicode is sufficiently complete.) Therefore, you may be unable to display some characters, but they are the correct characters.
participants (6)
- Albert Y. C. Lai
- Alexander V Vershilov
- Erik Hesselink
- Jon Fairbairn
- MigMit
- Semyon Kholodnov