gbp sign showing as unknown character by GHC

Quick question: I've tested this in a couple of different terminals (roxterm and xterm), so I'm fairly sure it's GHC that's the problem. Have I missed a setting?

    GHCi, version 6.10.4
    Prelude> putStrLn "£"
    �

    Hugs98 200609-3
    Hugs> putStrLn "£"
    £

I get the same character output from a password generator I've written, after compilation with GHC:

    [iainb]$ ./makepass2 50 2 >> testfile.txt
    [iainb]$ cat testfile.txt
    H(xW!:maNyxZ;h,IW=Uu4G$ztc>k@Q[g6?y:�TbG&5Nd")+"5+

Iain

On Wed, Aug 19, 2009 at 10:31 AM, Iain Barnett wrote:
> Quick question: I've tested this in a couple of different terminals (roxterm and xterm), so I'm fairly sure it's GHC that's the problem. Have I missed a setting?
>
>     GHCi, version 6.10.4
>     Prelude> putStrLn "£"
>     �
>
>     Hugs98 200609-3
>     Hugs> putStrLn "£"
>     £

ghc-6.10.4 and earlier don't automatically encode/decode Unicode characters. So on terminals which don't use the latin-1 encoding, you need to do the conversion explicitly with a separate package such as utf8-string, iconv or text-icu. For example, on OS X:

    $ echo $LANG
    en_US.UTF-8
    $ ghci
    Prelude> putStrLn "£"
    ?
    Prelude> System.IO.UTF8.putStrLn "£"
    £

The conversion is done automatically by hugs, which is why the outputs differ. This feature will also be supported in ghc-6.12.

-Judah
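To make the "explicit conversion" concrete, here is a minimal sketch of what a UTF-8 encoding step does before bytes are written to the terminal. It uses only the base library; the names `encodeChar` and `utf8Encode` are hypothetical helpers for illustration and are not the API of utf8-string, iconv or text-icu. It does no validation (surrogates, for instance, are not rejected).

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- Hypothetical helper: encode a single code point as UTF-8 bytes.
-- No validation is performed; this is an illustration only.
encodeChar :: Char -> [Word8]
encodeChar c
  | n < 0x80    = [fromIntegral n]                           -- ASCII: one byte
  | n < 0x800   = [0xC0 .|. hi 6, cont 0]                    -- two bytes
  | n < 0x10000 = [0xE0 .|. hi 12, cont 6, cont 0]           -- three bytes
  | otherwise   = [0xF0 .|. hi 18, cont 12, cont 6, cont 0]  -- four bytes
  where
    n = ord c

    -- Leading payload bits, shifted down into the low end of the byte.
    hi :: Int -> Word8
    hi s = fromIntegral (n `shiftR` s)

    -- Continuation byte: 10xxxxxx carrying six payload bits.
    cont :: Int -> Word8
    cont s = 0x80 .|. fromIntegral ((n `shiftR` s) .&. 0x3F)

-- Hypothetical helper: encode a whole String.
utf8Encode :: String -> [Word8]
utf8Encode = concatMap encodeChar
```

Loaded into GHCi, `utf8Encode "\163"` (the pound sign, U+00A3) gives `[194,163]`, which is the two-byte sequence a UTF-8 terminal expects instead of the single byte 163.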

"Judah" == Judah Jacobson writes:
Judah> On Wed, Aug 19, 2009 at 10:31 AM, Iain Barnett

Hello Colin,

Thursday, August 20, 2009, 10:13:28 AM, you wrote:

> I don't understand where latin-1 comes into this. String is supposed to be a list of Unicode characters.

but ghc 6.10 i/o used String as list of bytes

--
Best regards,
 Bulat mailto:Bulat.Ziganshin@gmail.com

"Bulat" == Bulat Ziganshin writes:
Bulat> Hello Colin,
Bulat> Thursday, August 20, 2009, 10:13:28 AM, you wrote:

>> I don't understand where latin-1 comes into this. String is supposed
>> to be a list of Unicode characters.

Bulat> but ghc 6.10 i/o used String as list of bytes

But how do you get Latin-1 bytes from a Unicode string? This would need a transcoding process.

--
Colin Adams
Preston Lancashire

On Thu, Aug 20, 2009 at 4:28 PM, Colin Paul Adams wrote:
> But how do you get Latin-1 bytes from a Unicode string? This would need a transcoding process.

The first 256 code points of Unicode coincide with Latin-1. Therefore, if you truncate Unicode characters down to 8 bits you'll effectively end up with Latin-1 text (except that any code points above U+00FF will give strange results). If your terminal then interprets these bytes as UTF-8 (or anything else, really), the result will be gibberish or worse.

Stuart
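The truncation described above can be modelled in a few lines. This is only a sketch of the observable effect, assuming the output path keeps the low 8 bits of each code point; it is not taken from GHC's I/O library, and `truncateToByte` and `truncateString` are hypothetical names.

```haskell
import Data.Char (ord)
import Data.Word (Word8)

-- Hypothetical model: keep only the low 8 bits of a code point.
-- fromIntegral from Int to Word8 wraps modulo 256.
truncateToByte :: Char -> Word8
truncateToByte = fromIntegral . ord

-- Apply the truncation to every character of a String.
truncateString :: String -> [Word8]
truncateString = map truncateToByte
```

Under this model `truncateString "\163"` is `[163]`, the Latin-1 byte for the pound sign, which is not a valid UTF-8 sequence on its own. A code point above U+00FF loses its high bits: `truncateString "\39233"` (U+9941) is `[65]`, the byte for `A`.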

"Stuart" == Stuart Cook writes:
Stuart> On Thu, Aug 20, 2009 at 4:28 PM, Colin Paul Adams

Hello Colin,

Thursday, August 20, 2009, 11:12:53 AM, you wrote:

> Yes, but surely this will work both ways. The same bytes on input should come back on output, shouldn't they?

only ascii subset that have fixed encoding. the rest may migrate in some way

--
Best regards,
 Bulat mailto:Bulat.Ziganshin@gmail.com

On Thu, Aug 20, 2009 at 5:12 PM, Colin Paul Adams wrote:
> Yes, but surely this will work both ways. The same bytes on input should come back on output, shouldn't they?

I would have thought so, but apparently this isn't actually what happens.

    GHCi, version 6.8.2: http://www.haskell.org/ghc/  :? for help
    Loading package base ... linking ... done.
    Prelude> map Data.Char.ord "饁"
    [39233]      <== 0x9941
    Prelude> putStrLn "饁"
    A            <== 0x41

It seems that GHCi is clever enough to decode UTF-8 input, which only serves to confuse System.IO even more.

Stuart

Stuart Cook writes:
>     GHCi, version 6.8.2: http://www.haskell.org/ghc/  :? for help
>     Loading package base ... linking ... done.
>     Prelude> map Data.Char.ord "饁"
>     [39233]      <== 0x9941
>     Prelude> putStrLn "饁"
>     A            <== 0x41
>
> It seems that GHCi is clever enough to decode UTF-8 input, which only serves to confuse System.IO even more.

I get:

    GHCi, version 6.8.2: http://www.haskell.org/ghc/  :? for help
    Loading package base ... linking ... done.
    Prelude> map Data.Char.ord "饁"
    [39233]

and

    Prelude> map Data.Char.ord "£"
    [163]

but also:

    % ghci -e 'map Data.Char.ord "饁"'
    <interactive>:1:21:
        lexical error in string/character literal at character '\129'

but again:

    % ghci -e 'map Data.Char.ord "£"'
    [194,163]

So GHCi used interactively translates input from the terminal's UTF-8, but truncates output to eight bits. Executing a string with -e, it appears to read byte for byte (which I think was the original behavior at some point).

-k
--
If I haven't seen further, it is by standing in the footprints of giants
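The two input behaviours seen above (decoded interactively, byte for byte with -e) can be reproduced with a small model. This is a sketch assuming the terminal sends UTF-8; it does not describe how GHCi is implemented. `bytesAsChars` and `decodeUtf8` are hypothetical names, and the decoder handles only 1 to 3 byte sequences without validating continuation bytes.

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Char (chr)
import Data.Word (Word8)

-- Model of "byte for byte" reading: every byte becomes one Char.
bytesAsChars :: [Word8] -> String
bytesAsChars = map (chr . fromIntegral)

-- Simplified UTF-8 decoder (1 to 3 byte sequences, no validation).
-- Anything it does not recognise becomes U+FFFD.
decodeUtf8 :: [Word8] -> String
decodeUtf8 [] = []
decodeUtf8 (b0 : rest)
  | b0 < 0x80 = chr (int b0) : decodeUtf8 rest
  | b0 .&. 0xE0 == 0xC0, (b1 : rest') <- rest =
      chr (((int b0 .&. 0x1F) `shiftL` 6) .|. low b1) : decodeUtf8 rest'
  | b0 .&. 0xF0 == 0xE0, (b1 : b2 : rest') <- rest =
      chr (((int b0 .&. 0x0F) `shiftL` 12) .|. (low b1 `shiftL` 6) .|. low b2)
        : decodeUtf8 rest'
  | otherwise = '\xFFFD' : decodeUtf8 rest
  where
    int :: Word8 -> Int
    int = fromIntegral

    -- Six payload bits of a continuation byte.
    low :: Word8 -> Int
    low b = int b .&. 0x3F
```

With the UTF-8 bytes of the pound sign, `map Data.Char.ord (bytesAsChars [0xC2, 0xA3])` gives `[194,163]`, matching the -e result, while `map Data.Char.ord (decodeUtf8 [0xC2, 0xA3])` gives `[163]`, matching the interactive one. Likewise `decodeUtf8 [0xE9, 0xA5, 0x81]` yields the single code point 39233, and its third byte is the `'\129'` named in the lexical error.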

2009/8/20 Ketil Malde:
> Stuart Cook writes:
>>     GHCi, version 6.8.2: http://www.haskell.org/ghc/  :? for help
>>     Loading package base ... linking ... done.
>>     Prelude> map Data.Char.ord "饁"
>>     [39233]      <== 0x9941
>>     Prelude> putStrLn "饁"
>>     A            <== 0x41
>>
>> It seems that GHCi is clever enough to decode UTF-8 input, which only serves to confuse System.IO even more.
>
> I get:
>
>     GHCi, version 6.8.2: http://www.haskell.org/ghc/  :? for help
>     Loading package base ... linking ... done.
>     Prelude> map Data.Char.ord "饁"
>     [39233]
>
> and
>
>     Prelude> map Data.Char.ord "£"
>     [163]
>
> but also:
>
>     % ghci -e 'map Data.Char.ord "饁"'
>     <interactive>:1:21:
>         lexical error in string/character literal at character '\129'
>
> but again:
>
>     % ghci -e 'map Data.Char.ord "£"'
>     [194,163]
>
> So GHCi used interactively translates input from the terminal's UTF-8, but truncates output to eight bits. Executing a string with -e, it appears to read byte for byte (which I think was the original behavior at some point).
>
> -k

I get the same behaviour here. If you want to try Latin 1 (ISO-8859-1) then you can use a utility called Luit (maybe only Linux?):

    luit -encoding ISO-8859-1 ghci

£ becomes £, but gives the same byte output as above.

Iain

On Aug 20, 2009, at 05:07, Ketil Malde wrote:
>     % ghci -e 'map Data.Char.ord "饁"'
>     <interactive>:1:21:
>         lexical error in string/character literal at character '\129'
>
> but again:
>
>     % ghci -e 'map Data.Char.ord "£"'
>     [194,163]
>
> So GHCi used interactively translates input from the terminal's UTF-8, but truncates output to eight bits. Executing a string with -e, it appears to read byte for byte (which I think was the original behavior at some point).

Makes sense; absent utf8-string, System.Environment.getArgs only groks bytes.

--
brandon s. allbery [solaris,freebsd,perl,pugs,haskell] allbery@kf8nh.com
system administrator [openafs,heimdal,too many hats] allbery@ece.cmu.edu
electrical and computer engineering, carnegie mellon university    KF8NH

Got this back from the bug tracker:

> 6.12.1 will have Unicode support in the IO library which mostly fixes this problem. The rest is fixed by #3398.

Iain
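For readers on GHC 6.12 or later, the following is a minimal sketch of selecting the output encoding with the base library's `System.IO`, assuming you want UTF-8 regardless of the locale. `hSetEncoding`, `stdout` and `utf8` are exported by `System.IO` from base 4.2 onwards; `putStrLnUtf8` is a hypothetical wrapper name. It does not apply to 6.10.4, which is the version discussed in this thread.

```haskell
import System.IO (hSetEncoding, stdout, utf8)

-- Hypothetical wrapper: force UTF-8 on stdout, then print the string.
-- Requires base >= 4.2 (GHC 6.12 or later).
putStrLnUtf8 :: String -> IO ()
putStrLnUtf8 s = do
  hSetEncoding stdout utf8
  putStrLn s
```

In those versions handles use the locale's encoding by default, so with a UTF-8 locale a plain `putStrLn "£"` may already print correctly; setting the encoding explicitly is only needed when the locale cannot be relied on.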
participants (7)
- Brandon S. Allbery KF8NH
- Bulat Ziganshin
- Colin Paul Adams
- Iain Barnett
- Judah Jacobson
- Ketil Malde
- Stuart Cook