darcs patch: Add UTF8 converting and outputting functions

Thu Oct 12 22:55:03 JST 2006 mukai@jmuk.org * Add UTF8 converting and outputting functions

In GHC 6.6, source code can include UTF-8 characters, which are converted to Unicode chars. However, there is no (easy) way to convert to/from UTF-8 in GHC's standard libraries. When outputting Unicode characters in current GHC, hPutChar overflows their code points and just prints the lower 8 bits. This has not been a serious problem because past Haskell programs contained few Unicode characters. By now, code containing Unicode characters has become widespread because of UTF-8 source support. Therefore, the base library must have some way to deal with this problem, IMHO.

Best regards, Jun Mukai mukai@jmuk.org
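The patch itself is only quoted in fragments below, but the conversion Jun describes can be sketched in portable Haskell. This is an illustrative encoder, not the patch's code: the name charToUTF8 and the [Word8] result type are assumptions (the patch's charToUTF8Chars returns [Char]), and it uses only Data.Bits, Data.Char, and Data.Word.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- Encode one Char as its UTF-8 byte sequence (RFC 3629).
-- Hypothetical stand-in for the patch's charToUTF8Chars.
charToUTF8 :: Char -> [Word8]
charToUTF8 c
  | n < 0x80    = [fromIntegral n]                          -- 1 byte: ASCII
  | n < 0x800   = [0xC0 .|. hi 6, cont 0]                   -- 2 bytes
  | n < 0x10000 = [0xE0 .|. hi 12, cont 6, cont 0]          -- 3 bytes
  | otherwise   = [0xF0 .|. hi 18, cont 12, cont 6, cont 0] -- 4 bytes
  where
    n      = ord c
    hi k   = fromIntegral (n `shiftR` k)                    -- leading-byte payload
    cont k = 0x80 .|. (fromIntegral (n `shiftR` k) .&. 0x3F) -- 10xxxxxx byte
```

For example, charToUTF8 '\x3042' (HIRAGANA LETTER A) yields the three bytes 0xE3, 0x81, 0x82.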

Hi Jun, On Sun, Oct 15, 2006 at 01:56:07AM +0900, mukai@jmuk.org wrote:
hunk ./Data/Char.hs 54
+#ifdef __GLASGOW_HASKELL__
+    -- * Converting UTF-8 from/to encoding
+    , charToUTF8Chars  -- :: Char -> [Char]
+    , toUTF8String     -- :: String -> String
+    , fromUTF8String   -- :: String -> String
+#endif
Why is all this GHC-only?
hunk ./System/IO.hs 186
+import GHC.Enum
+import Data.Char (chr, charToUTF8Chars)
hunk ./System/IO.hs 344
+ [...]
+hPutCharUTF8 :: Handle -> Char -> IO ()
+hPutCharUTF8 hndl c = hPutStr hndl $ map (chr.fromEnum) $ charToUTF8Chars c
+ [...]
+putStrLnUTF8 = hPutStrLnUTF8 stdout
+
If the earlier stuff really does need to be GHC-only, shouldn't this be too? Thanks Ian

Hello Ian,
Why is all this GHC-only? If the earlier stuff really does need to be GHC-only, shouldn't this be too?
I agree with you. There is no need for it to be GHC-only. However, my code imports GHC.List and GHC.Word (not Data.List and Data.Word, because those import Data.Char). These requirements are implementation dependent. I have little knowledge about other implementations; I do not know the appropriate module names for NHC or Hugs or other implementations, and I have no testing environments for them. Also, the problem of UTF-8 encoding is currently GHC-specific, IMHO. I'd like to solve it there first. I intend to discuss whether UTF-8 code should be included in base or not. The restrictions can be lifted later, if needed. Thanks Jun Mukai mukai@jmuk.org

mukai:
Why is all this GHC-only? If the earlier stuff really does need to be GHC-only, shouldn't this be too?
I agree with you. There is no need for it to be GHC-only.
However, my code imports GHC.List and GHC.Word (not Data.List and Data.Word, because those import Data.Char). These requirements are implementation dependent. I have little knowledge about other implementations; I do not know the appropriate module names for NHC or Hugs or other implementations, and I have no testing environments for them.
If you do want to work on the base library, it's _essential_ in my opinion to install Hugs and test the code under it as well. This is quite easy if you develop the code separately in a cabalised library, since you can switch between Hugs and GHC with ./Setup.hs configure --hugs. Really, really useful feature. I also recommend using QuickCheck to define properties, since those properties can then be tested under both Hugs and GHC. In fact, I'd go so far as to propose that no pure code be allowed into base without accompanying QuickCheck properties.... Any takers?

Should we formalise a policy for adding contributions to the core libraries? Along the lines of, must be:

* haddockised
* comes with QuickChecks
* full type annotations
* cabalised
* ...

-- Don
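What Don proposes might look like the following sketch: a small stand-in encoder (hypothetical, not the patch's actual code) shipped together with QuickCheck properties that can be run under both GHC and Hugs. The property names are illustrative.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)
import Test.QuickCheck (quickCheck)

-- Hypothetical stand-in for the patch's charToUTF8Chars.
encodeChar :: Char -> [Word8]
encodeChar c
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [0xC0 .|. hi 6, cont 0]
  | n < 0x10000 = [0xE0 .|. hi 12, cont 6, cont 0]
  | otherwise   = [0xF0 .|. hi 18, cont 12, cont 6, cont 0]
  where
    n      = ord c
    hi k   = fromIntegral (n `shiftR` k)
    cont k = 0x80 .|. (fromIntegral (n `shiftR` k) .&. 0x3F)

-- Pure code accompanied by QuickCheck properties, as proposed:

-- Every Char encodes to between 1 and 4 bytes.
prop_lengthBounded :: Char -> Bool
prop_lengthBounded c = length (encodeChar c) `elem` [1 .. 4]

-- Every byte after the first is a 10xxxxxx continuation byte.
prop_tailIsContinuation :: Char -> Bool
prop_tailIsContinuation c = all (\b -> b .&. 0xC0 == 0x80) (tail (encodeChar c))

main :: IO ()
main = mapM_ quickCheck [prop_lengthBounded, prop_tailIsContinuation]
```

The point is less the encoder than the discipline: the properties document the code's invariants and can be checked on any implementation the library claims to support.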

On Thu, 19 Oct 2006 13:27:02 +0900, Donald Bruce Stewart
In fact, I'd go so far as to propose that no pure code be allowed into base without accompanying QuickCheck properties.... Any takers?
Should we formalise a policy for adding contributions to the core libraries? Along the lines of, must use:
* haddockised
* comes with QuickChecks
* full type annotations
* cabalised
* ...
And the Extra Libraries. Adding contributions to the extra libraries is not an easy task either. (Except in the special case where the contributor becomes the new maintainer.) -- shelarcy <shelarcy capella.freemail.ne.jp> http://page.freett.com/shelarcy/

Jun Mukai
Why is all this GHC-only?
I agree with you. There is no need for it to be GHC-only.
However, my code imports GHC.List and GHC.Word (not Data.List and Data.Word, because those import Data.Char). These requirements are implementation dependent.
But what do you use from GHC.List and GHC.Word that are not available in Data.List or Data.Word? At a glance, I could not see anything special.
Also, the problem of UTF-8 encoding is currently GHC-specific,
Not entirely. Other compilers may or may not yet read source files in a specific encoding, but the general Haskell programmer is certainly going to want to do so, for arbitrary text. Regards, Malcolm

Hello Jun, Thursday, October 19, 2006, 7:31:43 AM, you wrote:
UTF-8 code
jfyi: the jhc compiler includes functions that do the same job. i've included the appropriate files. there is also fast ghc-optimized code that does utf8 encoding/decoding, but i don't remember where i've seen it -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

Hello Bulat,
On Thu, 19 Oct 2006 22:13:54 +0900, Bulat Ziganshin
UTF-8 code
jfyi: the jhc compiler includes functions that do the same job. i've included the appropriate files. there is also fast ghc-optimized code that does utf8 encoding/decoding, but i don't remember where i've seen it
I think the ghc-optimized code is here: http://cvs.haskell.org/darcs/ghc-6.6/ghc/compiler/utils/Encoding.hs So you can use these functions via the ghc package (GHC as a Library). But depending on GHC as a Library is rather bad. I think a better way is to expose this (or a wrapper function) in GHC.Prim. Best Regards, -- shelarcy <shelarcy capella.freemail.ne.jp> http://page.freett.com/shelarcy/

On Sun, Oct 15, 2006 at 01:56:07AM +0900, mukai@jmuk.org wrote:
In GHC 6.6, source code can include UTF-8 characters, which are converted to Unicode chars. However, there is no (easy) way to convert to/from UTF-8 in GHC's standard libraries.
As others have pointed out, the conversion functions are not GHC-specific:

charToUTF8Chars :: Char -> [Word8]
toUTF8String :: String -> [Word8]
fromUTF8String :: [Word8] -> String

The fromUTF8String side probably also needs a way to report illegal and incomplete encodings. As for the I/O part, your implementation assumes that hPutChar writes a byte to a Handle, which is currently the case in GHC, but this is arguably a bug, and it's not the case in Hugs and Jhc. I think we need to work out a plan for Unicode I/O in Haskell, and then work towards that. For the current state, see http://haskell.galois.com/cgi-bin/haskell-prime/trac.cgi/wiki/CharAsUnicode
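Ross's point about reporting illegal and incomplete encodings could be sketched like this. It is a simplified, hypothetical decoder (the name fromUTF8 and the Either-based error reporting are assumptions, not part of the patch), and for brevity it does not reject overlong forms, surrogates, or out-of-range code points, which a real implementation should.

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Char (chr)
import Data.Word (Word8)

-- Decode UTF-8 bytes, reporting illegal or incomplete input via Left.
fromUTF8 :: [Word8] -> Either String String
fromUTF8 [] = Right []
fromUTF8 (w:ws)
  | w < 0x80  = (chr (fromIntegral w) :) <$> fromUTF8 ws
  | w < 0xC0  = Left "unexpected continuation byte"
  | w < 0xE0  = multi 1 (fromIntegral w .&. 0x1F) ws
  | w < 0xF0  = multi 2 (fromIntegral w .&. 0x0F) ws
  | w < 0xF8  = multi 3 (fromIntegral w .&. 0x07) ws
  | otherwise = Left "illegal leading byte"
  where
    -- Accumulate n continuation bytes into the code point built so far.
    multi :: Int -> Int -> [Word8] -> Either String String
    multi 0 acc rest = (chr acc :) <$> fromUTF8 rest
    multi n acc (b:rest)
      | b .&. 0xC0 == 0x80 =
          multi (n - 1) ((acc `shiftL` 6) .|. (fromIntegral b .&. 0x3F)) rest
      | otherwise = Left "malformed continuation byte"
    multi _ _ [] = Left "incomplete encoding"
```

For instance, fromUTF8 [0xE3, 0x81] reports an incomplete encoding instead of silently producing a replacement character, which is one of the design questions an API like fromUTF8String has to settle.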
participants (8)
- Bulat Ziganshin
- dons@cse.unsw.edu.au
- Ian Lynagh
- Jun Mukai
- Malcolm Wallace
- mukai@jmuk.org
- Ross Paterson
- shelarcy