What's the status of Unicode characters in Haddock?

Hello all,
I made a small program for my factory and wanted to document it using Haddock. The comments are in French, and the resulting HTML pages are unreadable because the accented letters are mangled.
Using HTML entities is not acceptable, as I'd like the comments to remain readable when I edit the code.
Has anyone had the same problem? Found a workaround?
Thanks,
David.

On Fri, Jul 10, 2009 at 8:54 AM, david48 wrote:
Hello all,
I made a small program for my factory and I wanted to try to document it using haddock. The thing is, the comments are in French and the resulting html pages are unreadable because the accented letters are mangled.
It's not acceptable to use HTML entities, as I'd like the comments to remain readable when/if I edit the code.
Has anyone had the same problem? Found a workaround?
Not that I have any hope of being able to answer your question, but I think it might be useful if you informed us _where_ the characters are mangled. Is it when you view it in a browser, or when you open the Haddock-generated HTML files in a text editor?
/M
--
Magnus Therning (OpenPGP: 0xAB4DFBA4)
magnus@therning.org Jabber: magnus@therning.org
http://therning.org/magnus identi.ca|twitter: magthe

On Friday 10 July 2009 12:55:46 Magnus Therning wrote:
Not that I have any hope of being able to answer your question, but I think it might be useful if you informed us where the characters are mangled. Is it when you view it in a browser, or when you open the Haddock-generated HTML files in a text editor?
Both. I just checked. The source files were UTF-8 encoded, and every 2-byte letter was converted to a single byte, so nothing could be read.

On Fri, Jul 10, 2009 at 10:55 AM, Magnus Therning wrote:
On Fri, Jul 10, 2009 at 8:54 AM, david48 wrote:
Not that I have any hope of being able to answer your question, but I think it might be useful if you informed us _where_ the characters are mangled. Is it when you view it in a browser, or when you open the Haddock-generated HTML files in a text editor?
Sorry for the lack of information. They're mangled both in vim and in the browser. For example, in vim é becomes ^B.
Thanks,
David.

I ran a little experiment of my own, using a GHC HEAD build of a week or so ago. Here's a hex dump of my test source, so that we can see that it's really UTF-8.
$ od -xc Test.hs
0000000    6f6d    7564    656c    4d20    6961    206e    6877    7265
          m   o   d   u   l   e       M   a   i   n       w   h   e   r
0000020    0a65    2d0a    202d    207c    7250    6e69    7374    7420
          e  \n  \n   -   -       |       P   r   i   n   t   s       t
0000040    6568    7420    7865    2074    4822    6c65    6f6c    7720
          h   e       t   e   x   t       "   H   e   l   l   o       w
0000060    726f    646c    2e22    2d0a    202d    6548    6572    7327
          o   r   l   d   "   .  \n   -   -       H   e   r   e   '   s
0000100    6120    6520    7275    206f    6973    6e67    202c    82e2
              a       e   u   r   o       s   i   g   n   ,     342 202
0000120    20ac    5528    322b    4130    2943    202c    6e61    2064
        254       (   U   +   2   0   A   C   )   ,       a   n   d
0000140    6e61    6520    656c    656d    746e    6f2d    2066    6973
          a   n       e   l   e   m   e   n   t   -   o   f       s   i
0000160    6e67    203a    88e2    208a    5528    322b    3032    2941
          g   n   :     342 210 212       (   U   +   2   2   0   A   )
0000200    0a2e    616d    6e69    3a20    203a    4f49    2820    0a29
          .  \n   m   a   i   n       :   :       I   O       (   )  \n
0000220    616d    6e69    3d20    7020    7475    7453    4c72    206e
          m   a   i   n       =       p   u   t   S   t   r   L   n
0000240    4822    6c65    6f6c    7720    726f    646c    0a22
          "   H   e   l   l   o       w   o   r   l   d   "  \n
0000256
Then I invoked
$ haddock -h Test.hs
The generated Main.html contains this tag:
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8">
Firefox picks this up, because in the View menu, Character Encoding is set to UTF-8. Yet, I see the little blocks instead of the characters from my source file! Why?
$ od -xc Main.html
...
0003220    6120    6520    7275    206f    6973    6e67    202c    2004
              a       e   u   r   o       s   i   g   n   ,     004
...
0003260    6520    656c    656d    746e    6f2d    2066    6973    6e67
              e   l   e   m   e   n   t   -   o   f       s   i   g   n
0003300    203a    2004    5528    322b    3032    2941    0a2e    2f3c
          :     004       (   U   +   2   2   0   A   )   .  \n   <   /
It seems that Haddock replaced both characters with a 0x04 (ASCII end-of-transmission) byte!
Apparently you've hit a bug in Haddock. Since Haskell source files are UTF-8 by definition, and the HTML file it produces is also UTF-8, this is clearly incorrect behaviour.
Thomas
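
The multi-byte sequences in the dump above can be cross-checked by hand-encoding the two code points. The following is only a minimal sketch: `utf8Encode` is a hypothetical helper written for this illustration, not part of GHC or Haddock, and it uses nothing beyond `base`. It encodes U+20AC and U+220A and prints the bytes in hex and octal so they can be compared with the `od` output for Test.hs.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Numeric (showHex, showOct)

-- | Encode one character as its UTF-8 byte values.
-- Hypothetical helper for illustration only.
utf8Encode :: Char -> [Int]
utf8Encode c
  | n < 0x80    = [n]
  | n < 0x800   = [0xC0 .|. (n `shiftR` 6), cont n]
  | n < 0x10000 = [0xE0 .|. (n `shiftR` 12), cont (n `shiftR` 6), cont n]
  | otherwise   = [ 0xF0 .|. (n `shiftR` 18)
                  , cont (n `shiftR` 12), cont (n `shiftR` 6), cont n ]
  where
    n = ord c
    -- Continuation byte: marker bits 10 followed by the low six bits.
    cont x = 0x80 .|. (x .&. 0x3F)

main :: IO ()
main = mapM_ report ['\x20AC', '\x220A']
  where
    -- Print the code point, then its UTF-8 bytes in hex and in octal.
    report c = putStrLn $ unwords
      [ "U+" ++ showHex (ord c) ""
      , "hex:",   unwords [showHex b "" | b <- utf8Encode c]
      , "octal:", unwords [showOct b "" | b <- utf8Encode c]
      ]
```

For U+20AC this yields the bytes e2 82 ac (octal 342 202 254) and for U+220A the bytes e2 88 8a (octal 342 210 212), which match the three-byte sequences in Test.hs. In Main.html each of those sequences has been replaced by the single byte 004.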

On Friday, 10 July 2009 09:54, david48 wrote:
Hello all,
I made a small program for my factory and I wanted to try to document it using haddock. The thing is, the comments are in French and the resulting html pages are unreadable because the accented letters are mangled.
It's not acceptable to use HTML entities, as I'd like the comments to remain readable when/if I edit the code.
Has anyone had the same problem? Found a workaround?
Thanks,
David.
To my knowledge, Haddock only supports ASCII as input encoding. If you want to have characters outside ASCII, you have to escape them using something like . Best wishes, Wolfgang
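
As a rough illustration of what such escaping looks like, here is a minimal sketch. It assumes the escape in question is a decimal character reference of the form `&#233;`, which Haddock markup accepts; the exact form meant in the message above is not preserved. `escapeNonAscii` is a hypothetical helper, not a Haddock function, and it is intended for comment text only, since applying it to a whole source file would also rewrite string literals.

```haskell
import Data.Char (ord)

-- | Replace every non-ASCII character with a decimal character
-- reference such as &#233;. Hypothetical helper for illustration.
escapeNonAscii :: String -> String
escapeNonAscii = concatMap escape
  where
    escape c
      | ord c < 128 = [c]                          -- plain ASCII stays as-is
      | otherwise   = "&#" ++ show (ord c) ++ ";"  -- everything else is escaped

main :: IO ()
main =
  -- "\233" is the Haskell escape for U+00E9 (e with acute accent),
  -- used here so that this file itself stays pure ASCII.
  putStrLn (escapeNonAscii "-- | s\233lection de l'\233tat")
```

Running it prints `-- | s&#233;lection de l'&#233;tat`. Such a step would have to run on a copy of the sources before invoking Haddock, so that the original comments stay readable.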

On Fri, Jul 17, 2009 at 4:37 PM, Wolfgang Jeltsch wrote:
To my knowledge, Haddock only supports ASCII as input encoding. If you want to have characters outside ASCII, you have to escape them using something like .
Which would mean that, while editing the code, I'd have to read comments like this:
-- | s lection de l' tat
Which becomes totally unreadable. :(
David

On Friday, 17 July 2009 16:43, you wrote:
On Fri, Jul 17, 2009 at 4:37 PM, Wolfgang Jeltsch wrote:
To my knowledge, Haddock only supports ASCII as input encoding. If you want to have characters outside ASCII, you have to escape them using something like .
Which would mean, while editing the code I'd have to read comments like that :
-- | s lection de l' tat
Which becomes totally unreadable.
:(
Yes, it’s a pity. For me, it’s not such a big problem since I don’t write my Haddock comments in my native language (German) but in English. I only experience this problem because I use nice typography, i.e., “ ” – instead of " " -.
GHC supports UTF-8 input, and Haddock uses GHC nowadays. So, in my opinion, Haddock should also support UTF-8 input. Do you want to file a feature request?
Best wishes,
Wolfgang

On Fri, Jul 17, 2009 at 4:05 PM, Wolfgang Jeltsch wrote:
Yes, it’s a pity. For me, it’s not such a big problem since I don’t write my Haddock comments in my native language (German) but in English. I only experience this problem because I use nice typography, i.e., “ ” – instead of " " -.
I would write the comments in English, but as it is, it's a little piece of code for our factory that's never going to be released. Still, I wanted to document it properly, and my boss can't read English.
GHC supports UTF-8 input, and Haddock uses GHC nowadays. So, in my opinion, Haddock should also support UTF-8 input. Do you want to file a feature request?
Sure. I'm registering on the Haddock Trac site and will search the tickets.
David.

On Fri, Jul 17, 2009 at 4:15 PM, david48 wrote:
On Fri, Jul 17, 2009 at 4:05 PM, Wolfgang Jeltsch wrote:
GHC supports UTF-8 input, and Haddock uses GHC nowadays. So, in my opinion, Haddock should also support UTF-8 input. Do you want to file a feature request?
Sure. I'm registering to haddock trac site and will search the tickets.
There are already two tickets about Unicode or character handling: #20 and #116. It doesn't look like it's a hot issue :(
David.
participants (5)
- david48
- Khudyakov Alexey
- Magnus Therning
- Thomas ten Cate
- Wolfgang Jeltsch