UTF-8 decoding error - Glasgow-haskell-users - Haskell.org

UTF-8 decoding error

Christian Maeder

20 Sep 2006

4:14 p.m.

How can I convince ghc version 6.5.20060919 to accept latin1 characters in literals?

I wish to keep source files (containing umlauts in strings) that can be compiled by both ghc-6.4.2 and ghc-6.6.

Christian


Duncan Coutts

20 Sep 2006

8:42 p.m.

On Wed, 2006-09-20 at 18:14 +0200, Christian Maeder wrote:

> How can I convince ghc version 6.5.20060919 to accept latin1 characters in literals?
>
> I wish to keep source files (containing umlauts in strings) that can be compiled by both ghc-6.4.2 and ghc-6.6.

You can use numeric escapes like "\222".

Duncan


Christian Maeder

21 Sep 2006

8:15 a.m.

Duncan Coutts wrote:

> On Wed, 2006-09-20 at 18:14 +0200, Christian Maeder wrote:
>
>> How can I convince ghc version 6.5.20060919 to accept latin1 characters in literals?
>>
>> I wish to keep source files (containing umlauts in strings) that can be compiled by both ghc-6.4.2 and ghc-6.6.
>
> You can use numeric escapes like "\222".

How about resolving http://cvs.haskell.org/trac/ghc/ticket/690 ?

I would like to simply add a flag to ghc. How about adding an additional phase (and extension) to ghc before preprocessing?

Your solution does not support good readability. Is there a ready-made tool that automatically converts my source files as you suggested? How does haddock handle characters in comments?

I'm sort of stuck currently testing the ghc-6.6 RC.

Christian
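On the question of an automatic conversion: the following is only a minimal sketch, not an existing tool, of how Latin-1 bytes could be rewritten as decimal escapes with standard POSIX utilities. The function name `latin1_to_escapes` is made up for illustration. It appends `\&` after each escape so that a following digit is not read as part of the number; that form is valid only inside string literals (not character literals), and the sketch also rewrites non-ASCII bytes in comments, so its output would need reviewing.

```shell
#!/bin/sh
# Hypothetical helper: print FILE with every byte >= 128 replaced by a
# Haskell decimal escape followed by the empty escape \& (e.g. \228\&).
# Assumption: FILE is Latin-1, so one byte corresponds to one character.
latin1_to_escapes() {
    LC_ALL=C od -An -v -tu1 < "$1" | LC_ALL=C awk '{
        for (i = 1; i <= NF; i++) {
            b = $i + 0
            if (b >= 128) {
                printf "\\%d\\&", b
            } else {
                printf "%c", b
            }
        }
    }'
}

# When run as a script with a file argument, convert that file to stdout.
if [ "$#" -ge 1 ]; then
    latin1_to_escapes "$1"
fi
```

With this, a line such as `x = "Bär"` stored in Latin-1 would come out as `x = "B\228\&r"`, which both an ASCII-only and a UTF-8-expecting compiler can read.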


Ross Paterson

8:36 a.m.

On Thu, Sep 21, 2006 at 10:15:45AM +0200, Christian Maeder wrote:

> How does haddock handle characters in comments?

Section 3.8.3 of the Haddock manual:

3.8.3. Character references

Although Haskell source files may contain any character from the Unicode character set, the encoding of these characters as bytes varies between systems, so that only source files restricted to the ASCII character set are portable. Other characters may be specified in character and string literals using Haskell character escapes. To represent such characters in documentation comments, Haddock supports SGML-style numeric character references of the forms &#D; and &#xH; where D and H are decimal and hexadecimal numbers denoting a code position in Unicode (or ISO 10646). For example, the references &#x3BB;, &#x3bb; and &#955; all represent the lower-case letter lambda.

Not pretty, but it is portable and not limited to the Latin-1 subset.


Christian Maeder

9:49 a.m.

Currently haddock correctly translates latin1 chars, i.e. äöü, to the corresponding HTML character references. So it would be nice if ghc-6.6 could also remain backward compatible by supporting latin1 sources.

Christian

Ross Paterson wrote:

> On Thu, Sep 21, 2006 at 10:15:45AM +0200, Christian Maeder wrote:
>
>> How does haddock handle characters in comments?
>
> Section 3.8.3 of the Haddock manual:
>
> 3.8.3. Character references
>
> Although Haskell source files may contain any character from the Unicode character set, the encoding of these characters as bytes varies between systems, so that only source files restricted to the ASCII character set are portable. Other characters may be specified in character and string literals using Haskell character escapes. To represent such characters in documentation comments, Haddock supports SGML-style numeric character references of the forms &#D; and &#xH; where D and H are decimal and hexadecimal numbers denoting a code position in Unicode (or ISO 10646). For example, the references &#x3BB;, &#x3bb; and &#955; all represent the lower-case letter lambda.
>
> Not pretty, but it is portable and not limited to the Latin-1 subset.


Duncan Coutts

8:36 a.m.

On Thu, 2006-09-21 at 10:15 +0200, Christian Maeder wrote:

> Duncan Coutts wrote:
>
>> On Wed, 2006-09-20 at 18:14 +0200, Christian Maeder wrote:
>>
>>> How can I convince ghc version 6.5.20060919 to accept latin1 characters in literals?
>>>
>>> I wish to keep source files (containing umlauts in strings) that can be compiled by both ghc-6.4.2 and ghc-6.6.
>>
>> You can use numeric escapes like "\222".
>
> How about resolving http://cvs.haskell.org/trac/ghc/ticket/690 ?
>
> I would like to simply add a flag to ghc. How about adding an additional phase (and extension) to ghc before preprocessing?
>
> Your solution does not support good readability. Is there a ready-made tool that automatically converts my source files as you suggested? How does haddock handle characters in comments?

There is iconv. It could be used as a pre-processor with ghc's -F -pgmF -optF flags.

Sorry there isn't a better solution at the moment. You could petition for an {-# ENCODING ISO-8859-1 #-} pragma as mentioned in that ticket.

Duncan


Christian Maeder

12:58 p.m.

Duncan Coutts wrote:

> There is iconv. It could be used as a pre-processor with ghc's -F -pgmF -optF flags.

NB: -F is missing in the Flag reference.

A simple script for the pgmF command

  #!/bin/sh
  iconv -f l1 -t utf-8 $2 > $3

worked for me, thanks!

> Sorry there isn't a better solution at the moment. You could petition for an {-# ENCODING ISO-8859-1 #-} pragma as mentioned in that ticket.

This seems unnecessary now.

Christian
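For reference, GHC calls the -pgmF command with three arguments: the name of the original source file, the file holding the input, and the file to write the output to; anything passed with -optF follows. Below is a slightly more defensive sketch of the script above, assuming an iconv that accepts the encoding names ISO-8859-1 and UTF-8. The function name and the script name used afterwards are illustrative only.

```shell
#!/bin/sh
# Hypothetical pre-processor for "ghc -F -pgmF <this script>".
# GHC passes: $1 = original source name, $2 = input file, $3 = output file.
latin1_to_utf8() {
    # Redirections instead of file arguments, so unusual file names
    # (spaces, leading dashes) are handled by the shell, not by iconv.
    iconv -f ISO-8859-1 -t UTF-8 < "$2" > "$3"
}

# Only act when called with the three arguments GHC supplies.
if [ "$#" -ge 3 ]; then
    latin1_to_utf8 "$@"
fi
```

Saved as, say, latin1-to-utf8.sh and made executable, it could be used along the lines of `ghc --make -F -pgmF ./latin1-to-utf8.sh Main.hs` (both file names are placeholders).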


Christian Maeder

22 Sep 2006

3:19 p.m.

Christian Maeder wrote:

> Duncan Coutts wrote:
>
>> There is iconv. It could be used as a pre-processor with ghc's -F -pgmF -optF flags.
>
> NB: -F is missing in the Flag reference
>
> A simple script for the pgmF command
>
>   #!/bin/sh
>   iconv -f l1 -t utf-8 $2 > $3
>
> worked for me, thanks!

The only disadvantage is that the filename in error and warning messages is quite useless:

[ 15 of 400] Compiling Data.Generics2.Instances ( syb-generics/Data/Generics2/Instances.hs, syb-generics/Data/Generics2/Instances.o )

/tmp/ghc5667_0/ghc5667_248.hspp:299:17:
    Couldn't match expected type `forall a1. (Data ctx a1) => c (t a1)'
           against inferred type `c1 (t1 a1)'
      Expected type: (forall a2. (Data ctx a2) => c (t a2)) -> Maybe (c [a])
      Inferred type: c1 (t1 a1) -> Maybe (c1 (t' a1))
    In the expression: gcast1
    In the definition of `dataCast1': dataCast1 _ = gcast1


Christian Maeder

3:47 p.m.

New subject: PS. compiler change, was: UTF-8 decoding error

Christian Maeder wrote:

> [ 15 of 400] Compiling Data.Generics2.Instances ( syb-generics/Data/Generics2/Instances.hs, syb-generics/Data/Generics2/Instances.o )
>
> /tmp/ghc5667_0/ghc5667_248.hspp:299:17:
>     Couldn't match expected type `forall a1. (Data ctx a1) => c (t a1)'
>            against inferred type `c1 (t1 a1)'
>       Expected type: (forall a2. (Data ctx a2) => c (t a2)) -> Maybe (c [a])
>       Inferred type: c1 (t1 a1) -> Maybe (c1 (t' a1))
>     In the expression: gcast1
>     In the definition of `dataCast1': dataCast1 _ = gcast1

This particular error is fixed by writing:

  dataCast1 _ f = gcast1 f

(for "dataCast1 _ = gcast1")

C.


Duncan Coutts

4:04 p.m.

On Fri, 2006-09-22 at 17:19 +0200, Christian Maeder wrote:

> Christian Maeder wrote:
>
>> Duncan Coutts wrote:
>>
>>> There is iconv. It could be used as a pre-processor with ghc's -F -pgmF -optF flags.
>>
>> NB: -F is missing in the Flag reference
>>
>> A simple script for the pgmF command
>>
>>   #!/bin/sh
>>   iconv -f l1 -t utf-8 $2 > $3
>>
>> worked for me, thanks!
>
> The only disadvantage is that the filename in error and warning messages is quite useless:
>
> [ 15 of 400] Compiling Data.Generics2.Instances ( syb-generics/Data/Generics2/Instances.hs, syb-generics/Data/Generics2/Instances.o )
>
> /tmp/ghc5667_0/ghc5667_248.hspp:299:17:

I think you can fix this by prepending a {-# LINE #-} pragma in your script. Something like:

  #!/bin/sh
  ( echo "{-# LINE 1 \"$2\" #-}" ; iconv -f l1 -t utf-8 $2 ) > $3

Duncan
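A sketch of the same idea with the arguments quoted, again assuming an iconv that knows ISO-8859-1 and UTF-8. It puts `$1`, which GHC documents as the original source file name, into the pragma rather than `$2`; the function name is illustrative. A file name containing a double quote or a backslash would need escaping inside the pragma, which this sketch does not attempt.

```shell
#!/bin/sh
# Hypothetical pre-processor for "ghc -F -pgmF <this script>".
# $1 = original source name, $2 = input file, $3 = output file.
latin1_to_utf8_with_line() {
    {
        # Tell GHC that what follows is line 1 of the original file,
        # so diagnostics name that file instead of the temporary one.
        printf '{-# LINE 1 "%s" #-}\n' "$1"
        iconv -f ISO-8859-1 -t UTF-8 < "$2"
    } > "$3"
}

# Only act when called with the three arguments GHC supplies.
if [ "$#" -ge 3 ]; then
    latin1_to_utf8_with_line "$@"
fi
```

Because the pragma occupies its own extra first line and declares the next line to be line 1, the line numbers reported by the compiler still match the unconverted source.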


Christian Maeder

4:17 p.m.

Duncan Coutts wrote:

>> /tmp/ghc5667_0/ghc5667_248.hspp:299:17:
>
> I think you can fix this by prepending a {-# LINE #-} pragma in your script. Something like:
>
>   #!/bin/sh
>   ( echo "{-# LINE 1 \"$2\" #-}" ; iconv -f l1 -t utf-8 $2 ) > $3

Yes, thanks again!

C.


Christian Maeder

9 Oct 2006

11:55 a.m.

Duncan Coutts wrote:

> On Fri, 2006-09-22 at 17:19 +0200, Christian Maeder wrote:
>
>>> A simple script for the pgmF command
>>
>> The only disadvantage is that the filename in error and warning messages is quite useless:
>
> I think you can fix this by prepending a {-# LINE #-} pragma in your script. Something like:

Could it be that import chasing takes longer now? I noticed quite a gap before ghc started to compile my 624 modules.

Christian


Duncan Coutts

5:52 p.m.

On Mon, 2006-10-09 at 13:55 +0200, Christian Maeder wrote:

> Duncan Coutts wrote:
>
>> On Fri, 2006-09-22 at 17:19 +0200, Christian Maeder wrote:
>>
>>>> A simple script for the pgmF command
>>>
>>> The only disadvantage is that the filename in error and warning messages is quite useless:
>>
>> I think you can fix this by prepending a {-# LINE #-} pragma in your script. Something like:
>
> Could it be that import chasing takes longer now? I noticed quite a gap before ghc started to compile my 624 modules.

Yes, since it has to run the pre-processor before it can look at the imports.

Duncan


Bulat Ziganshin

10 Oct 2006

1:52 p.m.

New subject: reading imports from .hi files?

Hello Duncan,

Monday, October 9, 2006, 9:52:34 PM, you wrote:

>>>>> A simple script for the pgmF command

>> Could it be that import chasing takes longer now? I noticed quite a gap before ghc started to compile my 624 modules.

> Yes, since it has to run the pre-processor before it can look at the imports.

Isn't it possible to check that the source files have not changed and, in that case, look up the imports from the .hi files?

--
Best regards,
Bulat mailto:Bulat.Ziganshin@gmail.com


Matthew Pocock

27 Sep 2006

10:05 a.m.

Fortress (Sun's possibly-not-vaporware HPC language) supports arbitrary Unicode chars in code, and has an escape syntax for commonly used things. Similarly, proof-general/isabelle supports TeX-style escapes for symbols and Greek. It seems to me that a pre-processor that turns human-friendly escapes (e.g. \{lambda} rather than some magic number) into Unicode, together with a slightly intelligent IDE (or emacs mode?), would go most of the way to letting us use upside-down ys and curly as with all the visual beauty and editor niceness that we have now with ASCII.

Matthew

On Wednesday 20 September 2006 21:42, Duncan Coutts wrote:

> On Wed, 2006-09-20 at 18:14 +0200, Christian Maeder wrote:
>
>> How can I convince ghc version 6.5.20060919 to accept latin1 characters in literals?
>>
>> I wish to keep source files (containing umlauts in strings) that can be compiled by both ghc-6.4.2 and ghc-6.6.
>
> You can use numeric escapes like "\222".
>
> Duncan
>
> _______________________________________________
> Glasgow-haskell-users mailing list
> Glasgow-haskell-users@haskell.org
> http://www.haskell.org/mailman/listinfo/glasgow-haskell-users
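The pre-processor idea in this message could look roughly like the following sketch. The three escape names and the function name `expand_escapes` are invented for illustration; no such convention exists in GHC, and a real tool would need a far larger table and would have to decide how to treat escapes inside string literals.

```shell
#!/bin/sh
# Hypothetical filter: replace a few \{name} escapes on stdin with the
# corresponding Unicode characters (written here as UTF-8) on stdout.
expand_escapes() {
    # In a basic regular expression "\\" matches one literal backslash
    # and "{" is an ordinary character, so "\\{lambda}" matches \{lambda}.
    sed -e 's/\\{lambda}/λ/g' \
        -e 's/\\{alpha}/α/g' \
        -e 's/\\{forall}/∀/g'
}
```

Such a filter reads standard input and writes standard output, so it would still need a small wrapper that reads from and writes to the file names GHC passes to a -pgmF command.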


Jan-Willem Maessen

7:27 p.m.

On Sep 27, 2006, at 6:05 AM, Matthew Pocock wrote:

> Fortress (Sun's possibly-not-vaporware HPC language) supports arbitrary Unicode chars in code, and has an escape syntax for commonly used things.

I have spent the past week writing Fortress code (which runs in parallel, even). But I'm perhaps a special case. :-)

> Similarly, proof-general/isabelle supports TeX-style escapes for symbols and Greek. It seems to me that a pre-processor that turns human-friendly escapes (e.g. \{lambda} rather than some magic number) into Unicode, together with a slightly intelligent IDE (or emacs mode?), would go most of the way to letting us use upside-down ys and curly as with all the visual beauty and editor niceness that we have now with ASCII.

In Fortress we spent a *lot* of effort making the "TWiki" syntax as painless as possible for stuff which we planned to use often (for example, -> and => turn into Unicode arrows, and the language syntax is defined in terms of them).

One source of both encouragement and frustration is the fact that every Unicode code point has an associated description. We support using these descriptions, and various shortenings of them, since they are too verbose for day-to-day use. The frustration is that the names or their shortenings are not necessarily unique. For characters which only occur in strings this is less critical, but a little effort will go a long way. One heuristic we've used is: "if I do a diff on the ASCII representation provided by my version control system, will I be able to read the result?"

We of course have a little program which processes an official Unicode character table (downloaded from the web) plus some information about our special cases and uses it to generate the appropriate conversion functions. This is important because Unicode is constantly changing (mostly getting bigger).

-Jan-Willem Maessen
Fortress developer, Haskell hacker

> Matthew
>
> On Wednesday 20 September 2006 21:42, Duncan Coutts wrote:
>
>> On Wed, 2006-09-20 at 18:14 +0200, Christian Maeder wrote:
>>
>>> How can I convince ghc version 6.5.20060919 to accept latin1 characters in literals?
>>>
>>> I wish to keep source files (containing umlauts in strings) that can be compiled by both ghc-6.4.2 and ghc-6.6.
>>
>> You can use numeric escapes like "\222".
>>
>> Duncan


participants (6)

Bulat Ziganshin
Christian Maeder
Duncan Coutts
Jan-Willem Maessen
Matthew Pocock
Ross Paterson