Encoding of Haskell source files

Hello,

The Haskell 2010 language specification states that "Haskell uses the Unicode character set" [1]. I interpret this as saying that, at the lowest level, a Haskell program is a sequence of Unicode code points. The standard doesn't say how such a sequence should be encoded. You can argue that the encoding of source files is not part of the language, but I think it would be highly practical to standardise on an encoding scheme.

Strictly speaking, it is not possible to reliably exchange Haskell source files at the byte level. If I download some package from Hackage, I can't tell how the source files are encoded just from looking at the files.

I propose a few solutions:

A - Choose a single encoding for all source files. This is what GHC does: "GHC assumes that source files are ASCII or UTF-8 only, other encodings are not recognised" [2]. UTF-8 seems like a good candidate for such an encoding.

B - Specify the encoding in the source files. Start each source file with a special comment specifying the encoding used in that file. See Python for an example of this mechanism in practice [3]. It would be nice to use already existing facilities to specify the encoding, for example: {-# ENCODING <encoding name> #-}. An interesting idea in the Python PEP is to also allow a form recognised by most text editors: # -*- coding: <encoding name> -*-

C - Option B + a default encoding. Like B, but also choose a default encoding in case no specific encoding is specified.

I would further like to propose specifying the encoding of Haskell source files in the language standard. Encoding of source files belongs somewhere between a language specification and specific implementations, but the language standard seems to be the most practical place.

This is not an official proposal. I am just interested in what the Haskell community has to say about this.
Regards, Roel [1] - http://www.haskell.org/onlinereport/haskell2010/haskellch2.html#x7-150002.1 [2] - http://www.haskell.org/ghc/docs/7.0-latest/html/users_guide/separate-compila... [3] - http://www.python.org/dev/peps/pep-0263/
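To make option B a bit more concrete, a tool could recognise such a pragma on the first line of a file with something like the sketch below. Both the {-# ENCODING ... #-} pragma and the parser are hypothetical; no compiler implements this today.

```haskell
import Data.Char (isSpace)
import Data.List (isPrefixOf, stripPrefix)

-- Hypothetical: recognise an {-# ENCODING <name> #-} pragma on the
-- first line of a source file, returning the declared encoding name.
parseEncodingPragma :: String -> Maybe String
parseEncodingPragma line = do
  rest <- stripPrefix "{-# ENCODING" line
  let (name, close) = break (== '#') (dropWhile isSpace rest)
  if "#-}" `isPrefixOf` close
    then Just (trimEnd name)
    else Nothing
  where
    trimEnd = reverse . dropWhile isSpace . reverse
```

For example, parseEncodingPragma "{-# ENCODING UTF-8 #-}" yields Just "UTF-8", while an ordinary first line yields Nothing.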

2011/4/4 Roel van Dijk
Hello,
The Haskell 2010 language specification states that "Haskell uses the Unicode character set" [1]. I interpret this as saying that, at the lowest level, a Haskell program is a sequence of Unicode code points. The standard doesn't say how such a sequence should be encoded. You can argue that the encoding of source files is not part of the language. But I think it would be highly practical to standardise on an encoding scheme.
Strictly speaking it is not possible to reliably exchange Haskell source files on the byte level. If I download some package from hackage I can't tell how the source files are encoded from just looking at the files.
Not from looking with your eyes perhaps. Does that matter? Your text editor, and the compiler, can surely figure it out for themselves. There aren't many Unicode encoding formats, and there aren't very many possibilities for the leading characters of a Haskell source file, are there?

2011/4/4 Colin Adams
Not from looking with your eyes perhaps. Does that matter? Your text editor, and the compiler, can surely figure it out for themselves. I am not aware of any algorithm that can reliably infer the character encoding used by just looking at the raw data. Why would people bother with stuff like <?xml version="1.0" encoding="UTF-8"?> if automatically figuring out the encoding was easy?
There aren't many Unicode encoding formats From casually scanning some articles about encodings I can count at least 70 character encodings [1].
and there aren't very many possibilities for the leading characters of a Haskell source file, are there? Since a Haskell program is a sequence of Unicode code points the programmer can choose from up to 1,112,064 characters. Many of these can legitimately be part of the interface of a module, as function names, operators or names of types.

Firstly, I personally would love to insist on using UTF-8 and be done with
it. I see no reason to bother with other character encodings.
2011/4/4 Roel van Dijk
2011/4/4 Colin Adams
: Not from looking with your eyes perhaps. Does that matter? Your text editor, and the compiler, can surely figure it out for themselves. I am not aware of any algorithm that can reliably infer the character encoding used by just looking at the raw data. Why would people bother with stuff like <?xml version="1.0" encoding="UTF-8"?> if automatically figuring out the encoding was easy?
There *is* an algorithm for determining the encoding of an XML file, based on a combination of the BOM (Byte Order Mark) and the assumption that the file will start with an XML declaration (i.e., <?xml ... ?>). But this isn't capable of determining every possible encoding on the planet, just distinguishing amongst varieties of UTF-(8|16|32), big/little endian, and EBCDIC. It cannot tell the difference between UTF-8, Latin-1, and Windows-1255 (Hebrew), for example.
There aren't many Unicode encoding formats From casually scanning some articles about encodings I can count at least 70 character encodings [1].
I think the implication of "Unicode encoding formats" is something in the UTF family. An encoding like Latin-1 or Windows-1255 can be losslessly translated into Unicode codepoints, but it's not exactly an encoding of Unicode, but rather a subset of Unicode.
and there aren't very many possibilities for the leading characters of a Haskell source file, are there? Since a Haskell program is a sequence of Unicode code points the programmer can choose from up to 1,112,064 characters. Many of these can legitimately be part of the interface of a module, as function names, operators or names of types.
My guess is that a large subset of Haskell modules start with one of left brace (starting with comment or language pragma), m (for starting with module), or some whitespace character. So it *might* be feasible to take a guess at things. But as I said before: I like UTF-8. Is there anyone out there who has a compelling reason for writing their Haskell source in EBCDIC?
Michael

On 4 April 2011 12:22, Michael Snoyman
Firstly, I personally would love to insist on using UTF-8 and be done with it. I see no reason to bother with other character encodings.
This is also my preferred choice.
There *is* an algorithm for determining the encoding of an XML file, based on a combination of the BOM (Byte Order Mark) and the assumption that the file will start with an XML declaration (i.e., <?xml ... ?>). But this isn't capable of determining every possible encoding on the planet, just distinguishing amongst varieties of UTF-(8|16|32), big/little endian, and EBCDIC. It cannot tell the difference between UTF-8, Latin-1, and Windows-1255 (Hebrew), for example.
I think I was confused between character encodings in general and Unicode encodings.
I think the implication of "Unicode encoding formats" is something in the UTF family. An encoding like Latin-1 or Windows-1255 can be losslessly translated into Unicode codepoints, but it's not exactly an encoding of Unicode, but rather a subset of Unicode.
That would validate Colin's point about there not being that many encodings.
My guess is that a large subset of Haskell modules start with one of left brace (starting with comment or language pragma), m (for starting with module), or some whitespace character. So it *might* be feasible to take a guess at things. But as I said before: I like UTF-8. Is there anyone out there who has a compelling reason for writing their Haskell source in EBCDIC?
I think I misinterpreted the word 'leading'. I thought Colin meant "most used". The set of characters with which Haskell programs start is indeed small. But like you, I prefer no guessing and just defaulting to UTF-8.

2011/4/4 Roel van Dijk
On 4 April 2011 12:22, Michael Snoyman
wrote: Firstly, I personally would love to insist on using UTF-8 and be done with it. I see no reason to bother with other character encodings.
This is also my preferred choice.
+1. I'm also in favor of sticking with UTF-8 and being done with it. All of Hackage *today* is UTF-8 (ASCII included), so why open a can of worms? Also, this means that we would be standardizing the current practice. Cheers, -- Felipe.

Michael Snoyman
My guess is that a large subset of Haskell modules start with one of left brace (starting with comment or language pragma), m (for starting with module), or some whitespace character. So it *might* be feasible to take a guess at things. But as I said before: I like UTF-8. Is there anyone out there who has a compelling reason for writing their Haskell source in EBCDIC?
Probably not EBCDIC. :-) Correct me if I'm wrong here, but I think nobody has compelling reasons for using any Unicode format other than UTF-8. Although some systems use UTF-16 (or some approximation thereof) internally, UTF-8 seems to be the universal choice of external encoding. However, there probably exists a bit of code using Latin-1 and Windows charsets, and here leading characters aren't going to help you all that much. I think the safest thing to do is to require source to be ASCII, and provide escapes for code points >127... -k -- If I haven't seen further, it is by standing in the footprints of giants

2011/4/4 Ketil Malde
I think the safest thing to do is to require source to be ASCII, and provide escapes for code points >127...
I used to think that, until I realised it meant having -- Author: Ma\xef N\xe5me. In code, single characters aren't bad (does Haskell have something like Python's named escapes, "\N{small letter a with ring}"?), but reading UI strings is less fun. Also, Unicode symbols for -> and the like are becoming more popular. --Max
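For the record, Haskell string literals support decimal, octal (\o) and hexadecimal (\x) numeric escapes, plus named abbreviations only for the ASCII control characters (\NUL, \ESC, ...); there is no analogue of Python's \N{...} named escapes. A small illustration of Max's point (the names are mine):

```haskell
-- Max's example written with hex escapes: pure ASCII, but hard to read.
authorEscaped :: String
authorEscaped = "Ma\xef N\xe5me"

-- The same string written directly; this is only unambiguous once the
-- source file's encoding is agreed upon (here, UTF-8).
authorDirect :: String
authorDirect = "Maï Nåme"
```

Both bindings denote the same sequence of code points; only the second is readable.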

2011/4/4 Ketil Malde
I think the safest thing to do is to require source to be ASCII, and provide escapes for code points >127...
I do not think that that is the safest option. The safest is just writing down whatever GHC does. Escape codes for non-ASCII would break a lot of packages and make programming really painful. Consider the following, UTF-8 encoded, file: http://code.haskell.org/numerals/test/Text/Numeral/Language/ZH/TestData.hs I don't want to imagine writing that with escape characters. It would also be very error prone, not being able to readily read what you write. But the overall consensus appears to be UTF-8 as the default encoding. I will write an official proposal to amend the Haskell language specification (probably this evening, UTC+1).

From: Michael Snoyman
Sent: Mon, April 4, 2011 5:22:02 AM Firstly, I personally would love to insist on using UTF-8 and be done with it. I
see no reason to bother with other character encodings.
If by "insist" you mean that the standard should insist that implementations support UTF-8 by default, that would be fine. The rest of the standard already just talks about sequences of Unicode characters, so I don't see much to be gained by prohibiting other encodings. In particular, I have read that systems set up for East Asian scripts often use UTF-16 as a default encoding. Brandon

2011/4/4 Brandon Moore
The rest of the standard already just talks about sequences of unicode characters, so I don't see much to be gained by prohibiting other encodings.
In particular, I have read that systems set up for east asian scripts often use UTF-16 as a default encoding.
Presumably because this will use less disk space on average. I too don't see any reason to forbid other Unicode encodings. Perhaps mandate support for UTF-8, and allow others with a pragma. But unless someone adds support to a Haskell compiler for such a pragma, it will be fairly pointless putting this in the standard. -- Colin Adams Preston, Lancashire, ENGLAND () ascii ribbon campaign - against html e-mail /\ www.asciiribbon.org - against proprietary attachments

2011/4/4 Brandon Moore
From: Michael Snoyman
Sent: Mon, April 4, 2011 5:22:02 AM Firstly, I personally would love to insist on using UTF-8 and be done with it. I
see no reason to bother with other character encodings.
If by "insist", you mean the standard insist that implementations support UTF-8 by default.
No, I mean that compliant compilers should only support UTF-8. I don't see a reason to allow the creation of Haskell files that can only be read by some compilers.
The rest of the standard already just talks about sequences of unicode characters, so I don't see much to be gained by prohibiting other encodings.
In particular, I have read that systems set up for east asian scripts often use UTF-16 as a default encoding.
I don't know about that, but I'd be very surprised if there are any editors out there that don't support UTF-8. If a user is inconvenienced once because he/she needs to change the default encoding to UTF-8, and the result is that all Haskell files share the same encoding, I'm OK with that. @Colin: Even if UTF-16 were more space-efficient than UTF-8 (which I highly doubt [1]), I'd be incredibly surprised if this held true for Haskell source, which will almost certainly be at least 90% code points below 128. For those code points, UTF-16 is twice the size of UTF-8. Michael [1] http://www.haskell.org/pipermail/haskell-cafe/2010-August/082268.html
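Michael's size argument is easy to check: for code points below 128, UTF-8 uses one byte per character while UTF-16 uses two. A quick sketch using the text and bytestring packages (the function name is mine):

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Byte sizes of a string under UTF-8 and UTF-16LE, respectively.
encodedSizes :: String -> (Int, Int)
encodedSizes s =
  let t = T.pack s
  in (B.length (TE.encodeUtf8 t), B.length (TE.encodeUtf16LE t))
```

For a typical ASCII-only line such as "module Main where", this gives (17, 34): exactly the factor of two Michael mentions. For characters outside ASCII the gap narrows or reverses, but those are rare in Haskell source.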

+1 for UTF-8 only. Brandon Moore wrote:
...I don't see much to be gained by prohibiting other encodings.
Universal portability of Haskell source code with respect to its encoding is to be gained. We can achieve that simplicity now with almost no cost. Why miss the opportunity?
In particular, I have read that systems set up for east asian scripts often use UTF-16 as a default encoding.
Default encoding is not an issue for any normal source code editing tool. Thanks, Yitz

On Monday 04 April 2011 11:50:03, Roel van Dijk wrote:
and there aren't very many possibilities for the leading characters of a Haskell source file, are there?
Since a Haskell program is a sequence of Unicode code points the programmer can choose from up to 1,112,064 characters. Many of these can legitimately be part of the interface of a module, as function names, operators or names of types.
Colin spoke of *leading* characters; for .hs files, that drastically reduces the possibilities - not for .lhs, though.

On Tuesday 05 April 2011 04:35:39, Richard O'Keefe wrote:
On 4/04/2011, at 10:24 PM, Daniel Fischer wrote:
Colin spoke of *leading* characters, for .hs files, that drastically reduces the possibilities - not for .lhs, though.
A .hs file can, amongst other things, begin with any "small" letter.
D'oh, yes, I always forget that a module declaration isn't required.

On 5 April 2011 10:35, Daniel Fischer
On Tuesday 05 April 2011 04:35:39, Richard O'Keefe wrote:
On 4/04/2011, at 10:24 PM, Daniel Fischer wrote:
Colin spoke of *leading* characters, for .hs files, that drastically reduces the possibilities - not for .lhs, though.
A .hs file can, amongst other things, begin with any "small" letter.
D'oh, yes, I always forget that a module declaration isn't required.
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
True, but we could say that UTF-8 is compulsory in the absence of a module declaration. -- Colin Adams

On Mon, 2011-04-04 at 11:50 +0200, Roel van Dijk wrote:
I am not aware of any algorithm that can reliably infer the character encoding used by just looking at the raw data. Why would people bother with stuff like <?xml version="1.0" encoding="UTF-8"?> if automatically figuring out the encoding was easy?
It is possible, if the syntax/grammar of the encoded content restricts the set of allowed code points in the first few characters. For instance, valid JSON (see RFC 4627, section 3) requires the first two characters to be plain ASCII code points, thus which of the 5 BOM-less UTF encodings is used is uniquely determined by inspecting the first 4 bytes of the UTF-encoded stream.
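The detection Herbert describes can be sketched directly: given the guarantee that the first two characters are ASCII, the pattern of zero bytes among the first four bytes identifies the encoding. A minimal sketch (names and structure are mine):

```haskell
import qualified Data.ByteString as B

data UtfEncoding = UTF8 | UTF16BE | UTF16LE | UTF32BE | UTF32LE
  deriving (Show, Eq)

-- Distinguish the five BOM-less UTF encodings of a stream whose first
-- two characters are known to be ASCII (as RFC 4627 requires for JSON)
-- by the positions of zero bytes in the first four bytes.
detectEncoding :: B.ByteString -> Maybe UtfEncoding
detectEncoding bs = case map (== 0) (B.unpack (B.take 4 bs)) of
  [True,  True,  True,  False] -> Just UTF32BE
  [False, True,  True,  True ] -> Just UTF32LE
  [True,  False, True,  False] -> Just UTF16BE
  [False, True,  False, True ] -> Just UTF16LE
  [False, False, False, False] -> Just UTF8
  _                            -> Nothing
```

This is exactly why it works for JSON but not for arbitrary Haskell source: Haskell makes no promise about the first two characters.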

On Monday 04 April 2011 10:46:46, Roel van Dijk wrote:
I propose a few solutions:
A - Choose a single encoding for all source files.
This is what GHC does: "GHC assumes that source files are ASCII or UTF-8 only, other encodings are not recognised" [2]. UTF-8 seems like a good candidate for such an encoding.
If there's only a single encoding recognised, UTF-8 surely should be the one (though perhaps Windows users might disagree, iirc, Windows uses UCS2 as standard encoding).
B - Specify encoding in the source files.
Start each source file with a special comment specifying the encoding used in that file. See Python for an example of this mechanism in practice [3]. It would be nice to use already existing facilities to specify the encoding, for example: {-# ENCODING <encoding name> #-}
An interesting idea in the Python PEP is to also allow a form recognised by most text editors: # -*- coding: <encoding name> -*-
C - Option B + Default encoding
Like B, but also choose a default encoding in case no specific encoding is specified.
Default = UTF-8. Laziness makes me prefer that over B.
I would further like to propose to specify the encoding of haskell source files in the language standard. Encoding of source files belongs somewhere between a language specification and specific implementations. But the language standard seems to be the most practical place.
I'd agree.
This is not an official proposal. I am just interested in what the Haskell community has to say about this.
Regards, Roel

On 4 April 2011 11:34, Daniel Fischer
If there's only a single encoding recognised, UTF-8 surely should be the one (though perhaps Windows users might disagree, iirc, Windows uses UCS2 as standard encoding).
Windows APIs use UTF-16, but the encoding of files (which is the relevant point here) is almost uniformly UTF-8 - though of course you can find legacy apps making other choices. Cheers, Max

On Mon, 4 Apr 2011 13:30:08 +0100, you wrote:
Windows APIs use UTF-16...
The newer ones, at least. The older ones usually come in two flavors, UTF-16LE and 8-bit code page-based.
...but the encoding of files (which is the relevant point here) is almost uniformly UTF-8 - though of course you can find legacy apps making other choices.
If you're talking about files written and read by the operating system itself, then perhaps. But my experience is that there are a lot of applications that use UTF-16LE, especially ones that typically only work with smaller files (configuration files, etc.). As for Haskell, I would still vote for UTF-8 only, though. The only reason to favor anything else is legacy compatibility with existing Haskell source files, and that isn't really an issue here. -Steve Schafer

On Mon, Apr 4, 2011 at 7:30 AM, Max Bolingbroke
On 4 April 2011 11:34, Daniel Fischer
wrote: If there's only a single encoding recognised, UTF-8 surely should be the one (though perhaps Windows users might disagree, iirc, Windows uses UCS2 as standard encoding).
Windows APIs use UTF-16, but the encoding of files (which is the relevant point here) is almost uniformly UTF-8 - though of course you can find legacy apps making other choices.
Would we need to specifically allow for a Windows-style leading BOM in UTF-8 documents? I can never remember if it is truly a part of UTF-8 or not.
Cheers, Max

BOM is not part of UTF8, because UTF8 is byte-oriented. But applications should be prepared to read and discard it, because some applications erroneously generate it.
Regards,
Malcolm
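A front end following Malcolm's advice would do something like the following before decoding (a minimal sketch; stripBOM is my name for it):

```haskell
import qualified Data.ByteString as B

-- The UTF-8 serialisation of U+FEFF is the three bytes EF BB BF.
utf8BOM :: B.ByteString
utf8BOM = B.pack [0xEF, 0xBB, 0xBF]

-- Drop a leading BOM if present; otherwise return the input unchanged.
stripBOM :: B.ByteString -> B.ByteString
stripBOM bs
  | utf8BOM `B.isPrefixOf` bs = B.drop 3 bs
  | otherwise                 = bs
```

A compiler would apply this to the raw octets of a source file before handing them to the UTF-8 decoder.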
On 04 Apr, 2011,at 02:09 PM, Antoine Latter
On 4 April 2011 11:34, Daniel Fischer
wrote: If there's only a single encoding recognised, UTF-8 surely should be the one (though perhaps Windows users might disagree, iirc, Windows uses UCS2 as standard encoding).
Windows APIs use UTF-16, but the encoding of files (which is the relevant point here) is almost uniformly UTF-8 - though of course you can find legacy apps making other choices.
Would we need to specifically allow for a Windows-style leading BOM in UTF-8 documents? I can never remember if it is truly a part of UTF-8 or not.
Cheers, Max

That's not what the official unicode site says in its FAQ:
http://unicode.org/faq/utf_bom.html#bom4 and
http://unicode.org/faq/utf_bom.html#bom5
Cheers,
-Tako
On Mon, Apr 4, 2011 at 15:18, malcolm.wallace
BOM is not part of UTF8, because UTF8 is byte-oriented. But applications should be prepared to read and discard it, because some applications erroneously generate it.
Regards, Malcolm
On 04 Apr, 2011,at 02:09 PM, Antoine Latter
wrote: On Mon, Apr 4, 2011 at 7:30 AM, Max Bolingbroke
wrote: On 4 April 2011 11:34, Daniel Fischer
wrote: If there's only a single encoding recognised, UTF-8 surely should be the one (though perhaps Windows users might disagree, iirc, Windows uses UCS2 as standard encoding).
Windows APIs use UTF-16, but the encoding of files (which is the relevant point here) is almost uniformly UTF-8 - though of course you can find legacy apps making other choices.
Would we need to specifically allow for a Windows-style leading BOM in UTF-8 documents? I can never remember if it is truly a part of UTF-8 or not.
Cheers, Max

malcolm.wallace wrote:
BOM is not part of UTF8, because UTF8 is byte-oriented. But applications should be prepared to read and discard it, because some applications erroneously generate it.
For maximum portability, the standard should require compilers to accept and discard an optional BOM as the first character of a source code file. Tako Schotanus wrote:
That's not what the official unicode site says in its FAQ: http://unicode.org/faq/utf_bom.html#bom4 and http://unicode.org/faq/utf_bom.html#bom5
That FAQ clearly states that the BOM is part of some "protocols". It carefully avoids stating whether it is part of the encoding.

It is certainly not erroneous to include the BOM if it is part of the protocol for the applications being used. Applications can include whatever characters they'd like, and they can use whatever handshake mechanism they'd like to agree upon an encoding. The BOM mechanism is common on the Windows platform. It has since appeared in other places as well, but it is certainly not universally adopted.

Python supports a pseudo-encoding called "utf-8-sig" that automatically generates and discards the BOM in support of that handshake mechanism. But it isn't really an encoding, it's a convenience.

Part of the source of all this confusion is some documentation that appeared in the past on Microsoft's site which was unclear about the fact that the BOM handshake is a protocol adopted by Microsoft, not a part of the encoding itself. Some people claim that this was intentional, part of the "embrace and extend" tactic Microsoft allegedly employed in those days in an effort to expand its monopoly.

The wording of the Unicode FAQ is obviously trying to tip-toe diplomatically around this issue without arousing the ire of either pro-Microsoft or anti-Microsoft developers. Thanks, Yitz

I made an official proposal on the haskell-prime list: http://www.haskell.org/pipermail/haskell-prime/2011-April/003368.html Let's have further discussion there.

On Mon, Apr 4, 2011 at 3:52 PM, Roel van Dijk
I made an official proposal on the haskell-prime list:
http://www.haskell.org/pipermail/haskell-prime/2011-April/003368.html
Let's have further discussion there.
I'm not on that mailing list, so I'll comment here: My only caveat is that the encoding provision should apply only when Haskell source is presented to the compiler as a bare stream of octets. Where Haskell source is interchanged as a stream of Unicode characters, the encoding is not relevant; it is likely governed by some outer protocol, and hence may not be UTF-8, but it is in any case invisible at the Haskell level. Two examples where this might come into play:

1) An IDE that stores module source in some database. It would not be relevant what encoding the IDE and database choose to store the source in if the source is presented to the integrated compiler as Unicode characters.

2) A compilation system that fetches module source via HTTP (I could imagine a compiler that chased down included modules directly off of Hackage, say). HTTP already has a mechanism (via MIME types) of transmitting the encoding clearly, so there should be no problem if that outer protocol transmits the source to the compiler in some other encoding. There is no reason (and only potential interoperability restrictions) to enforce that UTF-8 be the only legal encoding here.

On 5 April 2011 07:04, Mark Lentczner
I'm not on that mailing list, so I'll comment here:
I recommend joining the prime list. It is very low traffic and the place where language changes should be discussed.
My only caveat is that the encoding provision should apply when Haskell source is presented to the compiler as a bare stream of octets. Where Haskell source is interchanged as a stream of Unicode characters, then encoding is not relevant -- but may be likely governed by some outer protocol - and hence may not be UTF-8 but nonetheless invisible at the Haskell level.
My intention is that every time you need an encoding for Haskell sources, it must be UTF-8. At least if you want to call it Haskell. This is not limited to compilers but concerns all tools that process Haskell sources.
Two examples where this might come into play are: 1) An IDE that stores module source in some database. It would not be relevant what encoding that IDE and database choose to store the source in if the source is presented to the integrated compiler as Unicode characters.
An IDE and database are free to store sources any way they see fit. But as soon as you want to exchange that source with some standards conforming system it must be encoded as UTF-8.
2) If a compilation system fetches module source via HTTP (I could imagine a compiler that chased down included modules directly off of Hackage, say), then HTTP already has a mechanism (via MIME types) of transmitting the encoding clearly. As such, there should be no problem if that outer protocol (HTTP) transmits the source to the compiler via some other encoding. There is no reason (and only potential interoperability restrictions) to enforce that UTF-8 be the only legal encoding here.
This is an interesting example. What distinguishes this scenario from others is that there is a clear understanding between two parties (client and server) how a file should be interpreted. I could word my proposal in such a way that it only concerns situations where such a prior agreement doesn't or can't exist. For example, when storing source on a file system.

On Mon, Apr 4, 2011 at 17:51, Yitzchak Gale
malcolm.wallace wrote:
BOM is not part of UTF8, because UTF8 is byte-oriented. But applications should be prepared to read and discard it, because some applications erroneously generate it.
For maximum portability, the standard should require compilers to accept and discard an optional BOM as the first character of a source code file.
Tako Schotanus wrote:
That's not what the official unicode site says in its FAQ: http://unicode.org/faq/utf_bom.html#bom4 and http://unicode.org/faq/utf_bom.html#bom5
That FAQ clearly states that BOM is part of some "protocols". It carefully avoids stating whether it is part of the encoding.
It is certainly not erroneous to include the BOM if it is part of the protocol for the applications being used. Applications can include whatever characters they'd like, and they can use whatever handshake mechanism they'd like to agree upon an encoding. The BOM mechanism is common on the Windows platform. It has since appeared in other places as well, but it is certainly not universally adopted.
Python supports a pseudo-encoding called "utf-8-sig" that automatically generates and discards the BOM in support of that handshake mechanism. But it isn't really an encoding, it's a convenience.
Part of the source of all this confusion is some documentation that appeared in the past on Microsoft's site which was unclear about the fact that the BOM handshake is a protocol adopted by Microsoft, not a part of the encoding itself. Some people claim that this was intentional, part of the "embrace and extend" tactic Microsoft allegedly employed in those days in an effort to expand its monopoly.
The wording of the Unicode FAQ is obviously trying to tip-toe diplomatically around this issue without arousing the ire of either pro-Microsoft or anti-Microsoft developers.
Some reliable sources for all this would be entertaining (although irrelevant for the rest of this discussion). Cheers, -Tako
participants (17)

- Antoine Latter
- Brandon Moore
- Colin Adams
- Daniel Fischer
- Felipe Almeida Lessa
- Herbert Valerio Riedel
- Ketil Malde
- malcolm.wallace
- Mark Lentczner
- Max Bolingbroke
- Max Rabkin
- Michael Snoyman
- Richard O'Keefe
- Roel van Dijk
- Steve Schafer
- Tako Schotanus
- Yitzchak Gale