Proposal: Define UTF-8 to be the encoding of Haskell source files

Per the Haskell Prime process I would like to make an official proposal [1].

* Proposal

The Haskell 2010 language specification states that: "Haskell uses the Unicode character set" [2]. It does not state what encoding should be used. This means, strictly speaking, it is not possible to reliably exchange Haskell source files on the byte level.

I propose to make UTF-8 the only allowed encoding for Haskell source files. Implementations must discard an initial Byte Order Mark (BOM) if present [3].

* Pros

- Ensures that Haskell source can be reliably exchanged on the byte level.
- Disallows implicit ISO-8859-* encodings in source code, ensuring portability.
- Little or no implementation burden for compiler writers.

* Cons

- Existing code relying on a non-UTF8, locale-/implementation-specific encoding will need conversion. (Only relevant for Hugs-only code).

* Implementation status

** GHC

"GHC assumes that source files are ASCII or UTF-8 only, other encodings are not recognised. However, invalid UTF-8 sequences will be ignored in comments, so it is possible to use other encodings such as Latin-1, as long as the non-comment source code is ASCII only." [4]

From this I deduce that all current code accepted by GHC is compatible with UTF-8. No working code will be broken.

** JHC

"JHC allows unrestricted use of the Unicode character set in Haskell source, treating input as UTF-8." [5]

** Hugs

Hugs treats input as being in the encoding specified by the current locale, but permits Unicode only in comments and character and string literals. [6]

* Related proposal

There is one, 5 year old, proposal that is related: "SourceEncodingDetection" [5]. There it is proposed to detect the encoding using an algorithm which can distinguish between UTF-8, UTF-16 and (not always) UTF-32. It can also detect the endianness of the document, if applicable.

I think choosing just UTF-8 is a better choice than a detection algorithm. It places less burden on implementation writers and is even more portable.

* Next step

Discussion! There was already some discussion on the haskell-cafe mailing list [7].

Attached is a patch for the Haskell Report which adds a note stating that source encodings must be UTF-8.

Regards,
Roel van Dijk

[1] - http://hackage.haskell.org/trac/haskell-prime/wiki/Process
[2] - http://www.haskell.org/onlinereport/haskell2010/haskellch2.html#x7-150002.1
[3] - http://www.unicode.org/faq/utf_bom.html#bom5
[4] - http://www.haskell.org/ghc/docs/7.0-latest/html/users_guide/separate-compila...
[5] - http://hackage.haskell.org/trac/haskell-prime/wiki/SourceEncodingDetection
[6] - http://cvs.haskell.org/Hugs/pages/users_guide/locale.html
[7] - http://article.gmane.org/gmane.comp.lang.haskell.cafe/87815
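As a rough illustration of what the proposal asks of an implementation, here is a minimal sketch of a source decoder that drops one leading BOM and then decodes strictly as UTF-8. The names utf8Bom and decodeSource are hypothetical and not taken from any compiler; the sketch assumes the bytestring and text packages are available.

```haskell
import qualified Data.ByteString as B
import Data.Maybe (fromMaybe)
import qualified Data.Text as T
import Data.Text.Encoding (decodeUtf8')

-- | The UTF-8 encoding of U+FEFF, the byte order mark.
utf8Bom :: B.ByteString
utf8Bom = B.pack [0xEF, 0xBB, 0xBF]

-- | Hypothetical helper: decode the raw bytes of a source file.
-- One leading BOM is discarded if present; after that the input
-- must be valid UTF-8, otherwise a Left with the error is returned.
decodeSource :: B.ByteString -> Either String T.Text
decodeSource bytes =
  case decodeUtf8' body of
    Left err  -> Left (show err)
    Right txt -> Right txt
  where
    -- stripPrefix returns Nothing when there is no BOM to remove
    body = fromMaybe bytes (B.stripPrefix utf8Bom bytes)
```

With this sketch, a file with or without a BOM decodes to the same text, while a file containing a lone Latin-1 byte such as 0xE9 is rejected.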

Hi,

Roel van Dijk wrote:
I propose to make UTF-8 the only allowed encoding for Haskell source files. Implementations must discard an initial Byte Order Mark (BOM) if present [3].

How would that affect the non-code parts of literate Haskell (*.lhs) files? In particular, would it place any burden on third-party tools processing these files?

Tillmann

Roel van Dijk wrote:
I propose to make UTF-8 the only allowed encoding for Haskell source files. Implementations must discard an initial Byte Order Mark (BOM) if present
I am in favor of this proposal. However, you wrote:
"GHC assumes that source files are ASCII or UTF-8 only, other encodings are not recognised. However, invalid UTF-8 sequences will be ignored in comments, so it is possible to use other encodings such as Latin-1, as long as the non-comment source code is ASCII only." [4]
From this I deduce that all current code accepted by GHC is compatible with UTF-8. No working code will be broken.
No. If GHC is changed to conform to this proposal, source code including invalid UTF-8 in comments which previously compiled successfully will now be rejected.

But anyway I think allowing invalid UTF-8 in comments is a mistake. It could lead to the end of the comment being detected in the wrong place, thus changing the meaning of the program in very unexpected ways. Not likely, but possible.

I doubt that there is a whole lot of code out there which would be affected. And GHC can easily provide a certain degree of backward compatibility with a flag and/or pragma.

Thanks,
Yitz
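One way the end of a comment could be missed can be sketched with a deliberately naive scanner. It trusts the length announced by each UTF-8 lead byte and skips that many bytes without validating them. This is an illustration only, not a description of how GHC or any other implementation decodes its input; seqLen, naiveLeads and sample are hypothetical names.

```haskell
import Data.Word (Word8)

-- | Number of bytes a UTF-8 lead byte claims for its sequence.
seqLen :: Word8 -> Int
seqLen b
  | b < 0x80  = 1   -- ASCII
  | b >= 0xF0 = 4
  | b >= 0xE0 = 3
  | b >= 0xC0 = 2
  | otherwise = 1   -- stray continuation byte, skipped on its own

-- | Naive scanner: keeps each lead byte and skips the bytes it claims,
-- without checking that they really are continuation bytes.
naiveLeads :: [Word8] -> [Word8]
naiveLeads []       = []
naiveLeads (b : bs) = b : naiveLeads (drop (seqLen b - 1) bs)

-- | The Latin-1 bytes of a line comment "-- é", a newline, then "main".
sample :: [Word8]
sample = [0x2D, 0x2D, 0x20, 0xE9, 0x0A, 0x6D, 0x61, 0x69, 0x6E]
```

Here the Latin-1 byte 0xE9 looks like the start of a three-byte sequence, so the scanner swallows the newline (0x0A) and the following 'm'. The newline never appears in naiveLeads sample, so a lexer built on such a scanner would not see the comment end where the author intended.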

On 4 April 2011 23:48, Roel van Dijk wrote:
* Proposal
The Haskell 2010 language specification states that: "Haskell uses the Unicode character set" [2]. It does not state what encoding should be used. This means, strictly speaking, it is not possible to reliably exchange Haskell source files on the byte level.
I propose to make UTF-8 the only allowed encoding for Haskell source files. Implementations must discard an initial Byte Order Mark (BOM) if present [3].
* Next step
Discussion! There was already some discussion on the haskell-cafe mailing list [7].
This is a simple and obviously sensible proposal. I'm certainly in favour.

I think the only area where there might be some issue to discuss is the language of the report. As far as I can see, the report does not require that modules exist as files, does not require the ".hs" extension and does not give the "standard" mapping from module name to file name.

So since the goal is interoperability of source files then perhaps we should also have a section somewhere with interoperability guidelines for implementations that do store Haskell programs as OS files. The section would describe the one module per file convention, the .hs extension (this is already obliquely mentioned in the section on literate Haskell syntax) and the mapping of module names to file names in common OS file systems. Then this UTF8 stipulation could go there (and it would be clear that it applies only to conventional implementations that store Haskell programs as files).

e.g.

Interoperability Guidelines
========================

This Report does not specify how Haskell programs are represented or stored. There is however a conventional representation using OS files. Implementations that conform to these guidelines will benefit from the portability of Haskell program representations.

Haskell modules are stored as files, one module per file. These Haskell source files are given the file extension ".hs" for usual Haskell files and ".lhs" for literate Haskell files (see section 10.4). Source files must be encoded as UTF-8 \cite{utf8}. Implementations must discard an initial Byte Order Mark (BOM) if present.

To find a source file corresponding to a module name used in an import declaration, the following mapping from module name to OS file name is used. The '.' character is mapped to the OS's directory separator string while all other characters map to themselves. The ".hs" or ".lhs" extension is added. Where both ".hs" and ".lhs" files exist for the same module, the ".lhs" one should be used. The OS's standard convention for representing Unicode file names should be used.

For example, on a UNIX based OS, the module A.B would map to the file name "A/B.hs" for a normal Haskell file or to "A/B.lhs" for a literate Haskell file.

Note that because it is rare for a Main module to be imported, there is no restriction on the name of the file containing the Main module. It is conventional, but not strictly necessary, that the Main module use the ".hs" or ".lhs" extension.

Duncan
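The module-name mapping in the draft guidelines could be sketched roughly as follows. This is only one reading of the draft text: splitModule and candidatePaths are hypothetical names, and the directory separator is passed in as a parameter rather than taken from the OS so that the example stays self-contained.

```haskell
import Data.List (intercalate)

-- | Split a module name on '.', so "A.B" becomes ["A", "B"].
splitModule :: String -> [String]
splitModule s = case break (== '.') s of
  (part, [])       -> [part]
  (part, _ : rest) -> part : splitModule rest

-- | Candidate file names for a module, in lookup order.
-- The ".lhs" candidate comes first, because the draft says the
-- ".lhs" file should be used when both files exist.
candidatePaths :: String -> String -> [FilePath]
candidatePaths sep modName = [base ++ ext | ext <- [".lhs", ".hs"]]
  where
    -- every '.' maps to the separator; other characters map to themselves
    base = intercalate sep (splitModule modName)
```

For the UNIX example in the draft, candidatePaths "/" "A.B" gives ["A/B.lhs", "A/B.hs"]. An implementation following the guidelines would use the first candidate that exists.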

On Wed, Apr 6, 2011 at 2:13 PM, Duncan Coutts wrote:
Interoperability Guidelines
========================
[...]
To find a source file corresponding to a module name used in an import declaration, the following mapping from module name to OS file name is used. The '.' character is mapped to the OS's directory separator string while all other characters map to themselves. The ".hs" or ".lhs" extension is added. Where both ".hs" and ".lhs" files exist for the same module, the ".lhs" one should be used. The OS's standard convention for representing Unicode file names should be used.
This standard isn't quite universal. For example, jhc will look for Data.Foo in Data/Foo.hs but also Data.Foo.hs [1]. We could take this as an opportunity to discuss that practice, or we could try to make the changes to the report orthogonal to that issue.

In some sense I think it's cute that the Report doesn't specify anything about how Haskell modules are stored or represented, but I don't think that freedom is actually used, so I'm happy to see it go. I'd think, though, that in that case there would be more to discuss than just the encoding, so if we could separate out the issues here, I think that would be useful.

[1]: http://repetae.net/computer/jhc/manual.html#module-search-path

On Wed, 2011-04-06 at 16:09 +0100, Ben Millwood wrote:
On Wed, Apr 6, 2011 at 2:13 PM, Duncan Coutts wrote:
Interoperability Guidelines
========================
[...]
To find a source file corresponding to a module name used in an import declaration, the following mapping from module name to OS file name is used. The '.' character is mapped to the OS's directory separator string while all other characters map to themselves. The ".hs" or ".lhs" extension is added. Where both ".hs" and ".lhs" files exist for the same module, the ".lhs" one should be used. The OS's standard convention for representing Unicode file names should be used.
This standard isn't quite universal. For example, jhc will look for Data.Foo in Data/Foo.hs but also Data.Foo.hs [1]. We could take this as an opportunity to discuss that practice, or we could try to make the changes to the report orthogonal to that issue.
Indeed. But it's true to say that if you do support the common convention then you get portability. This does not preclude JHC from supporting something extra, but sources that take advantage of JHC's extension are not portable to implementations that just use the common convention.
In some sense I think it's cute that the Report doesn't specify anything about how Haskell modules are stored or represented, but I don't think that freedom is actually used, so I'm happy to see it go. I'd think, though, that in that case there would be more to discuss than just the encoding, so if we could separate out the issues here, I think that would be useful.
It's not going. I hope I was clear in the example text that the interoperability guidelines were not forcing implementations to use files etc, just that if they do, and they use these conventions, then sources will be portable between implementations. It doesn't stop an implementation using URLs, sticking multiple modules in a file or keeping modules in a database.

Duncan

On 6 April 2011 15:13, Duncan Coutts wrote:
So since the goal is interoperability of source files then perhaps we should also have a section somewhere with interoperability guidelines for implementations that do store Haskell programs as OS files.
I think a set of interoperability guidelines is a great idea. It seems these guidelines are already followed by GHC, Cabal, Hackage, Jhc and possibly others. Shall we consider this the proposal instead of just the encoding part?

On Thu, 2011-04-07 at 09:07 +0200, Roel van Dijk wrote:
On 6 April 2011 15:13, Duncan Coutts wrote:
So since the goal is interoperability of source files then perhaps we should also have a section somewhere with interoperability guidelines for implementations that do store Haskell programs as OS files.
I think a set of interoperability guidelines is a great idea. It seems these guidelines are already followed by GHC, Cabal, Hackage, Jhc and possibly others.
Shall we consider this the proposal instead of just the encoding part?
I would be happy to work with you and others to develop the report text for such a proposal. I posted my first draft already :-) Duncan

On 7 April 2011 14:11, Duncan Coutts wrote:
I would be happy to work with you and others to develop the report text for such a proposal. I posted my first draft already :-)
What would be a good way to proceed? Looking at the process I think we should create a wiki page and a ticket for this proposal. If necessary I'll volunteer to be the proposal owner.

On Thu, 2011-04-07 at 15:44 +0200, Roel van Dijk wrote:
On 7 April 2011 14:11, Duncan Coutts wrote:
I would be happy to work with you and others to develop the report text for such a proposal. I posted my first draft already :-)
What would be a good way to proceed? Looking at the process I think we should create a wiki page and a ticket for this proposal. If necessary I'll volunteer to be the proposal owner.
Ok, I can give you permissions on the wiki. What is your username on the haskell-prime wiki? Duncan
participants (5)
- Ben Millwood
- Duncan Coutts
- Roel van Dijk
- Tillmann Rendel
- Yitzchak Gale