
On 4 April 2011 23:48, Roel van Dijk
* Proposal
The Haskell 2010 language specification states that: "Haskell uses the Unicode character set" [2]. It does not state what encoding should be used. This means, strictly speaking, it is not possible to reliably exchange Haskell source files on the byte level.
I propose to make UTF-8 the only allowed encoding for Haskell source files. Implementations must discard an initial Byte Order Mark (BOM) if present [3].
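To illustrate the BOM rule, here is a minimal sketch that assumes the source file has already been read as a list of raw bytes. The names utf8Bom and stripBom are hypothetical and are not taken from any existing implementation. The sketch only shows that "discard an initial BOM" amounts to dropping the three-byte UTF-8 encoding of U+FEFF when it appears at the very start of the input.

```haskell
import Data.List (stripPrefix)
import Data.Maybe (fromMaybe)
import Data.Word (Word8)

-- | The UTF-8 encoding of U+FEFF, the Byte Order Mark.
utf8Bom :: [Word8]
utf8Bom = [0xEF, 0xBB, 0xBF]

-- | Drop a single leading BOM if one is present; otherwise return the
-- input unchanged. (Hypothetical helper, for illustration only.)
stripBom :: [Word8] -> [Word8]
stripBom bytes = fromMaybe bytes (stripPrefix utf8Bom bytes)
```

Only one leading BOM is removed. A second U+FEFF immediately after it is left in place, since the proposal speaks only of an initial mark.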
* Next step
Discussion! There was already some discussion on the haskell-cafe mailing list [7].
This is a simple and obviously sensible proposal. I'm certainly in favour.

I think the only area where there might be some issue to discuss is the language of the report. As far as I can see, the report does not require that modules exist as files, does not require the ".hs" extension and does not give the "standard" mapping from module name to file name.

Since the goal is interoperability of source files, perhaps we should also have a section somewhere with interoperability guidelines for implementations that do store Haskell programs as OS files. The section would describe the one-module-per-file convention, the ".hs" extension (this is already obliquely mentioned in the section on literate Haskell syntax) and the mapping of module names to file names in common OS file systems. Then this UTF-8 stipulation could go there (and it would be clear that it applies only to conventional implementations that store Haskell programs as files).

e.g.

Interoperability Guidelines
========================

This Report does not specify how Haskell programs are represented or stored. There is however a conventional representation using OS files. Implementations that conform to these guidelines will benefit from the portability of Haskell program representations.

Haskell modules are stored as files, one module per file. These Haskell source files are given the file extension ".hs" for usual Haskell files and ".lhs" for literate Haskell files (see section 10.4).

Source files must be encoded as UTF-8 \cite{utf8}. Implementations must discard an initial Byte Order Mark (BOM) if present.

To find a source file corresponding to a module name used in an import declaration, the following mapping from module name to OS file name is used. The '.' character is mapped to the OS's directory separator string while all other characters map to themselves. The ".hs" or ".lhs" extension is added. Where both ".hs" and ".lhs" files exist for the same module, the ".lhs" one should be used. The OS's standard convention for representing Unicode file names should be used.

For example, on a UNIX-based OS, the module A.B would map to the file name "A/B.hs" for a normal Haskell file or to "A/B.lhs" for a literate Haskell file.

Note that because it is rare for a Main module to be imported, there is no restriction on the name of the file containing the Main module. It is conventional, but not strictly necessary, that the Main module use the ".hs" or ".lhs" extension.

Duncan
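
The module-name mapping described in the guidelines above can be sketched as a small pure function. This is only an illustration of the draft wording, not how any particular compiler searches for modules. The name moduleCandidates is hypothetical, and the directory separator is passed in explicitly as an assumption so that the example does not depend on the host OS.

```haskell
-- | Map a module name to its candidate source file names, in order of
-- preference. The ".lhs" name comes first because the draft guidelines
-- say the ".lhs" file should be used when both exist.
-- (Hypothetical helper, for illustration only.)
moduleCandidates
  :: String      -- ^ the OS's directory separator string, e.g. "/"
  -> String      -- ^ module name as written in an import, e.g. "A.B"
  -> [FilePath]
moduleCandidates sep modName = [base ++ ".lhs", base ++ ".hs"]
  where
    -- Every '.' becomes the separator; all other characters map to themselves.
    base = concatMap replaceDot modName
    replaceDot '.' = sep
    replaceDot c   = [c]
```

With "/" as the separator, moduleCandidates "/" "A.B" yields ["A/B.lhs", "A/B.hs"], matching the UNIX example in the guidelines. An implementation would then pick the first candidate that actually exists on disk.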