Proposal: Define UTF-8 to be the encoding of Haskell source files

4 Apr 2011

Per the Haskell Prime process, I would like to make an official
proposal [1].

* Proposal

The Haskell 2010 language specification states: "Haskell uses the
Unicode character set" [2]. It does not state which encoding should be
used. Strictly speaking, this means that Haskell source files cannot
be reliably exchanged at the byte level.

I propose to make UTF-8 the only allowed encoding for Haskell source
files. Implementations must discard an initial Byte Order Mark (BOM)
if present [3].
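
As an illustration of the rule above, here is a minimal sketch of how
an implementation might decode a source file. It assumes the
bytestring and text packages are available; decodeSource is a
hypothetical helper name, not part of any existing compiler.

```haskell
import qualified Data.ByteString as B
import Data.Maybe (fromMaybe)
import qualified Data.Text as T
import Data.Text.Encoding (decodeUtf8')

-- | Hypothetical helper: decode the raw bytes of a source file.
-- Anything that is not valid UTF-8 is rejected, and a single
-- initial BOM (U+FEFF, bytes EF BB BF) is discarded if present.
decodeSource :: B.ByteString -> Either String T.Text
decodeSource bytes =
  case decodeUtf8' bytes of
    Left err  -> Left ("source file is not valid UTF-8: " ++ show err)
    Right txt -> Right (dropBom txt)
  where
    -- Only a BOM at the very start of the file is removed.
    dropBom t = fromMaybe t (T.stripPrefix (T.singleton '\xFEFF') t)
```

Under this sketch a Latin-1 file containing the lone byte 0xE9 is
rejected rather than silently misread, which is the portability
guarantee the proposal is after.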

* Pros

- Ensures that Haskell source can be reliably exchanged at the byte
  level.
- Disallows implicit ISO-8859-* encodings in source code, ensuring
  portability.
- Little or no implementation burden for compiler writers.

* Cons

- Existing code relying on a non-UTF-8, locale- or
  implementation-specific encoding will need conversion. (This is
  only relevant for Hugs-only code.)

* Implementation status

** GHC
"GHC assumes that source files are ASCII or UTF-8 only, other
encodings are not recognised. However, invalid UTF-8 sequences will be
ignored in comments, so it is possible to use other encodings such as
Latin-1, as long as the non-comment source code is ASCII only." [4]
...
From this I deduce that all current code accepted by GHC is compatible
with UTF-8. No working code will be broken.

** JHC
"JHC allows unrestricted use of the Unicode character set in Haskell
source, treating input as UTF-8." [5]

** Hugs
Hugs treats input as being in the encoding specified by the current
locale, but permits Unicode only in comments and character and string
literals. [6]

* Related proposal

There is one related proposal, now five years old:
"SourceEncodingDetection" [5]. It proposes detecting the encoding
with an algorithm that can distinguish between UTF-8, UTF-16 and
(not always) UTF-32. It can also detect the endianness of the
document, if applicable.

I think mandating UTF-8 alone is better than a detection algorithm.
It places less burden on implementation writers and is even more
portable.

* Next step

Discussion! There has already been some discussion on the
haskell-cafe mailing list [7].

Attached is a patch for the Haskell Report which adds a note stating
that source encodings must be UTF-8.

Regards,
Roel van Dijk

[1] - http://hackage.haskell.org/trac/haskell-prime/wiki/Process
[2] - http://www.haskell.org/onlinereport/haskell2010/haskellch2.html#x7-150002.1
[3] - http://www.unicode.org/faq/utf_bom.html#bom5
[4] - http://www.haskell.org/ghc/docs/7.0-latest/html/users_guide/separate-compila...
[5] - http://hackage.haskell.org/trac/haskell-prime/wiki/SourceEncodingDetection
[6] - http://cvs.haskell.org/Hugs/pages/users_guide/locale.html
[7] - http://article.gmane.org/gmane.comp.lang.haskell.cafe/87815
