
I've been putting together a proposal for Unicode identifiers in Erlang (it's EEP 40 if anyone wants to look it up). In the course of this, it has turned out that there is a technical problem for languages with case-significant identifiers.

Haskell 2010 report, chapter 2.
http://www.haskell.org/onlinereport/haskell2010/haskellch2.html

    varid    → (small {small | large | digit | ' })\⟨reservedid⟩
    conid    → large {small | large | digit | ' }
    small    → ascSmall | uniSmall | _
    ascSmall → a | b | … | z
    uniSmall → any Unicode lowercase letter
    large    → ascLarge | uniLarge
    ascLarge → A | B | … | Z
    uniLarge → any uppercase or titlecase Unicode letter

This is actually ambiguous: any ascSmall is also a uniSmall and any ascLarge is also a uniLarge. I take it that this is intended to mean "any Unicode xxx letter other than an ASCII one" in each case. That's not the problem.

The definition currently bans Hebrew, Arabic, Chinese, Japanese, all the Indic scripts, and basically only allows Latin, Greek, Coptic, Cyrillic, Glagolitic, Armenian, arguably Georgian, and Deseret (but not Shavian). That's not the problem either.

The problem is that being a Unicode lower case, upper case, or title case letter is not a stable property. Unicode Standard Annex UAX#31 guarantees that

    X is a well-formed case-insensitive identifier now
    => X will always be a well-formed case-insensitive identifier

and that

    X is a well-formed case-sensitive identifier now
    => X will always be a well-formed case-sensitive identifier

What it does NOT guarantee is that X will continue to begin with the same *case*, or even that a letter will continue to be classified as a letter. So it is at least technically possible for a valid Haskell 2010 varid (conid) to turn into a conid (varid) or even cease to be a legal Haskell identifier at all.

Unicode Standard Annex UAX#31 guarantees stability of being-an-identifier by having an exceptional set for any letter that stops being a letter to go into.
For example, there are SCRIPT CAPITAL {B,E,F,H,I,L,M,P,R} characters, all of which are capital letters except for SCRIPT CAPITAL P, which is a symbol, but it's in the exceptional set so it's still OK to use.

All of the SCRIPT CAPITAL letters were in General Category So in Unicode 1.1.5 (the earliest for which online data is available). In Unicode 2.1.8, all of them were Lu except for SCRIPT CAPITAL P, which was Ll. By Unicode 3.0.0, SCRIPT CAPITAL P was back to So. Some time later it switched over to Sm. So we've had SCRIPT CAPITAL P

  - not a letter (1.1.5)
  - is a lower case letter (2.1.8)
  - not a letter again (3.0.0)

at least according to the online UnicodeData-<version>.txt files. Putting ℘ into the exceptional set means that a UAX#31 identifier may still contain it, but not so a Haskell one.

There are two aspects to this instability.

(1) Because Haskell hews its own line instead of tailoring UAX#31 the way Ada and Python do, Haskell cannot benefit from the UAX#31 stability guarantee. There _has_ been a character that used to be legal in a Haskell identifier that is not now. That's Haskell's problem, not Unicode's, and the Haskell community does not have to wait for anyone else to address it.

(2) Even if you adopt one of the UAX#31 definitions verbatim, the case distinction Haskell needs to make is not stable. It appears that nobody who worked on UAX#31 was thinking about languages like Prolog, Erlang, Clean, Haskell, F#, or Scala, and that if the Unicode Consortium are told of the problem, they will probably be happy to add some sort of "don't break these languages" guideline.

Next week I intend to submit a proposal to the Unicode Consortium to consider this issue. Would anyone care to see and comment on the proposal before I send it to Unicode.org? Anyone got any suggestions before I begin to write it?

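To make point (1) concrete, here is a small Python sketch (not Haskell, and not what any Haskell compiler actually does) that applies the Haskell 2010 small/large rule to a single start character and compares it with a UAX#31-style test. The function names are made up for illustration. It assumes that Python's str.isidentifier() is an acceptable stand-in for "XID_Start XID_Continue*"; Python additionally admits "_" as a start character.

```python
import unicodedata

def haskell2010_start_class(ch):
    """Classify one character by the Haskell 2010 report's small/large rule.

    Returns "small", "large", or None if the character may not start
    a varid or conid at all.
    """
    if ch == "_":
        return "small"          # the report lists _ under small
    cat = unicodedata.category(ch)
    if cat == "Ll":
        return "small"          # any Unicode lowercase letter
    if cat in ("Lu", "Lt"):
        return "large"          # any uppercase or titlecase letter
    return None                 # Lo, Sm, So, ... are all rejected

def uax31_style_start(ch):
    # Assumption: str.isidentifier() approximates XID_Start for a
    # one-character string (it also accepts "_").
    return ch.isidentifier()

# The answers depend on the Unicode data bundled with the interpreter,
# which is exactly the stability concern discussed above.
print("Unicode data version:", unicodedata.unidata_version)
for ch in ["x", "X", "\u212C", "\u2118", "\u05D0", "\u4E2D"]:
    print(f"U+{ord(ch):04X}",
          unicodedata.category(ch),
          haskell2010_start_class(ch),
          uax31_style_start(ch))
```

With current data, SCRIPT CAPITAL P (U+2118) is rejected by the Haskell-style rule yet accepted by the UAX#31-style test, and the same goes for the Hebrew and Chinese letters, which have General Category Lo.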
For the sake of argument, suppose that we are going to stick with XID_Start XID_Continue* for the union of variables and atoms (which is pretty much what Ada and Python do), and the sole issue of concern is that there should be a stable way to classify such a token as "beginning with default case" or "beginning with marked case".
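As a rough illustration of what such a classification could look like (my own sketch, not a rule taken from UAX#31 or from any proposal text), the Python below consults frozen exception tables first and only then falls back on the current General Category. Everything named here is hypothetical: the tables PINNED_MARKED and PINNED_DEFAULT, their contents, the function start_case, and the reading of "marked case" as "starts with an Lu or Lt character". It again assumes str.isidentifier() approximates XID_Start XID_Continue*.

```python
import unicodedata

# Hypothetical frozen tables: once a character is listed, its answer
# never changes, whatever later Unicode versions say about its category.
PINNED_MARKED = frozenset()
PINNED_DEFAULT = frozenset({"\u2118"})   # SCRIPT CAPITAL P, as an example

def start_case(token):
    """Return "marked" or "default" for a well-formed identifier token."""
    if not token.isidentifier():
        raise ValueError(f"not an identifier: {token!r}")
    first = token[0]
    # Pinned answers take priority over the live category data.
    if first in PINNED_MARKED:
        return "marked"
    if first in PINNED_DEFAULT:
        return "default"
    # Assumption: "marked" means uppercase or titlecase; everything else,
    # including caseless scripts and "_", counts as "default".
    if unicodedata.category(first) in ("Lu", "Lt"):
        return "marked"
    return "default"

for tok in ["foo", "Foo", "\u2118x", "\u4E2D\u6587"]:
    print(tok, start_case(tok))
```

The point of the tables is the same as that of the UAX#31 exceptional set: a character whose category changes gets pinned to its old answer, so a token never silently moves from one class to the other.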