RFC: Unicode primes and super/subscript characters in GHC

Hello lists,

As some of you may know, GHC's support for Unicode characters in lexemes is rather crude and hence prone to inconsistencies in their handling versus the ASCII counterparts. For example, APOSTROPHE is treated differently from PRIME:

    λ> data a +' b = Plus a b

    <interactive>:3:9:
        Unexpected type ‘b’
        In the data declaration for ‘+’
        A data declaration should have form
          data + a b c = ...

    λ> data a +′ b = Plus a b

    λ> let a' = 1

    λ> let a′ = 1

    <interactive>:10:8: parse error on input ‘=’

Also some rather bizarre looking things are accepted:

    λ> let ᵤxᵤy = 1

In the spirit of improving things little by little I would like to propose:

1. Handle single/double/triple/quadruple Unicode PRIMEs the same way as APOSTROPHE, meaning the following alterations to the lexer:

    primes  -> U+2032 | U+2033 | U+2034 | U+2057
    symbol  -> ascSymbol | uniSymbol (EXCEPT special | _ | " | ' | primes)
    graphic -> small | large | symbol | digit | special | " | ' | primes
    varid   -> (small { small | large | digit | ' | primes }) (EXCEPT reservedid)
    conid   -> large { small | large | digit | ' | primes }

2. Introduce a new lexer nonterminal "subsup" that would include the Unicode sub/superscript[1] versions of numbers, "-", "+", "=", "(", ")", Latin and Greek letters. And allow these characters to be used in names and operators:

    symbol  -> ascSymbol | uniSymbol (EXCEPT special | _ | " | ' | primes | subsup )
    digit   -> ascDigit | uniDigit (EXCEPT subsup)
    small   -> ascSmall | uniSmall (EXCEPT subsup) | _
    large   -> ascLarge | uniLarge (EXCEPT subsup)
    graphic -> small | large | symbol | digit | special | " | ' | primes | subsup
    varid   -> (small { small | large | digit | ' | primes | subsup }) (EXCEPT reservedid)
    conid   -> large { small | large | digit | ' | primes | subsup }
    varsym  -> (symbol (EXCEPT :) {symbol | subsup}) (EXCEPT reservedop | dashes)
    consym  -> (: {symbol | subsup}) (EXCEPT reservedop)

If this proposal is received favorably, I'll write a patch for GHC based on my previous stab at the problem[2].

P.S. I'm CC-ing Cafe for extra attention, but please keep the discussion to the GHC users list.

[1] https://en.wikipedia.org/wiki/Unicode_subscripts_and_superscripts
[2] https://ghc.haskell.org/trac/ghc/ticket/5108
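
For readers who want to experiment with item 1, the varid and conid rules can be sketched as plain Haskell predicates. This is only an illustration, not the GHC lexer: the function names are made up for this example, the character classes follow the Haskell 2010 Report's wording for uniSmall, uniLarge and uniDigit (GHC's actual classification may differ), and the EXCEPT reservedid clause is deliberately left out.

```haskell
import Data.Char (GeneralCategory (..), generalCategory)

-- All names below are hypothetical; none of them come from GHC.

-- primes -> U+2032 | U+2033 | U+2034 | U+2057
isPrimeChar :: Char -> Bool
isPrimeChar c = c `elem` ['\x2032', '\x2033', '\x2034', '\x2057']

-- small: a Unicode lowercase letter or underscore (assumed Report reading)
isSmall :: Char -> Bool
isSmall c = c == '_' || generalCategory c == LowercaseLetter

-- large: a Unicode uppercase or titlecase letter
isLarge :: Char -> Bool
isLarge c = generalCategory c `elem` [UppercaseLetter, TitlecaseLetter]

-- digit: a Unicode decimal digit
isDigitChar :: Char -> Bool
isDigitChar c = generalCategory c == DecimalNumber

-- Characters allowed after the first one: small | large | digit | ' | primes
isIdentTail :: Char -> Bool
isIdentTail c =
  isSmall c || isLarge c || isDigitChar c || c == '\'' || isPrimeChar c

-- varid -> small { small | large | digit | ' | primes }
-- (reserved identifiers are NOT excluded in this sketch)
isVarid :: String -> Bool
isVarid []       = False
isVarid (c : cs) = isSmall c && all isIdentTail cs

-- conid -> large { small | large | digit | ' | primes }
isConid :: String -> Bool
isConid []       = False
isConid (c : cs) = isLarge c && all isIdentTail cs
```

Loaded into GHCi, `isVarid "a'"` and `isVarid "a′"` both hold under these rules, which is the consistency item 1 asks for, while a name that starts with a prime is rejected.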

I have this feature in jhc, where I have a 'trailing' character class
that can appear at the end of both symbols and ids.
Currently it consists of:
$trailing = [₀₁₂₃₄₅₆₇₈₉⁰¹²³⁴⁵⁶⁷⁸⁹₍₎⁽⁾₊₋]
John
-- John Meacham - http://notanumber.net/
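
To make the 'trailing' idea concrete, here is a small Haskell sketch of a predicate for the character class listed above, plus a helper that splits a lexeme into its body and its trailing run. It is not taken from jhc; the names `trailingChars`, `isTrailing`, `splitTrailing` and `trailingOnlyAtEnd` are invented for this illustration, and only the character list itself comes from the message.

```haskell
-- Hypothetical helper names; only the character list is from the message.

-- The characters of the $trailing class, copied verbatim.
trailingChars :: String
trailingChars = "₀₁₂₃₄₅₆₇₈₉⁰¹²³⁴⁵⁶⁷⁸⁹₍₎⁽⁾₊₋"

isTrailing :: Char -> Bool
isTrailing c = c `elem` trailingChars

-- Split a lexeme into (body, trailing run). The trailing run is the
-- longest suffix made only of trailing characters and may be empty.
splitTrailing :: String -> (String, String)
splitTrailing s = (reverse revBody, reverse revSuffix)
  where
    (revSuffix, revBody) = span isTrailing (reverse s)

-- True when trailing characters occur only at the end of the lexeme,
-- which is how the class is described: "at the end of both symbols and ids".
trailingOnlyAtEnd :: String -> Bool
trailingOnlyAtEnd s = not (any isTrailing (fst (splitTrailing s)))
```

Under this reading a name such as `x₁₂` splits into the body `x` and the trailing run `₁₂`, while a name with a subscript in the middle fails `trailingOnlyAtEnd`.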

I personally like this idea. Mathematica allows all sorts of bizarre names
and it'd be cool for Haskell to be similar, so that mathematical Haskell
scripts and IHaskell notebooks can be just as fancy and incomprehensible as
dense Mathematica code!
Since GHC already accepts *some* Unicode, I think it'd be a great idea to
extend it in this way.
participants (3)
- Andrew Gibiansky
- John Meacham
- Mikhail Vorozhtsov