[GHC] #10196: Regression regarding Unicode subscript characters in identifiers

#10196: Regression regarding Unicode subscript characters in identifiers
-------------------------------------+-------------------------------------
              Reporter:  thomie             |             Owner:
                  Type:  bug                |            Status:  new
              Priority:  normal             |         Milestone:  7.10.2
             Component:  Compiler (Parser)  |           Version:  7.10.1
              Keywords:                     |  Operating System:  Unknown/Multiple
          Architecture:  Unknown/Multiple   |   Type of failure:  GHC rejects valid program
             Test Case:                     |        Blocked By:
              Blocking:                     |   Related Tickets:  #5108
Differential Revisions:                     |
-------------------------------------+-------------------------------------
 As reported by both hvr and user Yongqian Li:

 The Unicode 7 update in GHC 7.10 had the side effect of breaking code that uses subscript symbols: such code compiled with GHC 7.8.4 but no longer compiles with GHC 7.10.1.

 For instance, GHCi 7.8.4 accepts

     let xᵦ = 1
     let xᵤ = 1
     let xᵩ = 1
     let xᵢ = 1
     let xᵪ = 1
     let xᵣ = 1
     let xₙ = 1

 whereas GHC 7.10.1RC fails to parse them and reports a lexical error. (NB: GHC 7.8 does not accept *all* Latin subscript letters either.)

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/10196
GHC http://www.haskell.org/ghc/
The Glasgow Haskell Compiler

Comment (by thomie):

 I think Simon's suggested change in [https://ghc.haskell.org/trac/ghc/ticket/5108#comment:4 #5108] would fix this:

 "allow the category Lm (MODIFIER LETTER) as part of an identifier? That would include all the primes and subscript/superscript things."

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/10196#comment:1
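To see which Unicode general category the characters from the ticket description fall into, one can ask `Data.Char.generalCategory` directly. The following is only a small sketch: the names `ticketSubscripts` and `subscriptCategories` are made up for illustration, and the result depends on the Unicode tables shipped with the installed `base`. With tables at Unicode 7.0 or later, the thread implies these should all come out as `ModifierLetter`.

```haskell
import Data.Char (GeneralCategory (..), generalCategory)

-- The seven subscript characters from the ticket description, written as
-- escapes so the file does not depend on the editor's encoding.
-- (Hypothetical name, used only in this sketch.)
ticketSubscripts :: [Char]
ticketSubscripts =
  [ '\x1D66' -- ᵦ
  , '\x1D64' -- ᵤ
  , '\x1D69' -- ᵩ
  , '\x1D62' -- ᵢ
  , '\x1D6A' -- ᵪ
  , '\x1D63' -- ᵣ
  , '\x2099' -- ₙ
  ]

-- Pair each character with the category reported by the installed base.
subscriptCategories :: [(Char, GeneralCategory)]
subscriptCategories = [ (c, generalCategory c) | c <- ticketSubscripts ]
```

Loading this in GHCi and running `mapM_ print subscriptCategories` shows one pair per line.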

Old description:

> As reported by both hvr and user Yongqian Li:
>
> The Unicode 7 update in GHC 7.10 had the side effect of breaking code making use of subscript symbols that did compile with GHC 7.8.4, but won't anymore with GHC 7.10.1:
>
> For instance, GHCi 7.8.4 accepts
>
>     let xᵦ = 1
>     let xᵤ = 1
>     let xᵩ = 1
>     let xᵢ = 1
>     let xᵪ = 1
>     let xᵣ = 1
>     let xₙ = 1
>
> whereas GHC 7.10.1RC fails parsing those with a lexical error. (NB: GHC 7.8 does not accept *all* Latin subscript letters either).

New description:

 As reported by both hvr and user Yongqian Li:

 The [changeset:d4fd16801bc59034abdc6214e60fcce2b21af9c8 Unicode 7.0 update] in GHC 7.10 had the side effect of breaking code making use of subscript symbols that did compile with GHC 7.8.4, but won't anymore with GHC 7.10.1:

 For instance, GHCi 7.8.4 accepts

 {{{#!hs
 let xᵦ = 1
 let xᵤ = 1
 let xᵩ = 1
 let xᵢ = 1
 let xᵪ = 1
 let xᵣ = 1
 let xₙ = 1
 }}}

 whereas GHC 7.10.1RC fails parsing those with a lexical error. (NB: GHC 7.8 does not accept ''all'' Latin subscript letters either).

--

Comment (by hvr):

 Minor markup improvement in ticket-description

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/10196#comment:2

Comment (by thomie):

 But perhaps don't allow a "MODIFIER LETTER" as the first character of an identifier.

 We're unfortunately outside of the report territory here.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/10196#comment:3

Comment (by yongqli):

 @Thomas Miedema: Yes, allowing subscript characters only from the second character on is fine for me.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/10196#comment:4

Changes (by hvr):

 * owner:  => hvr

Comment:

 We're planning to allow `Lm` from the 2nd character on in an identifier for 7.10.2

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/10196#comment:5
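The rule proposed in comment:5 ("`Lm` from the 2nd character on") can be sketched as a predicate over strings. This is a deliberately simplified model and not GHC's lexer: the names `isIdentStart`, `isIdentRest` and `looksLikeVarId` are hypothetical, the set of accepted categories is an assumption, and reserved words are ignored.

```haskell
import Data.Char (GeneralCategory (..), generalCategory)

-- First character of a variable identifier: a lowercase letter or '_'.
-- A modifier letter (Lm) is deliberately NOT accepted here.
isIdentStart :: Char -> Bool
isIdentStart c = c == '_' || generalCategory c == LowercaseLetter

-- Any later character: letters, decimal digits, '_', '\'',
-- and (the proposed addition) modifier letters.
isIdentRest :: Char -> Bool
isIdentRest c =
  c `elem` "_'" ||
  generalCategory c `elem`
    [ UppercaseLetter, LowercaseLetter, TitlecaseLetter
    , DecimalNumber, ModifierLetter ]

-- Simplified check: does the string look like a variable identifier
-- under the "Lm from the 2nd character on" rule?
looksLikeVarId :: String -> Bool
looksLikeVarId []       = False
looksLikeVarId (c : cs) = isIdentStart c && all isIdentRest cs
```

Under this model a name such as `xₙ` passes, while a name that starts with a subscript character does not.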

Changes (by thoughtpolice):

 * owner:  hvr => thoughtpolice

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/10196#comment:6

Changes (by thoughtpolice):

 * status:  new => patch
 * differential:   => Phab:D969

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/10196#comment:7

Changes (by thoughtpolice):

 * milestone:  7.10.2 => 7.10.3

Comment:

 I think we're going to end up punting this to 7.10.3 at least, because the current patch has some regressions, and this is ultimately fairly minor.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/10196#comment:8

Comment (by thomie):

 Replying to [comment:5 hvr]:
 > We're planning to allow `Lm` from the 2nd character on in an identifier for 7.10.2

 The current patch does exactly this. It still needs a changelog entry.

 I would have preferred to only allow `Lm` in the suffix of an identifier. But we can leave that for 7.12 or later, as there is a slight chance it breaks someone's code. We could mention it in the docs.

 There's also the issue that ModifierLetter perhaps brings in too many weird characters:

     "15-06-18T11:46:27"< hvr@> thomie: can we easily list all modifier letters in Haskell?
     "15-06-18T11:47:55"< hvr@> [ c | c <- ['\0'..], generalCategory c == ModifierLetter ]
     "15-06-18T11:47:56"< hvr@> got it
     "15-06-18T11:48:56"< hvr@> ok, there's a lot in there one doesn't want to allow in identifiers :-/
     "15-06-18T11:49:31"< thomie > booh
     "15-06-18T11:49:50"< hvr@> these look nasty:
     "15-06-18T11:50:46"< hvr@> so many column variants, theres also "ː"

 hvr: do you think this is a big enough issue to not proceed with the current patch?

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/10196#comment:9
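The list comprehension from the IRC log in comment:9 can be turned into a small module for browsing the `Lm` category. This is a sketch only: `modifierLetters` and `describe` are names invented here, and the exact contents of the list depend on the Unicode version of the installed `base`.

```haskell
import Data.Char (GeneralCategory (ModifierLetter), generalCategory, ord, toUpper)
import Numeric (showHex)

-- Every character whose general category is Lm (MODIFIER LETTER),
-- according to the Unicode tables of the installed base.
modifierLetters :: [Char]
modifierLetters =
  [ c | c <- [minBound .. maxBound], generalCategory c == ModifierLetter ]

-- Render a character as "U+XXXX <char>", padding the code point
-- to at least four hex digits.
describe :: Char -> String
describe c = "U+" ++ pad (map toUpper (showHex (ord c) "")) ++ " " ++ [c]
  where
    pad s = replicate (4 - length s) '0' ++ s
```

In GHCi, `mapM_ (putStrLn . describe) modifierLetters` prints the whole category one character per line, which makes it easier to judge which of them would be unwelcome in identifiers.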

Comment (by Ben Gamari

Changes (by bgamari):

 * status:  patch => closed
 * resolution:   => fixed

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/10196#comment:11

Comment (by hvr):
Replying to [comment:10 Ben Gamari
> parser: Allow Lm (MODIFIER LETTER) category in identifiers
>
> Easy fix in the parser to stop regressions, due to Unicode 7.0 changing the classification of some prior code points.

 nitpick: the way the commit message is worded (as well as the comments in this ticket) suggests that e.g. `xᵦx` is now a valid identifier... which it isn't...

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/10196#comment:12

Changes (by hvr):

 * owner:  thoughtpolice =>
 * status:  closed => new
 * resolution:  fixed =>

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/10196#comment:13

Comment (by hvr):

 I'm reopening this temporarily, because GHC 7.8.4 does in fact accept e.g.

 {{{
 λ:6> let xᵦx = ()
 xᵦx :: ()
 }}}

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/10196#comment:14
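The `xᵦx` example separates two readings that come up in this thread: modifier letters only as a trailing suffix, versus modifier letters anywhere after the first character. The sketch below models both over a deliberately tiny alphabet (ASCII letters, digits, `_` and `'`, plus `Lm`). The function names are hypothetical and neither function is GHC's actual lexer rule.

```haskell
import Data.Char (GeneralCategory (ModifierLetter), generalCategory)
import Data.List (dropWhileEnd)

isLm :: Char -> Bool
isLm c = generalCategory c == ModifierLetter

-- Simplified "ordinary" identifier characters (ASCII only, an assumption
-- made to keep the sketch short).
isPlainStart :: Char -> Bool
isPlainStart c = c `elem` ['a' .. 'z'] || c == '_'

isPlainRest :: Char -> Bool
isPlainRest c = c `elem` (['a' .. 'z'] ++ ['A' .. 'Z'] ++ ['0' .. '9'] ++ "_'")

-- Reading 1: modifier letters may only form a trailing suffix.
suffixOnly :: String -> Bool
suffixOnly s = case dropWhileEnd isLm s of
  []       -> False
  (c : cs) -> isPlainStart c && all isPlainRest cs

-- Reading 2: modifier letters may appear anywhere after the first character.
anywhereAfterFirst :: String -> Bool
anywhereAfterFirst []       = False
anywhereAfterFirst (c : cs) =
  isPlainStart c && all (\x -> isPlainRest x || isLm x) cs
```

Both readings agree on names like `xᵦ`; they differ on `xᵦx`, which only the second reading admits.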

Changes (by thomie):

 * status:  new => merge
 * milestone:  7.10.3 => 7.10.2

Comment:

 Please merge.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/10196#comment:15

Changes (by bgamari):

 * status:  merge => closed
 * resolution:   => fixed

Comment:

 This has been merged to `ghc-7.10` as 358e0a8d4cb49baa29cf6b001eaa9d4ac428bb2d.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/10196#comment:16

Changes (by nomeata):

 * status:  closed => new
 * resolution:  fixed =>

Comment:

 Sorry to bring this up late, but the report specifies “Haskell compilers are expected to make use of new versions of Unicode as they are made available.” So if we deviate from that, we should make sure that

  * the user’s guide explicitly lists all deviations from the report [in this section](https://downloads.haskell.org/~ghc/latest/docs/html/users_guide/bugs-and-infelicities.html#infelicities-lexical), and
  * the Haskell Prime committee is made aware of these (sensible) deviations, so that they can become official.

 This is also important for, e.g., #11012.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/10196#comment:17

Changes (by thomie):

 * status:  new => closed
 * testcase:   => parser/should_compile/T10196, parser/should_fail/T10196Fail1, parser/should_fail/T10196Fail2
 * resolution:   => fixed
 * keywords:   => unicode, report-impact

Comment:

 > So if we deviate

 I've opened #11609 for that.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/10196#comment:18

Comment (by Thomas Miedema