[GHC] #8524: GHC is inconsistent with the Haskell Report on which Unicode characters are allowed in string and character literals
#8524: GHC is inconsistent with the Haskell Report on which Unicode characters are
allowed in string and character literals
----------------------------+----------------------------------------------
Reporter: oerjan | Owner:
Type: bug | Status: new
Priority: normal | Milestone:
Component: Compiler | Version: 7.6.3
Keywords: | Operating System: Unknown/Multiple
Architecture: Unknown/Multiple | Type of failure: GHC rejects valid program
Difficulty: Unknown | Test Case:
Blocked By: | Blocking:
Related Tickets: |
----------------------------+----------------------------------------------
GHC is inconsistent with the Haskell Report on which Unicode characters
are allowed in string and character literals. (And I don't like either
option, why leave out any characters in strings unnecessarily?)

Examples from ghci 7.6.3 (also tested in lambdabot on irc):
{{{
Prelude> "​" -- Unicode char \8203, Format class.

<interactive>:10:2:
    lexical error in string/character literal at character '\8203'
Prelude> " " -- Unicode char \8202, Space class.
"\8202"
Prelude> "t\ \est" -- Unicode char \8202 in a string gap.

<interactive>:14:4:
    lexical error in string/character literal at character '\8202'
}}}

My reading of
http://www.haskell.org/onlinereport/haskell2010/haskellch2.html (section
2.2 and 2.6):

* The report BNF token "graphic", which can be used in literals, includes
  indirectly many Unicode classes, but uniWhite is not one of them. Thus
  the only Unicode whitespace allowed to represent itself in literals is
  ASCII space.
* Unicode formatting characters are not mentioned in the BNF that I can
  see, so are not allowed in literals.
* String gaps are made out of the report BNF token whitespace, which
  ''does'' include uniWhite.

Who wants what:
|| ||= GHC =||= Report =||= Me =||
|| Format in string || No || No || Yes ||
|| Space/uniWhite in string || Yes || No || Yes ||
|| Space/uniWhite in string gap || No || Yes || Dunno ||

In short, GHC's behavior is buggy and/or annoying in two opposite ways:

* It leaves out some Unicode characters as allowable in strings and
  character literals, presumably because the report says so.
* It allows some characters the report says it ''shouldn't'', and refuses
  some characters the report says it ''should''.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8524
GHC http://www.haskell.org/ghc/
The Glasgow Haskell Compiler
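
The reading of sections 2.2 and 2.6 in the report can be sketched as three predicates. This is a minimal sketch, not GHC's lexer: the names `isReportGraphic`, `allowedInString` and `allowedInGap` are hypothetical, the mapping of report tokens onto `Data.Char` general categories is an assumption, and `isSpace` is assumed to stand in for the report's whitechar (including uniWhite).

```haskell
import Data.Char
  ( GeneralCategory (..), generalCategory, isPunctuation, isSpace, isSymbol )

-- | Approximation of the report token @graphic@:
-- small | large | symbol | digit | special | " | '
-- Assumption: lowercase, uppercase and titlecase letters, decimal digits,
-- and every punctuation or symbol category count as graphic.
isReportGraphic :: Char -> Bool
isReportGraphic c =
  generalCategory c `elem` lettersAndDigits || isPunctuation c || isSymbol c
  where
    lettersAndDigits =
      [LowercaseLetter, UppercaseLetter, TitlecaseLetter, DecimalNumber]

-- | Characters that may represent themselves inside a string literal:
-- graphic characters other than '"' and '\\', plus the ASCII space.
allowedInString :: Char -> Bool
allowedInString c = c == ' ' || (isReportGraphic c && c `notElem` "\"\\")

-- | Characters that may appear between the backslashes of a string gap.
-- Assumption: 'isSpace' approximates whitechar, uniWhite included.
allowedInGap :: Char -> Bool
allowedInGap = isSpace
```

Under these assumptions, `allowedInString '\8203'` and `allowedInString '\8202'` are both `False` while `allowedInGap '\8202'` is `True`, which agrees with the Report column of the table.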
#8524: GHC is inconsistent with the Haskell Report on which Unicode characters are
allowed in string and character literals
-------------------------------------+-------------------------------------
Reporter: oerjan | Owner:
Type: bug | Status: new
Priority: low | Milestone:
Component: Compiler | Version: 7.6.3
Resolution: | Keywords: newcomer
Operating System: Unknown/Multiple | Architecture: Unknown/Multiple
Type of failure: GHC rejects valid program | Test Case:
Blocked By: | Blocking:
Related Tickets: | Differential Revisions:
-------------------------------------+-------------------------------------
Changes (by thomie):
* keywords: => newcomer
* priority: normal => low
Comment:
We should probably follow the report here. Shouldn't be too difficult.
The file to change is `compiler/parser/Lexer.x`. Look at the functions
`lex_string` and `lex_stringgap`.
--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8524#comment:1
#8524: GHC is inconsistent with the Haskell Report on which Unicode characters are
allowed in string and character literals
-------------------------------------+-------------------------------------
Reporter: oerjan | Owner:
Type: bug | Status: new
Priority: low | Milestone:
Component: Compiler (Parser) | Version: 7.6.3
Resolution: | Keywords: newcomer
Operating System: Unknown/Multiple | Architecture: Unknown/Multiple
Type of failure: GHC rejects valid program | Test Case:
Blocked By: | Blocking:
Related Tickets: | Differential Revisions:
-------------------------------------+-------------------------------------
Changes (by thomie):
* component: Compiler => Compiler (Parser)
--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8524#comment:2
#8524: GHC is inconsistent with the Haskell Report on which Unicode characters are
allowed in string and character literals
-------------------------------------+-------------------------------------
Reporter: oerjan | Owner: RyanGlScott
Type: bug | Status: new
Priority: low | Milestone:
Component: Compiler (Parser) | Version: 7.6.3
Resolution: | Keywords: newcomer
Operating System: Unknown/Multiple | Architecture: Unknown/Multiple
Type of failure: GHC rejects valid program | Test Case:
Blocked By: | Blocking:
Related Tickets: | Differential Revisions: Phab:D1235
-------------------------------------+-------------------------------------
Changes (by RyanGlScott):
* owner: => RyanGlScott
* differential: => Phab:D1235
--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8524#comment:3
#8524: GHC is inconsistent with the Haskell Report on which Unicode characters are
allowed in string and character literals
-------------------------------------+-------------------------------------
Reporter: oerjan | Owner: RyanGlScott
Type: bug | Status: new
Priority: low | Milestone:
Component: Compiler (Parser) | Version: 7.6.3
Resolution: | Keywords: newcomer
Operating System: Unknown/Multiple | Architecture: Unknown/Multiple
Type of failure: GHC rejects valid program | Test Case:
Blocked By: | Blocking:
Related Tickets: | Differential Revisions: Phab:D1235
-------------------------------------+-------------------------------------
Comment (by rwbarton):
My "Me" column would go Dunno, Yes, Yes.
Particularly for the second case (Space/uniWhite in string), I don't see
how allowing it would cause much harm and there may well be people using
this behavior already. Maybe it would be better to just document this
deviation from the Report?
--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8524#comment:4
#8524: GHC is inconsistent with the Haskell Report on which Unicode characters are
allowed in string and character literals
-------------------------------------+-------------------------------------
Reporter: oerjan | Owner: RyanGlScott
Type: bug | Status: new
Priority: low | Milestone:
Component: Compiler (Parser) | Version: 7.6.3
Resolution: | Keywords: newcomer
Operating System: Unknown/Multiple | Architecture: Unknown/Multiple
Type of failure: GHC rejects valid program | Test Case:
Blocked By: | Blocking:
Related Tickets: | Differential Revisions: Phab:D1235
-------------------------------------+-------------------------------------
Comment (by RyanGlScott):
It looks like I might have jumped the gun on this! That's okay, though;
let's figure out what exactly needs to change.
It looks like people want Space/uniWhite in strings, which GHC already
allows, so no changes are needed there. At least one person has expressed
the desire for inclusion of Format in strings and Space/uniWhite in string
gaps, so unless anyone has any objections, should we push to make GHC as
inclusive as possible w.r.t. Unicode?
--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8524#comment:5
#8524: GHC is inconsistent with the Haskell Report on which Unicode characters are
allowed in string and character literals
-------------------------------------+-------------------------------------
Reporter: oerjan | Owner:
| RyanGlScott
Type: bug | Status: new
Priority: low | Milestone:
Component: Compiler | Version: 7.6.3
(Parser) |
Resolution: | Keywords: newcomer
Operating System: Unknown/Multiple | Architecture:
Type of failure: GHC rejects | Unknown/Multiple
valid program | Test Case:
Blocked By: | Blocking:
Related Tickets: | Differential Revisions: Phab:D1235
-------------------------------------+-------------------------------------
Changes (by thomie):
* cc: hvr (added)
Comment:
@RyanGlScott: sorry about that, I should not have put the newcomer keyword
on this ticket prematurely.
Some code:
* Whitespace characters that the report excludes from strings:
{{{
> import Data.Char
> import Data.List
> delete '\SP' $ filter isSpace ['\0'..]
"\t\n\v\f\r\160\5760\8192\8193\8194\8195\8196\8197\8198\8199\8200\8201\8202\8239\8287\12288"
}}}
* Whitespace characters that GHC excludes from strings:
{{{
> filter (\c -> generalCategory c == Control && isSpace c) ['\0'..]
"\t\n\v\f\r"
}}}
* General categories (as reported by `generalCategory`) that both the
  report and GHC also exclude from strings:
{{{
> nub $ map generalCategory $ filter (not . isPrint) ['\0'..]
[Control,Format,NotAssigned,LineSeparator,ParagraphSeparator,Surrogate,PrivateUse]
}}}
If we're going to be "as inclusive as possible", why not allow all of
these? Are there any downsides to this? Perhaps under a new flag
`FullUnicodeStrings`, enabled by default and disabled in Haskell98 and
Haskell2010 mode.
I'm also ok with just mentioning the current deviation from the report in
https://downloads.haskell.org/~ghc/latest/docs/html/users_guide/bugs-and-infelicities.html.
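
For reproducing the three GHCi queries outside an interactive session, the same expressions can be placed in a small module. This is a sketch only: the binding names are hypothetical, and the exact contents of the first and third results depend on the Unicode tables shipped with the `base` library in use, so they may differ from the output quoted in this message.

```haskell
import Data.Char (GeneralCategory (..), generalCategory, isPrint, isSpace)
import Data.List (delete, nub)

-- | Whitespace characters other than the ASCII space, i.e. the ones the
-- report does not allow to represent themselves in a string literal.
reportExcludedWhitespace :: String
reportExcludedWhitespace = delete '\SP' (filter isSpace ['\0' ..])

-- | Whitespace characters in the Control category, which GHC rejects.
ghcExcludedWhitespace :: String
ghcExcludedWhitespace =
  filter (\c -> generalCategory c == Control && isSpace c) ['\0' ..]

-- | General categories of all characters that 'isPrint' rejects.
excludedCategories :: [GeneralCategory]
excludedCategories =
  nub (map generalCategory (filter (not . isPrint) ['\0' ..]))
```

Loading the module in GHCi and evaluating each binding should give results comparable to the three listings in this message.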
#8524: GHC is inconsistent with the Haskell Report on which Unicode characters are
allowed in string and character literals
-------------------------------------+-------------------------------------
Reporter: oerjan | Owner: RyanGlScott
Type: bug | Status: new
Priority: low | Milestone:
Component: Compiler (Parser) | Version: 7.6.3
Resolution: | Keywords: newcomer
Operating System: Unknown/Multiple | Architecture: Unknown/Multiple
Type of failure: GHC rejects valid program | Test Case:
Blocked By: | Blocking:
Related Tickets: | Differential Revisions: Phab:D1235
-------------------------------------+-------------------------------------
Comment (by oerjan):
Replying to [comment:6 thomie]:
> If we're going to be "as inclusive as possible", why not allow all of
> these? Are there any downsides to this? Perhaps under a new flag
> `FullUnicodeStrings`, enabled by default and disabled in Haskell98 and
> Haskell2010 mode.

A couple thoughts:

Allowing \n in strings severely messes with layout. I've always assumed
that's why it was disallowed in the first place. I suppose this also
applies to the other *Separators and perhaps "\v\f\r" too (which would
probably be *Separators if they weren't grandfathered as Control). And \r
has the varying newline encoding issue.

Unless I'm severely mistaken, Surrogate only exists because of the Unicode
multiple encodings mess, and shouldn't ever really be ''used'' in UTF-8.
I guess including them is fairly harmless but might trip up someone doing
a bad encoding conversion.
--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8524#comment:7
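
The concerns raised in this comment can be summarised as a small classifier. This is only an illustration of the argument, not a proposed lexer rule: the `Concern` type and the `concernFor` function are hypothetical names, and treating "\v\f\r" as layout hazards follows the comment's tentative "perhaps".

```haskell
import Data.Char (GeneralCategory (..), generalCategory)

-- | Why a raw character inside a string literal might be a problem.
data Concern
  = BreaksLayout      -- starts or implies a new line in the source file
  | EncodingArtifact  -- only meaningful as part of a UTF-16 encoding
  | NoKnownConcern
  deriving (Eq, Show)

-- | Classify a character according to the concerns listed in the comment.
concernFor :: Char -> Concern
concernFor c
  | c `elem` "\n\v\f\r" = BreaksLayout
  | generalCategory c `elem` [LineSeparator, ParagraphSeparator] = BreaksLayout
  | generalCategory c == Surrogate = EncodingArtifact
  | otherwise = NoKnownConcern
```

For example, `concernFor '\n'` gives `BreaksLayout`, while the Format character `'\8203'` from the original report gives `NoKnownConcern`.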
#8524: GHC is inconsistent with the Haskell Report on which Unicode characters are
allowed in string and character literals
-------------------------------------+-------------------------------------
Reporter: oerjan | Owner: RyanGlScott
Type: bug | Status: new
Priority: low | Milestone:
Component: Compiler (Parser) | Version: 7.6.3
Resolution: | Keywords: unicode
Operating System: Unknown/Multiple | Architecture: Unknown/Multiple
Type of failure: GHC rejects valid program | Test Case:
Blocked By: | Blocking:
Related Tickets: | Differential Rev(s): Phab:D1235
Wiki Page: |
-------------------------------------+-------------------------------------
Changes (by thomie):
* keywords: newcomer => unicode
--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8524#comment:8
#8524: GHC is inconsistent with the Haskell Report on which Unicode characters are
allowed in string and character literals
-------------------------------------+-------------------------------------
Reporter: oerjan | Owner:
Type: bug | Status: new
Priority: low | Milestone:
Component: Compiler (Parser) | Version: 7.6.3
Resolution: | Keywords: unicode
Operating System: Unknown/Multiple | Architecture: Unknown/Multiple
Type of failure: GHC rejects valid program | Test Case:
Blocked By: | Blocking:
Related Tickets: | Differential Rev(s): Phab:D1235
Wiki Page: |
-------------------------------------+-------------------------------------
Changes (by RyanGlScott):
* owner: RyanGlScott =>
--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8524#comment:9