[GHC] #8730: Invalid Unicode Codepoints in Char

#8730: Invalid Unicode Codepoints in Char
------------------------------------+-------------------------------------
       Reporter:  mdmenzel          |             Owner:
           Type:  bug               |            Status:  new
       Priority:  low               |         Milestone:
      Component:  Compiler          |           Version:  7.6.3
       Keywords:  unicode           |  Operating System:  Unknown/Multiple
   Architecture:  Unknown/Multiple  |   Type of failure:  None/Unknown
     Difficulty:  Unknown           |         Test Case:
     Blocked By:                    |          Blocking:
Related Tickets:                    |
------------------------------------+-------------------------------------
The surrogate range in Unicode is supposed to be (as of Unicode 2.0, 1996) a range of invalid code points, yet Data.Char allows the use of values in this range (in fact, it even gives them their own GeneralCategory).

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8730
GHC http://www.haskell.org/ghc/
The Glasgow Haskell Compiler

#8730: Invalid Unicode Codepoints in Char
-------------------------------------+-------------------------------------
        Reporter:  mdmenzel          |             Owner:  ekmett
            Type:  bug               |            Status:  new
        Priority:  low               |         Milestone:
       Component:  Core Libraries    |           Version:  7.6.3
      Resolution:                    |          Keywords:  unicode
Operating System:  Unknown/Multiple  |      Architecture:  Unknown/Multiple
 Type of failure:  None/Unknown      |        Difficulty:  Unknown
       Test Case:                    |        Blocked By:
        Blocking:                    |   Related Tickets:
Differential Revisions:              |
-------------------------------------+-------------------------------------
Changes (by thomie):

 * cc: batterseapower, core-libraries-committee@… (added)
 * owner: => ekmett
 * component: Compiler => Core Libraries

Comment:

Thank you for the report. I am just adding some references.

{{{
Prelude Data.Char> all ((==) Surrogate . generalCategory) ['\xdc80' .. '\xdfff']
True
}}}

 * http://www.unicode.org/versions/Unicode7.0.0/ch23.pdf
 * http://tools.ietf.org/html/rfc3629
 * http://en.wikipedia.org/wiki/UTF-8#Invalid_code_points:
According to the UTF-8 definition (RFC 3629) the high and low surrogate halves used by UTF-16 (U+D800 through U+DFFF) are not legal Unicode values, and their UTF-8 encoding should be treated as an invalid byte sequence. Whether an actual application should do this is debatable, as it makes it impossible to store invalid UTF-16 (that is, UTF-16 with unpaired surrogate halves) in a UTF-8 string. This is necessary to store unchecked UTF-16 such as Windows filenames as UTF-8. It is also incompatible with CESU encoding (described below).
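
To make the rule quoted above concrete, here is a minimal sketch of a UTF-8 encoder that refuses surrogates. It is an illustration only, not the code in GHC's encoders: the names `isSurrogateCodePoint`, `encodeUtf8Char` and `encodeUtf8` are hypothetical and do not come from `base`.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- Hypothetical helper: True for code points in U+D800 .. U+DFFF.
isSurrogateCodePoint :: Char -> Bool
isSurrogateCodePoint c = n >= 0xD800 && n <= 0xDFFF
  where
    n = ord c

-- Encode a single Char as UTF-8 bytes.
-- Returns Nothing for a surrogate, following the RFC 3629 rule.
encodeUtf8Char :: Char -> Maybe [Word8]
encodeUtf8Char c
  | isSurrogateCodePoint c = Nothing
  | n < 0x80               = Just [fromIntegral n]
  | n < 0x800              = Just [0xC0 .|. hi 6, cont 0]
  | n < 0x10000            = Just [0xE0 .|. hi 12, cont 6, cont 0]
  | otherwise              = Just [0xF0 .|. hi 18, cont 12, cont 6, cont 0]
  where
    n = ord c

    -- Payload bits of the leading byte (the bits above position s).
    hi :: Int -> Word8
    hi s = fromIntegral (n `shiftR` s)

    -- Continuation byte: 10xxxxxx carrying six bits starting at position s.
    cont :: Int -> Word8
    cont s = 0x80 .|. fromIntegral ((n `shiftR` s) .&. 0x3F)

-- Encode a whole String; fails if any Char is a surrogate.
encodeUtf8 :: String -> Maybe [Word8]
encodeUtf8 = fmap concat . traverse encodeUtf8Char
```

Loaded into GHCi, `encodeUtf8Char '\x20AC'` yields `Just [226,130,172]`, while `encodeUtf8Char '\xD800'` yields `Nothing` even though `'\xD800'` is accepted as a `Char` literal. Returning `Nothing` is one possible design; an application that needs to round-trip unpaired surrogates (the Windows filename case mentioned above) would have to choose a different policy.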

In commit dc58b7398910a433259a6c0f58a0d05a48555191:

{{{
Author: Max Bolingbroke <>
Date:   Sat May 14 22:50:46 2011 +0100

    Big patch to improve Unicode support in GHC. Validated on OS X and
    Windows, this patch series fixes #5061, #1414, #3309, #3308, #3307,
    #4006 and #4855.
}}}

This commit adds checks like `... if isSurrogate c then done InvalidSequence ir ow else do ...` to GHC/IO/Encoding/UTF{8|16|32}.hs

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8730#comment:1