[GHC] #10412: isAlphaNum includes mark characters, but neither isAlpha nor isNumber do

#10412: isAlphaNum includes mark characters, but neither isAlpha nor isNumber do
-------------------------------------+-------------------------------------
Reporter: Artyom.Kazak | Owner:
Type: bug | Status: new
Priority: normal | Milestone:
Component: libraries/base | Version: 7.10.1
Keywords: unicode | Operating System: Unknown/Multiple
Architecture: Unknown/Multiple | Type of failure: None/Unknown
Test Case: | Blocked By:
Blocking: | Related Tickets:
Differential Revisions: |
-------------------------------------+-------------------------------------
{{{#!hs
> isMark '\768'
True
> isAlphaNum '\768'
True
> (isAlpha '\768', isNumber '\768')
(False,False)
}}}

This behavior comes from this piece in WCsubst.c:

{{{
unipred(u_iswalnum,(GENCAT_LT|GENCAT_LU|GENCAT_LL|GENCAT_LM|GENCAT_LO|
                    GENCAT_MC|GENCAT_ME|GENCAT_MN|
                    GENCAT_NO|GENCAT_ND|GENCAT_NL))
}}}

I'm not sure what should be done here. Is it a bug with isAlphaNum? Or with isAlpha? How does it correspond to iswalnum's behavior in C++?

(And if it's a feature and not a bug, then it should definitely be documented.)

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/10412
GHC http://www.haskell.org/ghc/
The Glasgow Haskell Compiler
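To make the question concrete, here is a minimal Haskell sketch that models the category mask quoted above with `generalCategory` and compares it with the plain "alphabetic or numeric" reading. The names `oldAlnumCategories`, `oldStyleAlphaNum` and `alphaOrNumber` are hypothetical and exist only for this illustration; this is not the code in WCsubst.c or in base.

```haskell
import Data.Char (GeneralCategory (..), generalCategory, isAlpha, isMark, isNumber)

-- Categories named in the u_iswalnum mask quoted above (LT, LU, LL, LM, LO,
-- MC, ME, MN, NO, ND, NL), written with Data.Char's constructor names.
oldAlnumCategories :: [GeneralCategory]
oldAlnumCategories =
  [ TitlecaseLetter, UppercaseLetter, LowercaseLetter, ModifierLetter, OtherLetter
  , SpacingCombiningMark, EnclosingMark, NonSpacingMark
  , OtherNumber, DecimalNumber, LetterNumber
  ]

-- Hypothetical model of the predicate that the mask describes.
oldStyleAlphaNum :: Char -> Bool
oldStyleAlphaNum c = generalCategory c `elem` oldAlnumCategories

-- The alternative reading: "alphabetic or numeric", with no mark characters.
alphaOrNumber :: Char -> Bool
alphaOrNumber c = isAlpha c || isNumber c

main :: IO ()
main = do
  let c = '\768' -- U+0300 COMBINING GRAVE ACCENT
  print (generalCategory c)
  print (isMark c, oldStyleAlphaNum c, alphaOrNumber c)
```

The sketch deliberately avoids calling `isAlphaNum` itself, so the two readings can be compared regardless of which one a given GHC version implements.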

#10412: isAlphaNum includes mark characters, but neither isAlpha nor isNumber do

Comment (by hvr):

For the record, this was already an issue in GHC 7.8.4, and as far back as GHC 7.0.4:

{{{
GHCi, version 7.0.4: http://www.haskell.org/ghc/  :? for help
λ> import Data.Char
λ> length $ filter isMark $ filter (\c -> isAlphaNum c /= (isAlpha c && isNumber c)) ['\0'..]
1281
}}}

{{{
GHCi, version 7.8.4: http://www.haskell.org/ghc/  :? for help
λ> import Data.Char
λ> length $ filter isMark $ filter (\c -> isAlphaNum c /= (isAlpha c && isNumber c)) ['\0'..]
1498
}}}

{{{
GHCi, version 7.10.1.20150511: http://www.haskell.org/ghc/  :? for help
λ> import Data.Char
λ> length $ filter isMark $ filter (\c -> isAlphaNum c /= (isAlpha c && isNumber c)) ['\0'..]
1830
}}}

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/10412#comment:1
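As a rough variation on the one-liner in this comment, the following sketch tallies the disagreeing characters by general category instead of counting only the marks. It compares against `isAlpha c || isNumber c` rather than the `&&` used in the sessions above, on the assumption that "either predicate holds" is the intended comparison. The names `disagreements` and `tally` are hypothetical, and the output depends on the Unicode tables and the `isAlphaNum` definition of the GHC version used, so none is shown.

```haskell
import Data.Char (GeneralCategory, generalCategory, isAlpha, isAlphaNum, isNumber)
import Data.List (group, sort)

-- Characters on which isAlphaNum disagrees with "isAlpha or isNumber".
disagreements :: [Char]
disagreements =
  filter (\c -> isAlphaNum c /= (isAlpha c || isNumber c)) [minBound .. maxBound]

-- Count how many of the given characters fall in each general category.
tally :: [Char] -> [(GeneralCategory, Int)]
tally cs = [ (cat, length g) | g@(cat : _) <- group (sort (map generalCategory cs)) ]

main :: IO ()
main = mapM_ print (tally disagreements)
```

An empty result would mean the two definitions agree on every character for that compiler.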

#10412: isAlphaNum includes mark characters, but neither isAlpha nor isNumber do

Changes (by lelf):

 * cc: lelf (added)

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/10412#comment:2

#10412: isAlphaNum includes mark characters, but neither isAlpha nor isNumber do

Changes (by bgamari):

 * keywords: unicode => unicode, newcomer

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/10412#comment:3

#10412: isAlphaNum includes mark characters, but neither isAlpha nor isNumber do

Comment (by sighingnow):

`GENCAT_MC|GENCAT_ME|GENCAT_MN` has been included in `u_iswalnum` for more than 10 years. However, the documentation of `isAlphaNum` says "Selects alphabetic or numeric digit Unicode characters" and doesn't mention the "mark" characters.

Should we fix the documentation of `isAlphaNum` to include "mark" characters, or keep the documentation as it is and fix `u_iswalnum`?

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/10412#comment:4

#10412: isAlphaNum includes mark characters, but neither isAlpha nor isNumber do

Comment (by Azel):

From what I can see in various C and C++ documentation (e.g. [https://docs.microsoft.com/en-gb/cpp/c-runtime-library/reference/isalnum-iswalnum-isalnum-l-iswalnum-l Microsoft's], [https://www.gnu.org/software/libc/manual/html_node/Classification-of-Wide-Characters.html#Classification-of-Wide-Characters glibc's] or [http://en.cppreference.com/w/cpp/string/wide/iswalnum cppreference.com's], which refers us [http://www.open-std.org/JTC1/SC35/WG5/docs/30112d10.pdf here]), `iswalnum` should return `True` if either `iswalpha` or `iswdigit` does, so I guess `isAlphaNum` ought to do the same. That is, keep the documentation as it is and fix `u_iswalnum`.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/10412#comment:5

#10412: isAlphaNum includes mark characters, but neither isAlpha nor isNumber do

Comment (by Azel):

Looking a bit farther afield, all the languages I looked at that have an `isAlphaNum` equivalent define it as returning `True` if either of their `isAlpha` or `isNumber` equivalents does: for example [https://docs.oracle.com/javase/9/docs/api/java/lang/Character.html#isLetterOrDigit-int- Java's], [http://msdn.microsoft.com/en-gb/library/cay4xx2f(v=vs.110).aspx the .NET Framework's], [http://www.lispworks.com/documentation/HyperSpec/Body/13_ade.htm Common Lisp's], [https://docs.python.org/3/library/stdtypes.html#str.isalnum Python's] and [http://www.ada-auth.org/standards/12rm/html/RM-A-3-5.html Ada's]. (Python's documentation lists three functions to match on numbers in `isalnum`'s description, but the first two are subsumed by the third.)

So I'm willing to have a go at solving this ticket, and would be in favour of fixing `u_iswalnum` and keeping the doc mostly as it is. The doc states that `isAlphaNum` selects alphabetic or numeric digit Unicode characters, and currently, even if we remove the mark characters, it doesn't match only that, because it also matches `GENCAT_NO` and `GENCAT_NL`.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/10412#comment:6
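The last point, that `GENCAT_NO` and `GENCAT_NL` are numbers but not digits, can be illustrated with a small sketch. The sample characters and the name `samples` are chosen here only for illustration; `isDigit` is used as a stand-in for "digit" on the assumption that the narrow ASCII reading is what the documentation wording suggests.

```haskell
import Data.Char (generalCategory, isAlphaNum, isDigit, isNumber)

-- An ASCII digit, VULGAR FRACTION ONE HALF (OtherNumber) and
-- ROMAN NUMERAL ONE (LetterNumber).
samples :: [Char]
samples = ['7', '\xBD', '\x2160']

main :: IO ()
main = mapM_ report samples
  where
    -- Print each character with its category and the three predicates.
    report c = print (c, generalCategory c, isDigit c, isNumber c, isAlphaNum c)
```

For the last two samples `isNumber` holds while `isDigit` does not, which is the gap between "numeric digit" in the documentation and the categories the predicate accepts.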

#10412: isAlphaNum includes mark characters, but neither isAlpha nor isNumber do

Changes (by Azel):

 * owner: (none) => Azel

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/10412#comment:7

#10412: isAlphaNum includes mark characters, but neither isAlpha nor isNumber do

Changes (by Azel):

 * status: new => patch
 * differential: => Phab:D4593

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/10412#comment:8

#10412: isAlphaNum includes mark characters, but neither isAlpha nor isNumber do
Comment (by Ben Gamari

#10412: isAlphaNum includes mark characters, but neither isAlpha nor isNumber do

Changes (by bgamari):

 * status: patch => closed
 * resolution: => fixed
 * milestone: => 8.6.1

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/10412#comment:10

#10412: isAlphaNum includes mark characters, but neither isAlpha nor isNumber do
Comment (by Ben Gamari