
On Thu, Oct 22, 2009 at 07:56:56PM -0700, Ahn, Ki Yung wrote:
Ahn, Ki Yung wrote:
In the #haskell IRC channel, we just had a discussion on Data.Char predicates such as isAlpha, isUpper, isLower. The implementation of Data.Char is not Haskell 98, since the Char specification in Haskell 98 covers only Latin-1.
Char in Haskell98 covers Unicode too; http://haskell.org/onlinereport/char.html says:

    Function toUpper converts a letter to the corresponding upper-case letter, leaving any other character unchanged. Any Unicode letter which has an upper-case equivalent is transformed. Similarly, toLower converts a letter to the corresponding lower-case letter, leaving any other character unchanged.
However, the current predicates are confusing, and intuitive properties do not hold. One example is this:

[17:53:32] <newsham> > let cs = [minBound..maxBound]; us = filter isUpper cs; ls = filter isLower cs in take 5 $ (map toUpper ls) \\ us
[17:53:33] <lambdabot> "\170\186\223I\312"

isLower '\170' == True, but you can't turn that into an uppercase letter: toUpper '\170' == '\170'.
What behaviour would you expect?
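For anyone who wants to reproduce this locally, here is a minimal sketch of a more direct check than the list difference above: it collects every Char that isLower accepts but whose toUpper image is not accepted by isUpper. The name lowerWithoutUpper is made up for illustration, and the exact contents of the list depend on the Unicode tables shipped with your compiler's base library.

```haskell
import Data.Char (isLower, isUpper, toUpper)

-- Hypothetical helper: characters classified as lower case whose
-- toUpper image is not classified as upper case.
lowerWithoutUpper :: [Char]
lowerWithoutUpper =
  [ c
  | c <- [minBound .. maxBound]
  , isLower c
  , not (isUpper (toUpper c))
  ]
```

In GHCi, `take 5 lowerWithoutUpper` shows the first few offenders; '\223' (sharp s) should be among them, since it is a lowercase letter with no single-character uppercase mapping.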
Another problem is that, in the Haskell 98 Report, isAlpha is defined as isLower or isUpper. This is different from the current implementation: what isAlpha actually accepts is every character in any of the "Letter" categories.
Right, we have:

    isLower = "Letter, Lowercase"
    isUpper = "Letter, Uppercase" or "Letter, Titlecase"
    isAlpha = "Letter, Lowercase" or "Letter, Uppercase" or "Letter, Titlecase" or "Letter, Modifier" or "Letter, Other"

The report says:

    any alphabetic character which is not lower case is treated as upper case (Unicode actually has three cases: upper, lower, and title)

and defines:

    isAlpha c = isUpper c || isLower c

so the implementation is not consistent with the language definition. I wouldn't like to say which is "wrong", though (but I would guess "both" :-)

I think it would be great if someone were to design a new interface that provided something closer to the Unicode spec, perhaps in Data.Char.Unicode (we could make the current interface a layer on top).
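To make the mismatch concrete, here is a rough sketch that spells the three predicates out in terms of Data.Char.generalCategory and lists the characters on which the library's isAlpha disagrees with the Report's definition. The names lowerByCategory, upperByCategory, alphaByCategory, reportIsAlpha and disagreements are invented for this example; it assumes that the library's isUpper and isLower are category-based as described above.

```haskell
import Data.Char
  ( GeneralCategory (..)
  , generalCategory
  , isAlpha
  , isLower
  , isUpper
  )

-- Hypothetical predicates written directly against the Unicode
-- general category, mirroring the table above.
lowerByCategory, upperByCategory, alphaByCategory :: Char -> Bool
lowerByCategory c = generalCategory c == LowercaseLetter
upperByCategory c =
  generalCategory c `elem` [UppercaseLetter, TitlecaseLetter]
alphaByCategory c =
  generalCategory c
    `elem` [ LowercaseLetter
           , UppercaseLetter
           , TitlecaseLetter
           , ModifierLetter
           , OtherLetter
           ]

-- The Haskell 98 Report's definition of isAlpha.
reportIsAlpha :: Char -> Bool
reportIsAlpha c = isUpper c || isLower c

-- Characters on which the library's isAlpha and the Report's
-- definition give different answers.
disagreements :: [Char]
disagreements =
  filter (\c -> isAlpha c /= reportIsAlpha c) [minBound .. maxBound]
```

Any character in "Letter, Modifier" or "Letter, Other", such as '\1488' (Hebrew alef), should show up in disagreements, because it is a letter with no case at all.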
So, wouldn't it be better to have isAlpha follow the definition in the Haskell 98 Report, and just define a new predicate called isLetter if needed?
If your idea is to improve the handling of '\170' then this won't help. '\170' is "Letter, Lowercase".

Thanks
Ian