
Hi,
In the Haskell reference, I see the following definitions:
uniWhite -> any Unicode character defined as whitespace;
uniSmall -> any Unicode lowercase letter;
uniLarge -> any uppercase or titlecase Unicode letter;
uniSymbol -> any Unicode symbol or punctuation.
Where do I get lists for those characters? My first attempt was to check:
http://unicode.org/Public/UNIDATA/UnicodeData.txt
and consider large anything marked as CAPITAL and small anything marked as SMALL. I didn't know what to guess about the symbols. Am I using the right reference? How can I recognize (or get a list of) valid uppercase and lowercase Unicode letters, as well as symbols and punctuation?
Thanks for your help,
Maurício

You can't determine Unicode character properties by analyzing the names of the characters.
Read chapter 4 of the standard:
http://www.unicode.org/versions/Unicode5.0.0/ch04.pdf
and get the property values here:
http://www.unicode.org/Public/UNIDATA/DerivedCoreProperties.txt
It sounds like the properties you want are "Case" and "General Category". Maybe the spec should be more explicit on exactly how the definitions map onto Unicode properties, so there is no ambiguity.
Deborah
(In reply to Maurício's message of Aug 25, 2008, at 6:15 PM.)
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
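A small sketch of why character names are a poor guide to properties. It uses Python's standard `unicodedata` module only because that module exposes both the name and the General Category; the helper `guess_from_name` is hypothetical and simply mimics the CAPITAL/SMALL heuristic from the question.

```python
import unicodedata


def guess_from_name(ch):
    """Hypothetical heuristic from the question: classify by words in the name."""
    words = unicodedata.name(ch, "").split()
    if "CAPITAL" in words:
        return "large"
    if "SMALL" in words:
        return "small"
    return "unknown"


# U+1D00 is named "LATIN LETTER SMALL CAPITAL A" but is a lowercase letter (Ll).
# U+FE50 is named "SMALL COMMA" but is punctuation (Po), not a letter at all.
for ch in ["A", "\u1d00", "\ufe50"]:
    print(
        f"U+{ord(ch):04X} {unicodedata.name(ch)}: "
        f"name guess = {guess_from_name(ch)}, "
        f"General Category = {unicodedata.category(ch)}"
    )
```

The name-based guess disagrees with the actual property for the last two characters, which is the point made in the reply above.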

In chapter 4 I see the following nice table on page 139. Do you think I can use it together with UnicodeData.txt to choose valid characters for Haskell? It is the only place I found where the names match the Haskell syntax reference (uppercase, lowercase, punctuation, symbol).
Thanks,
Maurício
Table 4-7. General Category
Lu = Letter, uppercase
Ll = Letter, lowercase
Lt = Letter, titlecase
Lm = Letter, modifier
Lo = Letter, other
Mn = Mark, nonspacing
Mc = Mark, spacing combining
Me = Mark, enclosing
Nd = Number, decimal digit
Nl = Number, letter
No = Number, other
Pc = Punctuation, connector
Pd = Punctuation, dash
Ps = Punctuation, open
Pe = Punctuation, close
Pi = Punctuation, initial quote (may behave like Ps or Pe depending on usage)
Pf = Punctuation, final quote (may behave like Ps or Pe depending on usage)
Po = Punctuation, other
Sm = Symbol, math
Sc = Symbol, currency
Sk = Symbol, modifier
So = Symbol, other
Zs = Separator, space
Zl = Separator, line
Zp = Separator, paragraph
Cc = Other, control
Cf = Other, format
Cs = Other, surrogate
Co = Other, private use
Cn = Other, not assigned (including noncharacters)
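To make the question concrete, here is a rough sketch of classifying characters by General Category alone, again using Python's `unicodedata`. The mapping from categories to the Haskell lexical classes is an assumption made for illustration (the Haskell report does not spell it out), and `haskell_class` is a hypothetical helper. It ignores derived properties such as Uppercase and the report's separate treatment of ASCII and special characters, so treat it as a first approximation only.

```python
import unicodedata

# ASSUMPTION: one plausible reading of the report's definitions,
# keyed on General Category only.
LETTER_CLASSES = {"Ll": "uniSmall", "Lu": "uniLarge", "Lt": "uniLarge"}


def haskell_class(ch):
    """Return an assumed Haskell lexical class for ch, or None if none applies."""
    cat = unicodedata.category(ch)
    if cat in LETTER_CLASSES:
        return LETTER_CLASSES[cat]
    if cat[0] in ("S", "P"):  # Sm, Sc, Sk, So and Pc, Pd, Ps, Pe, Pi, Pf, Po
        return "uniSymbol"
    if cat[0] == "Z":  # Zs, Zl, Zp; an assumption for "whitespace"
        return "uniWhite"
    return None


for ch in ["a", "A", "\u01c5", "+", "!", "\u00a0", "1"]:
    print(f"U+{ord(ch):04X} {unicodedata.category(ch)} -> {haskell_class(ch)}")
```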

No, the general category is not enough. Please read both references. As you can tell from DerivedCoreProperties.txt, for example:
# Derived Property: Uppercase
# Generated from: Lu + Other_Uppercase
So general category Lu is not the same thing as "Uppercase".
Deborah
(In reply to Maurício's message of Aug 25, 2008, at 7:18 PM.)
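A minimal sketch of reading the derived Uppercase property and comparing it with General Category Lu. The `SAMPLE` text is a hand-abbreviated excerpt in the layout of DerivedCoreProperties.txt (the real file is much longer and its comments carry more detail), and `parse_property` is a hypothetical helper, not part of any library.

```python
import unicodedata

# Abbreviated excerpt in the file's "range ; property # comment" layout.
SAMPLE = """\
# Derived Property: Uppercase
#  Generated from: Lu + Other_Uppercase
0041..005A    ; Uppercase # LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
0100          ; Uppercase # LATIN CAPITAL LETTER A WITH MACRON
2160..216F    ; Uppercase # ROMAN NUMERAL ONE..ROMAN NUMERAL ONE THOUSAND
24B6..24CF    ; Uppercase # CIRCLED LATIN CAPITAL LETTER A..CIRCLED LATIN CAPITAL LETTER Z
"""


def parse_property(text, wanted):
    """Collect every code point that the text assigns to the property `wanted`."""
    code_points = set()
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not data:
            continue
        fields = [field.strip() for field in data.split(";")]
        if fields[1] != wanted:
            continue
        first, _, last = fields[0].partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start
        code_points.update(range(start, end + 1))
    return code_points


uppercase = parse_property(SAMPLE, "Uppercase")
# Code points that are Uppercase without having General Category Lu.
not_lu = sorted(cp for cp in uppercase if unicodedata.category(chr(cp)) != "Lu")
for cp in (not_lu[0], not_lu[-1]):
    print(f"U+{cp:04X} is Uppercase but has General Category {unicodedata.category(chr(cp))}")
```

The Roman numerals (Nl) and circled capital letters (So) in the excerpt are Uppercase through Other_Uppercase, so a filter on Lu alone would miss them.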

2008/8/26 Deborah Goldsmith wrote:
> It sounds like the properties you want are "Case" and "General Category". Maybe the spec should be more explicit on exactly how the definitions map onto Unicode properties, so there is no ambiguity.
This is proposed for Haskell'.
http://hackage.haskell.org/trac/haskell-prime/wiki/UnicodeInHaskellSource
says:
"The report should at least be absolutely clear about which Unicode character properties (N, Ll, Lu, Sm, etc.) correspond to which lexical class in the syntax."
I don't know if there's any difference in how current Haskell compilers handle this.
Andy
participants (3)
- Andy Smith
- Deborah Goldsmith
- Maurício