
Hello,
So I looked at what GHC does with Unicode, and to me it seems quite
reasonable:
* The alphabet is Unicode code points, so a valid Haskell program is
simply a list of those.
* Combining characters are not allowed in identifiers, so there is no
need for complex normalization rules: programs should always use the
precomposed ("short") version of a character, or be rejected.
* Combining characters may appear in string literals, and there they
are left "as is" without any modification, so some string literals may
be longer than what's displayed in a text editor (see the small GHCi
sketch after this list).
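Here is a quick GHCi sketch of that last point (my own illustration,
not taken from the report, and assuming a reasonably recent GHC):

  ghci> length "\x00F1"        -- n-tilde written as the single code point U+00F1
  1
  ghci> length "n\x0303"       -- 'n' followed by the combining tilde U+0303
  2
  ghci> "\x00F1" == "n\x0303"  -- literals are kept as is, no normalization
  False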
Perhaps this is simply what the report already states (I haven't
checked, for which I apologize), but if not, perhaps we should clarify
things.
-Iavor
PS: I don't think that there is any need to specify a particular
representation for the Unicode code points (e.g., UTF-8) in the
language standard.
On Fri, Mar 16, 2012 at 6:23 PM, Iavor Diatchki wrote:
Hello,
I am also not an expert, but I got curious and did a bit of Wikipedia reading. Based on what I understood, here are two (related) questions that it might be nice to clarify in a future version of the report:
1. What is the alphabet used by the grammar in the Haskell report? My understanding is that the intention is that the alphabet is Unicode code points (sometimes referred to as Unicode characters). There is no way to refer to specific code points by escaping as in Java (the link that Gaby shared); you just have to write the code points directly (and there are plenty of encodings for doing that, e.g., UTF-8).
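(For comparison: inside character and string literals GHC does accept numeric escapes, so for instance '\x00F1' == 'ñ' evaluates to True in GHCi. But that is a feature of literals, not a source-level escape like Java's \uXXXX; for identifiers and the rest of the program text the code point itself has to appear in the source, which is the point about the alphabet above.)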
2. Do we respect "Unicode equivalence" (http://en.wikipedia.org/wiki/Canonical_equivalence) in Haskell source code? The issue here is that, apparently, some sequences of Unicode code points/characters are supposed to be morally the same. For example, it would appear that there are two different ways to write the Spanish letter ñ: it has its own number, but it can also be made by writing "n" followed by a modifier to put the wavy sign on top.
I would guess that implementing "Unicode equivalence" would not be too hard: supposedly the Unicode standard specifies a "text normalization procedure". However, this would complicate the report specification, because now the alphabet becomes not just Unicode code points, but equivalence classes of code points.
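To make the ñ example concrete, here is a toy sketch of what such normalization would amount to. It is entirely hypothetical and handles only this single pair; a real implementation would use the Unicode composition tables (e.g. via a normalization library) rather than a hand-written case:

  -- Hypothetical, tiny stand-in for Unicode canonical composition (NFC):
  -- rewrite 'n' followed by the combining tilde (U+0303) to the
  -- precomposed U+00F1.  A real normalizer covers thousands of such pairs.
  toyNFC :: String -> String
  toyNFC ('n' : '\x0303' : rest) = '\x00F1' : toyNFC rest
  toyNFC (c : rest)              = c : toyNFC rest
  toyNFC []                      = []

  -- After normalization the two spellings of ñ compare equal:
  -- toyNFC "n\x0303" == "\x00F1"   ==>   True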
Thoughts?
-Iavor
On Fri, Mar 16, 2012 at 4:49 PM, Ian Lynagh wrote:
Hi Gaby,
On Fri, Mar 16, 2012 at 06:29:24PM -0500, Gabriel Dos Reis wrote:
OK, thanks! I guess a takeaway from this discussion is that what counts as punctuation is far less well defined than it appears...
I'm not really sure what you're asking. Haskell's uniSymbol includes all Unicode characters (should that be codepoints? I'm not a Unicode expert) in the punctuation category; I'm not sure what the best reference is, but e.g. table 12 in http://www.unicode.org/reports/tr44/tr44-8.html#Property_Values lists a number of Px categories, and a meta-category P "Punctuation".
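For what it's worth, GHC's Data.Char exposes these general categories directly, so one can poke at them from GHCi; here is a small sketch (my own, just to show where the categories live, not something taken from the report):

  import Data.Char (generalCategory, GeneralCategory(..))

  -- True for characters in any of the Unicode punctuation categories
  -- (Pc, Pd, Ps, Pe, Pi, Pf, Po), which is roughly the set that
  -- uniSymbol draws its punctuation from.
  isUnicodePunctuation :: Char -> Bool
  isUnicodePunctuation c = generalCategory c `elem`
    [ ConnectorPunctuation, DashPunctuation, OpenPunctuation
    , ClosePunctuation, InitialQuote, FinalQuote, OtherPunctuation ]

For example, isUnicodePunctuation '¿' and isUnicodePunctuation '_' are both True, while isUnicodePunctuation '+' is False ('+' falls in the math symbol category Sm).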
Thanks
Ian