
Hi,

The lexical structure chapter defines the non-terminal uniSymbol as

    uniSymbol ::= any Unicode symbol or punctuation

There is a slight ambiguity here: is that description supposed to be parsed as: (a) "Unicode (symbol or punctuation)", or (b) "(Unicode symbol) or punctuation"?

If (b), then what qualifies as "punctuation"? As far as I can tell, that is not defined anywhere in the Report. Is it "punctuation" in the basic ASCII charset or in the extended ASCII charset? Everywhere else the Report has been careful in listing which ASCII characters are meant.

Thanks,
-- Gaby

On Fri, Mar 16, 2012 at 14:08, Gabriel Dos Reis <gdr@integrable-solutions.net> wrote:
The lexical structure chapter defines the non-terminal uniSymbol as
uniSymbol ::= any Unicode symbol or punctuation
There is a slight ambiguity here: is that description supposed to be parsed as: (a) "Unicode (symbol or punctuation)", or (b) "(Unicode symbol) or punctuation"?
(a), and I thought the report specified that the language's lexemes are defined in terms of Unicode properties, so (a) is the only meaningful interpretation. (b) is not particularly meaningful, as your own question demonstrates.

--
brandon s allbery                                allbery.b@gmail.com
wandering unix systems administrator (available) (412) 475-9364 vm/sms

On Fri, Mar 16, 2012 at 1:18 PM, Brandon Allbery wrote:
On Fri, Mar 16, 2012 at 14:08, Gabriel Dos Reis wrote:
The lexical structure chapter defines the non-terminal uniSymbol as
uniSymbol ::= any Unicode symbol or punctuation
There is a slight ambiguity here: is that description supposed to be parsed as: (a) "Unicode (symbol or punctuation)", or (b) "(Unicode symbol) or punctuation"?
(a) and I thought the report specified that the language's lexemes are defined in terms of Unicode properties so (a) is the only meaningful interpretation. (b) is not particularly meaningful, as your own question demonstrates.
It is not clear what "the language's lexemes are defined in terms of Unicode properties" really means. Why would you need ascSmall (and similar ASCII character categories) then, when you already have uniSmall and associates?

It is not clear that (b) is all that "not particularly meaningful". Have a look at the production <symbol>: it excludes double quote (") and apostrophe (') from uniSymbol.

-- Gaby
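
For concreteness, reading (a) can be sketched with nothing more than Data.Char, whose isSymbol and isPunctuation test the Unicode general categories S* and P*; the exclusions are the ones the Report's <symbol> production applies on top of uniSymbol. The names isUniSymbol and isHaskellSymbolChar below are made up for illustration; this is a sketch, not text from the Report or from any implementation.

    import Data.Char (isSymbol, isPunctuation)

    -- Reading (a): any character whose Unicode general category is a
    -- Symbol category (Sm, Sc, Sk, So) or a Punctuation category
    -- (Pc, Pd, Ps, Pe, Pi, Pf, Po).
    isUniSymbol :: Char -> Bool
    isUniSymbol c = isSymbol c || isPunctuation c

    -- The <special> characters of the Report's lexical grammar.
    special :: String
    special = "(),;[]`{}"

    -- The <symbol> production then removes special, '_', '"' and '\''.
    isHaskellSymbolChar :: Char -> Bool
    isHaskellSymbolChar c = isUniSymbol c && c `notElem` (special ++ "_\"'")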

On Fri, Mar 16, 2012 at 14:30, Gabriel Dos Reis <gdr@integrable-solutions.net> wrote:
It is not clear what "the language's lexemes are defined in terms of Unicode properties" really means. Why would you need ascSmall (and similar ASCII character categories) then when you already have uniSmall and associates?
I have to assume that is a leftover from an earlier version of the report, because it is indeed already included. See section 2.1: "Haskell uses the Unicode [11] character set. However, source programs are currently biased toward the ASCII character set used in earlier versions of Haskell." I understand this to indicate that Unicode character classes are intended, and it does indeed hint that references to ASCII are references to older versions of the language (and should probably be considered fossils, as ASCII itself is; the American Standard Code for Information Interchange was obsoleted by ISO 8859, and modern references to "ASCII" usually should be taken to mean "ISO 8859/1").
It is not clear that (b) is all that "not particularly meaningful". Have a look at the production <symbol>: it excludes double quote(") and apostrophe (') from uniSymbol.
The notion of "symbol, minus certain lexemes that have other meanings *that are specified elsewhere in the report*" is not precise enough? It may be difficult to characterize things with your required precision, since every general statement would then have to carry part, or potentially all, of the Report within it if it is not sufficient to use the statement's context (as describing some part of the Report).

--
brandon s allbery                                allbery.b@gmail.com
wandering unix systems administrator (available) (412) 475-9364 vm/sms

On Fri, Mar 16, 2012 at 1:49 PM, Brandon Allbery wrote:
On Fri, Mar 16, 2012 at 14:30, Gabriel Dos Reis wrote:
It is not clear what "the language's lexemes are defined in terms of Unicode properties" really means. Why would you need ascSmall (and similar ASCII character categories) then when you already have uniSmall and associates?
I have to assume that is a leftover from an earlier version of the report, because it is indeed already included.
I believe this part has seen very little change from the Revised Haskell 98 Report. It is not clear that it is an unintended leftover. Section 2.1 that you quote below is the same as in the (Revised) Haskell 98 report.
See in section 2.1:
"Haskell uses the Unicode [11] character set. However, source programs are currently biased toward the ASCII character set used in earlier versions of Haskell."
I understand this to indicate that Unicode character classes are intended, and it does indeed hint that references to ASCII are references to older versions of the language (and should probably be considered fossils, as ASCII itself is; the American Standard Code for Information Interchange was obsoleted by ISO 8859, and modern references to "ASCII" usually should be taken to mean "ISO 8859/1").
Unicode support is clearly intended. Also clearly, ASCII support is intended. However, the Report does not say what the concrete syntax of a Unicode character should be. (At least I have been unable to find it from the report.)
It is not clear that (b) is all that "not particularly meaningful". Have a look at the production <symbol>: it excludes double quote(") and apostrophe (') from uniSymbol.
The notion of "symbol with certain lexicals that have other meanings *that are specified elsewhere in the report*" is not precise enough? It may be difficult to characterize things with your required precision, since every general statement will necessarily have to carry part or potentially all of the entire Report within it if it is not sufficient to use the statement's context (as describing some part of the Report).
Well, I hope nobody is suggesting that it is unreasonable to require precision of a language definition -- especially of Haskell! :-) A problem with "use the statement's context" is that the contexts themselves are not unquestionably unambiguous -- which is part of the reason we are having this conversation in the first place.

That being said, I am not sure how the passage you quote applies here or answers the original questions conclusively. Where else is punctuation defined in the Report? What is the concrete syntax of a punctuation character? If you were going to write a lexer and a parser for Haskell, how would you recognize a character as punctuation?

-- Gaby

On Fri, Mar 16, 2012 at 15:20, Gabriel Dos Reis <gdr@integrable-solutions.net> wrote:
I believe this part has seen very little change from the Revised Haskell 98 Report.
I was in fact looking at the Haskell 98 report at the time.
It is not clear that it is an unintended leftover. Section 2.1 that
Nothing is ever clear. This useless pedanticism being stipulated, there is no purpose to a completely overlapping category unless it is intended to relate to an earlier standard (say Haskell 1.4).
Unicode support is clearly intended. Also clearly, ASCII support is intended. However, the Report does not say what the concrete syntax of a Unicode character should be. (At least I have been unable to find it from the report.)
Maybe what needs to be pedantically specified is that the link to the Unicode standard is intended to be inclusion of that standard by reference (the [11] in the section I quoted is an endnote referencing the Unicode standard) and not merely informational. Or are you insisting we are not precise enough unless we enumerate all the Unicode characters explicitly in the Haskell standard?

--
brandon s allbery                                allbery.b@gmail.com
wandering unix systems administrator (available) (412) 475-9364 vm/sms

On Fri, Mar 16, 2012 at 3:22 PM, Brandon Allbery wrote:
On Fri, Mar 16, 2012 at 15:20, Gabriel Dos Reis wrote:
I believe this part has seen very little change from the Revised Haskell 98 Report.
I was in fact looking at the Haskell 98 report at the time.
It is not clear that it is an unintended leftover. Section 2.1 that
Nothing is ever clear. This useless pedanticism being stipulated, there is
I very much appreciate any clarification you have on the topic. However, I believe we do best when we leave phrases like "useless pedanticism" or "pedantically" out. They are rarely constructive and add no substance to an otherwise informative discussion. At best, they would distract us. (In matters of programming language definition, "pedanticism" should be the least of our worries -- and it probably should not come with a modifier such as "useless"; we should probably wear it as a badge of honor.)
no purpose to a completely overlapping category unless it is intended to relate to an earlier standard (say Haskell 1.4).
which in itself is not an unambiguous interpretation :-)
Unicode support is clearly intended. Also clearly, ASCII support is intended. However, the Report does not say what the concrete syntax of a Unicode character should be. (At least I have been unable to find it from the report.)
Maybe what needs to be pedantically specified is that the link to the Unicode standard is intended to be inclusion of that standard by reference (the [11] in the section I quoted is an endnote referencing the Unicode standard) and not merely informational. Or are you insisting we are not precise enough unless we enumerate all the Unicode characters explicitly in the Haskell standard?
Giving a link to the Unicode standard does not really help with the original questions. I know where to find the Unicode standard; that wasn't the issue. One of the underlying questions is: what is the concrete syntax of a Unicode character in a Haskell program? Note that Chapter 2 goes to great pains to specify the ASCII concrete syntax.

To put things in perspective, have a look at this specification of a language whose programs are supposed to be written using Unicode characters:

http://docs.oracle.com/javase/specs/jls/se7/html/jls-3.html#jls-3.2

-- Gaby
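
For comparison, the Java approach linked above amounts to a pre-lexing pass that rewrites \uXXXX escapes into the corresponding code point before the real lexer runs. A rough, purely hypothetical sketch of such a pass (ignoring Java's additional rules about repeated 'u's and preceding backslashes; expandUnicodeEscapes is an invented name):

    import Data.Char (chr, isHexDigit)
    import Numeric   (readHex)

    -- Hypothetical Java-style pre-pass: replace \uXXXX (exactly four hex
    -- digits) by the character it denotes, and copy everything else.
    expandUnicodeEscapes :: String -> String
    expandUnicodeEscapes ('\\':'u':rest)
      | (hex, rest') <- splitAt 4 rest
      , length hex == 4
      , all isHexDigit hex
      , [(n, "")] <- readHex hex
      = chr n : expandUnicodeEscapes rest'
    expandUnicodeEscapes (c:cs) = c : expandUnicodeEscapes cs
    expandUnicodeEscapes []     = []

As comes up later in the thread, Haskell could not adopt this spelling as-is: \u is already a perfectly good lambda abstraction over a variable named u.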

no purpose to a completely overlapping category unless it is intended to relate to an earlier standard (say Haskell 1.4).
I believe all Haskell Reports, even since 1.0, have specified that the language "uses" Unicode. If it helps to bring perspective to this discussion, it is my impression that the initial designers of Haskell did not know very much about Unicode, but wanted to avoid the trap of being stuck with ASCII-only, and so decided to reference "whatever Unicode does", as the most obvious and unambiguous way of not having to think about (or specify) these lexical issues themselves.
One of the underlying questions is: what is the concrete syntax of a Unicode character in a Haskell program? Note that Chapter 2 goes to great pains to specify the ASCII concrete syntax.
In my view, the Haskell Report is deliberately agnostic on concrete syntax for Unicode, believing that to be outside the scope of a programming language standard, whilst entirely within the scope of the Unicode standards body. Seeing as there are (in practice) numerous concrete representations of Unicode (UTF-8 and other encodings), it is largely up to individual compiler implementations which encodings they support for (a) source text, and (b) input/output at runtime. Regards, Malcolm

On Fri, Mar 16, 2012 at 6:00 PM, Malcolm Wallace wrote:
no purpose to a completely overlapping category unless it is intended to relate to an earlier standard (say Haskell 1.4).
I believe all Haskell Reports, even since 1.0, have specified that the language "uses" Unicode. If it helps to bring perspective to this discussion, it is my impression that the initial designers of Haskell did not know very much about Unicode, but wanted to avoid the trap of being stuck with ASCII-only, and so decided to reference "whatever Unicode does", as the most obvious and unambiguous way of not having to think about (or specify) these lexical issues themselves.
OK.
One of the underlying questions is: what is the concrete syntax of a Unicode character in a Haskell program? Note that Chapter 2 goes to great pains to specify the ASCII concrete syntax.
In my view, the Haskell Report is deliberately agnostic on concrete syntax for Unicode, believing that to be outside the scope of a programming language standard, whilst entirely within the scope of the Unicode standards body.
The trouble is that the Unicode standards body believes that the concrete syntax is entirely within the scope of the programming language definition (or of any client using Unicode characters), whilst largely restricting itself to talking about code points, which are more abstract. So the trick of referencing the Unicode standard is not satisfactory :-(
Seeing as there are (in practice) numerous concrete representations of Unicode (UTF-8 and other encodings), it is largely up to individual compiler implementations which encodings they support for (a) source text, and (b) input/output at runtime.
OK, thanks! I guess a takeaway from this discussion is that what counts as punctuation is far less well defined than it appears...

A common practice (exemplified by the link I gave earlier) is to restrict the concrete -syntax- of the input program to the ASCII charset, and use Unicode escape sequences to include the entire Unicode charset. It is common to use \uNNNNNN or \UNNNNNN to introduce Unicode characters, but I suspect that is out of the question for Haskell programs because it would clash with lambda abstraction.

-- Gaby
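
As an aside, the Report does already provide numeric escapes, but only inside character and string literals (section 2.6), not in identifiers or in the surrounding source text. A small illustration (the names are made up):

    -- Code points can be written numerically inside literals only:
    lambdaChar :: Char
    lambdaChar = '\x03BB'      -- GREEK SMALL LETTER LAMBDA

    senal :: String
    senal = "se\x00F1\&al"     -- "señal"; the \& merely terminates the
                               -- escape so the following 'a' is not read
                               -- as another hex digit

    -- There is no analogous escape for identifiers or operators, so a
    -- non-ASCII name has to be written with the character itself.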

Hi Gaby,

On Fri, Mar 16, 2012 at 06:29:24PM -0500, Gabriel Dos Reis wrote:
OK, thanks! I guess a takeaway from this discussion is that what counts as punctuation is far less well defined than it appears...
I'm not really sure what you're asking. Haskell's uniSymbol includes all Unicode characters (should that be codepoints? I'm not a Unicode expert) in the punctuation category; I'm not sure what the best reference is, but e.g. table 12 in http://www.unicode.org/reports/tr44/tr44-8.html#Property_Values lists a number of Px categories, and a meta-category P "Punctuation". Thanks Ian
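
Concretely, the Px classes in that table correspond to constructors of Data.Char.GeneralCategory in base, and their union (the meta-category P) is what Data.Char.isPunctuation tests. A sketch, with isPunct as an illustrative name:

    import Data.Char (GeneralCategory(..), generalCategory)

    -- TR44's Pc, Pd, Ps, Pe, Pi, Pf, Po as base's GeneralCategory values.
    punctuationCategories :: [GeneralCategory]
    punctuationCategories =
      [ ConnectorPunctuation  -- Pc, e.g. '_'
      , DashPunctuation       -- Pd, e.g. '-'
      , OpenPunctuation       -- Ps, e.g. '('
      , ClosePunctuation      -- Pe, e.g. ')'
      , InitialQuote          -- Pi
      , FinalQuote            -- Pf
      , OtherPunctuation      -- Po, e.g. '!', '"'
      ]

    -- The meta-category P: what "any Unicode punctuation" would match.
    isPunct :: Char -> Bool
    isPunct c = generalCategory c `elem` punctuationCategories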

Hello,
I am also not an expert but I got curious and did a bit of Wikipedia
reading. Based on what I understood, here are two (related) questions
that it might be nice to clarify in a future version of the report:
1. What is the alphabet used by the grammar in the Haskell report? My
understanding is that the intention is that the alphabet is unicode
codepoints (sometimes referred to as unicode characters). There is no
way to refer to specific code-points by escaping as in Java (the link
that Gaby shared), you just have to write the code-points directly
(and there are plenty of encodings for doing that, e.g. UTF-8 etc.)
2. Do we respect "unicode equivalence"
(http://en.wikipedia.org/wiki/Canonical_equivalence) in Haskell source
code? The issue here is that, apparently, some sequences of unicode
code points/characters are supposed to be morally the same. For
example, it would appear that there are two different ways to write
the Spanish letter ñ: it has its own number, but it can also be made
by writing "n" followed by a modifier to put the wavy sign on top.
I would guess that implementing "unicode equivalence" would not be
too hard---supposedly the unicode standard specifies a "text
normalization procedure". However, this would complicate the report
specification, because now the alphabet becomes not just unicode
code-points, but equivalence classes of code points.
Thoughts?
-Iavor
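
The ñ example can be made concrete without any normalization machinery: the two spellings render identically but are different sequences of code points, so under the status quo they are simply different strings (and would be different identifiers). A self-contained illustration using only base:

    import Data.Char (ord)

    precomposed, decomposed :: String
    precomposed = "se\x00F1\&al"   -- U+00F1 LATIN SMALL LETTER N WITH TILDE
    decomposed  = "sen\x0303\&al"  -- 'n' followed by U+0303 COMBINING TILDE

    main :: IO ()
    main = do
      print (precomposed == decomposed)  -- False: compared code point by code point
      print (map ord precomposed)        -- [115,101,241,97,108]
      print (map ord decomposed)         -- [115,101,110,771,97,108]
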
On Fri, Mar 16, 2012 at 4:49 PM, Ian Lynagh wrote:
Hi Gaby,
On Fri, Mar 16, 2012 at 06:29:24PM -0500, Gabriel Dos Reis wrote:
OK, thanks! I guess a takeaway from this discussion is that what counts as punctuation is far less well defined than it appears...
I'm not really sure what you're asking. Haskell's uniSymbol includes all Unicode characters (should that be codepoints? I'm not a Unicode expert) in the punctuation category; I'm not sure what the best reference is, but e.g. table 12 in http://www.unicode.org/reports/tr44/tr44-8.html#Property_Values lists a number of Px categories, and a meta-category P "Punctuation".
Thanks Ian

Iavor> report? My understanding is that the intention is that the
Iavor> alphabet is unicode codepoints (sometimes referred to as
Iavor> unicode characters).

Unicode characters are not the same as Unicode codepoints. What we want is Unicode characters. We don't want to be able to write a Unicode codepoint, as that would permit writing half of a surrogate pair, which is malformed Unicode.

--
Colin Adams
Preston Lancashire
()  ascii ribbon campaign - against html e-mail
/\  www.asciiribbon.org   - against proprietary attachments

Hello,
So I looked at what GHC does with Unicode and to me it seems quite
reasonable:
* The alphabet is Unicode code points, so a valid Haskell program is
simply a list of those.
* Combining characters are not allowed in identifiers, so no need for
complex normalization rules: programs should always use the "short"
version of a character, or be rejected.
* Combining characters may appear in string literals, and there they
are left "as is" without any modification (so some string literals may
be longer than what's displayed in a text editor.)
Perhaps this is simply what the report already states (I haven't
checked, for which I apologize) but, if not, perhaps we should clarify
things.
-Iavor
PS: I don't think that there is any need to specify a particular
representation for the unicode code-points (e.g., utf-8 etc.) in the
language standard.
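
A rough sketch of the classification described above, for illustration only (this is not GHC's lexer, and the names isIdentStart/isIdentChar are invented): identifier characters are letters, digits, underscore and apostrophe, and combining marks are refused rather than normalized, so only the precomposed form of an accented letter can appear in a name.

    import Data.Char
      ( GeneralCategory(..), generalCategory, isAlpha, isAlphaNum )

    -- Mn, Mc, Me: the combining-mark categories.
    isCombining :: Char -> Bool
    isCombining c = generalCategory c `elem`
      [NonSpacingMark, SpacingCombiningMark, EnclosingMark]

    -- Illustrative identifier-character tests.  Combining marks are already
    -- excluded by isAlpha/isAlphaNum; the explicit check just documents the
    -- intent that no normalization is attempted.
    isIdentStart, isIdentChar :: Char -> Bool
    isIdentStart c = (isAlpha c || c == '_') && not (isCombining c)
    isIdentChar  c = (isAlphaNum c || c `elem` "_'") && not (isCombining c)
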
On Fri, Mar 16, 2012 at 6:23 PM, Iavor Diatchki wrote:
Hello, I am also not an expert but I got curious and did a bit of Wikipedia reading. Based on what I understood, here are two (related) questions that it might be nice to clarify in a future version of the report:
1. What is the alphabet used by the grammar in the Haskell report? My understanding is that the intention is that the alphabet is unicode codepoints (sometimes referred to as unicode characters). There is no way to refer to specific code-points by escaping as in Java (the link that Gaby shared), you just have to write the code-points directly (and there are plenty of encodings for doing that, e.g. UTF-8 etc.)
2. Do we respect "unicode equivalence" (http://en.wikipedia.org/wiki/Canonical_equivalence) in Haskell source code? The issue here is that, apparently, some sequences of unicode code points/characters are supposed to be morally the same. For example, it would appear that there are two different ways to write the Spanish letter ñ: it has its own number, but it can also be made by writing "n" followed by a modifier to put the wavy sign on top.
I would guess that implementing "unicode equivalence" would not be too hard---supposedly the unicode standard specifies a "text normalization procedure". However, this would complicate the report specification, because now the alphabet becomes not just unicode code-points, but equivalence classes of code points.
Thoughts?
-Iavor
On Fri, Mar 16, 2012 at 4:49 PM, Ian Lynagh wrote:
Hi Gaby,
On Fri, Mar 16, 2012 at 06:29:24PM -0500, Gabriel Dos Reis wrote:
OK, thanks! I guess a takeaway from this discussion is that what counts as punctuation is far less well defined than it appears...
I'm not really sure what you're asking. Haskell's uniSymbol includes all Unicode characters (should that be codepoints? I'm not a Unicode expert) in the punctuation category; I'm not sure what the best reference is, but e.g. table 12 in http://www.unicode.org/reports/tr44/tr44-8.html#Property_Values lists a number of Px categories, and a meta-category P "Punctuation".
Thanks Ian

On Tue, Mar 20, 2012 at 5:37 PM, Iavor Diatchki wrote:
Hello,
So I looked at what GHC does with Unicode and to me it seems quite reasonable:
* The alphabet is Unicode code points, so a valid Haskell program is simply a list of those. * Combining characters are not allowed in identifiers, so no need for complex normalization rules: programs should always use the "short" version of a character, or be rejected. * Combining characters may appear in string literals, and there they are left "as is" without any modification (so some string literals may be longer than what's displayed in a text editor.)
Perhaps this is simply what the report already states (I haven't checked, for which I apologize) but, if not, perhaps we should clarify things.
-Iavor PS: I don't think that there is any need to specify a particular representation for the unicode code-points (e.g., utf-8 etc.) in the language standard.
Thanks Iavor. If the report intended to talk about code points only (and indeed ruling out normalization suggests that), then the Report needs to be clarified. As you know, there is a distinction between a Unicode code point and a Unicode character; see

http://www.unicode.org/versions/Unicode6.0.0/ch02.pdf#G25564

Until I sent my original query, I had been reading the Report as meaning Unicode characters (as the grammar seemed to suggest), but now it is clear to me that only code points were intended. That seemed to be confirmed by your investigation of the GHC code base.

-- Gaby

On Fri, Mar 16, 2012 at 6:49 PM, Ian Lynagh wrote:
Hi Gaby,
On Fri, Mar 16, 2012 at 06:29:24PM -0500, Gabriel Dos Reis wrote:
OK, thanks! I guess a takeaway from this discussion is that what counts as punctuation is far less well defined than it appears...
I'm not really sure what you're asking. Haskell's uniSymbol includes all Unicode characters (should that be codepoints? I'm not a Unicode expert) in the punctuation category; I'm not sure what the best reference is, but e.g. table 12 in http://www.unicode.org/reports/tr44/tr44-8.html#Property_Values lists a number of Px categories, and a meta-category P "Punctuation".
Thanks Ian
Hi Ian,

I guess what I am asking was partly summarized in Iavor's message.

For me, the issue started with bullet number 4 in section 1.1

http://www.haskell.org/onlinereport/intro.html#sect1.1

which states that:

The lexical structure captures the concrete representation of Haskell programs in text files.

That combined with the opening section 2.1 (e.g. example of terminal syntax) and the fact that the grammar routinely described two non-terminals ascXXX (for ASCII characters) and uniXXX (for Unicode characters) suggested that the concrete syntax of Haskell programs in text files is in ASCII charset. Note this does not conflict with the general statement that Haskell programs use the Unicode character set, because the uniXXX could use the ASCII charset to introduce Unicode characters -- this is not uncommon practice for programming languages using Unicode characters; see the link I gave earlier.

However, if I understand Malcolm's message correctly, this is not the case. Contrary to what I quoted above, Chapter 2 does NOT specify the concrete representation of Haskell programs in text files. What it does is to capture the structure of what is obtained from interpreting, *in some unspecified encoding or unspecified alphabet*, the concrete representation of Haskell programs in text files. This conclusion is unfortunate, but I believe it is correct. Since the encoding or the alphabet is unspecified, it is no longer necessarily the case that two Haskell implementations would agree on the same lexical interpretation when presented with the same exact text file containing a Haskell program.

In its current form, you are correct that the Report should say "codepoint" instead of characters.

I join Iavor's request in clarifying the alphabet used in the grammar.

Thanks,
-- Gaby

On Fri, Mar 16, 2012 at 6:49 PM, Ian Lynagh wrote:
Hi Gaby,
On Fri, Mar 16, 2012 at 06:29:24PM -0500, Gabriel Dos Reis wrote:
OK, thanks! I guess a takeaway from this discussion is that what counts as punctuation is far less well defined than it appears...
I'm not really sure what you're asking. Haskell's uniSymbol includes all Unicode characters (should that be codepoints? I'm not a Unicode expert) in the punctuation category; I'm not sure what the best reference is, but e.g. table 12 in http://www.unicode.org/reports/tr44/tr44-8.html#Property_Values lists a number of Px categories, and a meta-category P "Punctuation".
Thanks Ian
Hi Ian,
I guess what I am asking was partly summarized in Iavor's message.
For me, the issue started with bullet number 4 in section 1.1
http://www.haskell.org/onlinereport/intro.html#sect1.1
which states that:
The lexical structure captures the concrete representation of Haskell programs in text files.
That combined with the opening section 2.1 (e.g. example of terminal syntax) and the fact that the grammar routinely described two non-terminals ascXXX (for ASCII characters) and uniXXX (for Unicode characters) suggested that the concrete syntax of Haskell programs in text files is in ASCII charset. Note this does not conflict with the general statement that Haskell programs use the Unicode character set, because the uniXXX could use the ASCII charset to introduce Unicode characters -- this is not uncommon practice for programming languages using Unicode characters; see the link I gave earlier.
However, if I understand Malcolm's message correctly, this is not the case. Contrary to what I quoted above, Chapter 2 does NOT specify the concrete representation of Haskell programs in text files. What it does is to capture the structure of what is obtained from interpreting, *in some unspecified encoding or unspecified alphabet*, the concrete representation of Haskell programs in text files. This conclusion is unfortunate, but I believe it is correct. Since the encoding or the alphabet is unspecified, it is no longer necessarily the case that two Haskell implementations would agree on the same lexical interpretation when presented with the same exact text file containing a Haskell program.
In its current form, you are correct that the Report should say "codepoint" instead of characters.
I join Iavor's request in clarifying the alphabet used in the grammar.
The report gives meaning to a sequence of codepoints only; it says nothing about how that sequence of codepoints is represented as a string of bytes in a file, nor does it say anything about what those files are called, or even whether there are files at all.

Perhaps some clarification is in order in a future revision, and we should use the correct terminology where appropriate. We should also clarify that "punctuation" means exactly the Punctuation class.

With regards to normalisation and equivalence, my understanding is that Haskell does not support either: two identifiers are equal if and only if they are represented by the same sequence of codepoints. Again, we could add a clarifying sentence to the report.

Cheers,
Simon

On Mon, Mar 19, 2012 at 4:34 AM, Simon Marlow wrote:
On Fri, Mar 16, 2012 at 6:49 PM, Ian Lynagh wrote:
Hi Gaby,
On Fri, Mar 16, 2012 at 06:29:24PM -0500, Gabriel Dos Reis wrote:
OK, thanks! I guess a takeaway from this discussion is that what counts as punctuation is far less well defined than it appears...
I'm not really sure what you're asking. Haskell's uniSymbol includes all Unicode characters (should that be codepoints? I'm not a Unicode expert) in the punctuation category; I'm not sure what the best reference is, but e.g. table 12 in http://www.unicode.org/reports/tr44/tr44-8.html#Property_Values lists a number of Px categories, and a meta-category P "Punctuation".
Thanks Ian
Hi Ian,
I guess what I am asking was partly summarized in Iavor's message.
For me, the issue started with bullet number 4 in section 1.1
http://www.haskell.org/onlinereport/intro.html#sect1.1
which states that:
The lexical structure captures the concrete representation of Haskell programs in text files.
That combined with the opening section 2.1 (e.g. example of terminal syntax) and the fact that the grammar routinely described two non-terminals ascXXX (for ASCII characters) and uniXXX (for Unicode characters) suggested that the concrete syntax of Haskell programs in text files is in ASCII charset. Note this does not conflict with the general statement that Haskell programs use the Unicode character set, because the uniXXX could use the ASCII charset to introduce Unicode characters -- this is not uncommon practice for programming languages using Unicode characters; see the link I gave earlier.
However, if I understand Malcolm's message correctly, this is not the case. Contrary to what I quoted above, Chapter 2 does NOT specify the concrete representation of Haskell programs in text files. What it does is to capture the structure of what is obtained from interpreting, *in some unspecified encoding or unspecified alphabet*, the concrete representation of Haskell programs in text files. This conclusion is unfortunate, but I believe it is correct. Since the encoding or the alphabet is unspecified, it is no longer necessarily the case that two Haskell implementations would agree on the same lexical interpretation when presented with the same exact text file containing a Haskell program.
In its current form, you are correct that the Report should say "codepoint" instead of characters.
I join Iavor's request in clarifying the alphabet used in the grammar.
The report gives meaning to a sequence of codepoints only; it says nothing about how that sequence of codepoints is represented as a string of bytes in a file, nor does it say anything about what those files are called, or even whether there are files at all.
Thanks, Simon.

The fact that the Report is silent about the encoding used to represent concrete Haskell programs in text files adds a certain level of non-portability (and confusion). I found last night that a proposal has been made to add some support for encoding specification:

http://hackage.haskell.org/trac/haskell-prime/wiki/UnicodeInHaskellSource

I believe that is a good start. What are the odds of it being considered for Haskell 2012?

I suspect the pragma proposal works only if something is said about the position of that pragma in the source file (e.g. it must be on the first line, or within the first N bytes of the source file); otherwise we have an infinite descent.
Perhaps some clarification is in order in a future revision, and we should use the correct terminology where appropriate. We should also clarify that "punctuation" means exactly the Punctuation class.
That would be great. Do you have any comment about the UnicodeInHaskellSource proposal?
With regards to normalisation and equivalence, my understanding is that Haskell does not support either: two identifiers are equal if and only if they are represented by the same sequence of codepoints. Again, we could add a clarifying sentence to the report.
Ugh. Writing a parser for Haskell was an interesting exercise :-) -- Gaby

On Mon, Mar 19, 2012 at 05:56, Gabriel Dos Reis <gdr@integrable-solutions.net> wrote:
The fact that the Report is silent about the encoding used to represent concrete Haskell programs in text files adds a certain level of non-portability (and confusion). I found
Specifying the encoding can *also* limit portability, if you specify an encoding that is not widely supported on some target platform. (Please try to remember that the universe is not composed solely of Windows and Linux. The fact that those are the only ones you care about is not relevant to the standard; nor is the list of platforms that GHC or any other implementation supports.)

Encoding does not belong in the language standard; it is an aspect of implementing the language standard on a given platform.

--
brandon s allbery                                allbery.b@gmail.com
wandering unix systems administrator (available) (412) 475-9364 vm/sms

On Mon, Mar 19, 2012 at 5:36 AM, Brandon Allbery wrote:
On Mon, Mar 19, 2012 at 05:56, Gabriel Dos Reis wrote:
The fact that the Report is silent about the encoding used to represent concrete Haskell programs in text files adds a certain level of non-portability (and confusion). I found
Specifying the encoding can *also* limit portability, if you specify an encoding that is not widely supported on some target platform.
That is why I find the pragma suggestion attractive. -- Gaby
participants (7):
- Brandon Allbery
- Colin Paul Adams
- Gabriel Dos Reis
- Ian Lynagh
- Iavor Diatchki
- Malcolm Wallace
- Simon Marlow