RE: [Haskell-i18n] Unicode in source

On Wed, 2002-08-21 at 12:02, Simon Marlow wrote:
Apparently, this isn't quite supported by GHC:
Prelude> map Char.ord "\74\749\7490"
[74,237,66]
which is, of course, the values modulo 256.
I think you've found a bug. [...]
Oh, oops. :)
(aside: aren't there problems with Unicode not being a fixed-width character set? Some characters are expected to combine with others to form a glyph, there are multiple versions of some characters with different widths, there are several widths of space, etc.)
I think (...) these issues should not pose a problem.

variable-width characters: Unicode specifically doesn't say anything about the glyph representation of the characters. So it is reasonable to assume there will be fixed-width unicode character sets. Remember that even our latin alphabet has characters of different width (i vs. w) which we just somehow manage to fit into glyphs of the same width. If one's editor would really use a variable-width font he'll already have the problem with ASCII.

composition characters: I think we should interpret each character in the source as exactly one and leave any possible composition to the level of editing tools. The way I imagine the use of these composition characters is, for instance, as keyboard input to an editor which then composes them into a single char before writing anything to a file. I'd say this issue belongs to the domain of text processing.

Regards,
Sven Moritz
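As a hedged illustration of the truncation reported in the GHCi session above, the sketch below reproduces the printed values by reducing each code point modulo 256. The names `truncateTo8Bits`, `sample`, `expectedOrds` and `truncatedOrds` are made up for this example, and the modulo model is an assumption about the symptom, not a description of GHC's internals.

```haskell
import Data.Char (chr, ord)

-- Assumption: the reported behaviour is modelled as reducing each code
-- point modulo 256. This mirrors the symptom only; it does not claim to
-- show how GHC handled characters internally.
truncateTo8Bits :: Char -> Char
truncateTo8Bits c = chr (ord c `mod` 256)

-- The three characters from the session: code points 74, 749 and 7490.
sample :: String
sample = "\74\749\7490"

-- What a Unicode-aware implementation is expected to return.
expectedOrds :: [Int]
expectedOrds = map ord sample

-- What the session printed, reproduced by truncating first.
truncatedOrds :: [Int]
truncatedOrds = map (ord . truncateTo8Bits) sample
```

Here `expectedOrds` is [74,749,7490], while `truncatedOrds` is [74,237,66], matching the output quoted above.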

Sven Moritz Hallberg wrote:
(aside: aren't there problems with Unicode not being a fixed-width character set? Some characters are expected to combine with others to form a glyph, there are multiple versions of some characters with different widths, there are several widths of space, etc.)
I think (...) these issues should not pose a problem.
variable-width characters: Unicode specifically doesn't say anything about the glyph representation of the characters. So it is reasonable to assume there will be fixed-width unicode character sets. Remember that even our latin alphabet has characters of different width (i vs. w) which we just somehow manage to fit into glyphs of the same width. If one's editor would really use a variable-width font he'll already have the problem with ASCII.
For fonts which aren't restricted to Western alphabets, there are two common interpretations of "fixed width".

One interpretation is that all glyphs are exactly the same width, so even "narrow" characters ("l", "i", "1") are as wide as the widest CJK characters. Many users will dislike such fonts; apart from looking rather odd, they also waste screen space.

The other interpretation is that all glyphs have widths which are an integral number of "columns". Western (latin, cyrillic, Greek) characters are a single column wide, while CJK characters are typically two columns wide. The (Unix98) wcwidth() function can be used to obtain the width (in columns) of a given wide character (wchar_t) in the current locale.
composition characters: I think we should interpret each character in the source as exactly one and leave any possible composition to the level of editing tools. The way I imagine the use of these composition characters is, for instance, as keyboard input to an editor which then composes them into a single char before writing anything to a file. I'd say this issue belongs to the domain of text processing.
Character I/O functions should probably ignore composition, i.e.
LATIN_SMALL_LETTER_A + COMBINING_ACUTE_ACCENT should appear as two
separate characters to the application.
However, layout will only "work" if the compiler (or is it a
preprocessor?) uses the same algorithm as the editor. If the editor
shows a composition sequence as a single character cell, it needs to
be treated as a single column for the purposes of layout.
--
Glynn Clements

On Wed, 2002-08-21 at 23:55, Glynn Clements wrote:
The other interpretation is that all glyphs have widths which are an integral number of "columns". Western (latin, cyrillic, Greek) characters are a single column wide, while CJK characters are typically two columns wide. The (Unix98) wcwidth() function can be used to obtain the width (in columns) of a given wide character (wchar_t) in the current locale.
I see, I wasn't aware of this, thanks for pointing it out. In this case, we should get some way of obtaining the width in columns of a Char in Haskell and let the layout rule talk about columns, correct?
Character I/O functions should probably ignore composition, i.e. LATIN_SMALL_LETTER_A + COMBINING_ACUTE_ACCENT should appear as two separate characters to the application.
However, layout will only "work" if the compiler (or is it a preprocessor?) uses the same algorithm as the editor. If the editor shows a composition sequence as a single character cell, it needs to be treated as a single column for the purposes of layout.
Can the composition characters stand alone at all? If there's no (strong enough) reason to believe they will ever be meant to count as an extra column in the layout rule, we just have to decide whether we want to require compilers to recognize them.

Ashley, do your property tools include something that can handle composition?

Regards,
Sven Moritz
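A minimal sketch of what "the width in columns of a Char" could look like in Haskell follows. Everything here is an assumption made for illustration: `charColumns`, `stringColumns` and `wideRanges` are hypothetical names, the range table is deliberately incomplete, and treating combining marks as zero columns is just one possible answer to the question above. Only `generalCategory` and `ord` from Data.Char are real library functions.

```haskell
import Data.Char (GeneralCategory (..), generalCategory, ord)

-- Hypothetical, incomplete table of code point ranges treated as two
-- columns wide. A real implementation would derive this from Unicode data.
wideRanges :: [(Int, Int)]
wideRanges =
  [ (0x1100, 0x115F) -- Hangul Jamo initial consonants
  , (0x3040, 0x30FF) -- Hiragana and Katakana
  , (0x4E00, 0x9FFF) -- CJK Unified Ideographs
  , (0xAC00, 0xD7A3) -- Hangul syllables
  , (0xFF01, 0xFF60) -- Fullwidth forms
  ]

-- Assumed policy: combining marks take no column of their own, characters
-- in wideRanges take two columns, everything else takes one.
charColumns :: Char -> Int
charColumns c
  | isCombining = 0
  | isWide = 2
  | otherwise = 1
  where
    isCombining = generalCategory c `elem` [NonSpacingMark, EnclosingMark]
    isWide = any (\(lo, hi) -> lo <= ord c && ord c <= hi) wideRanges

-- Total width of a string under the same assumed policy (tabs not handled).
stringColumns :: String -> Int
stringColumns = sum . map charColumns
```

Under this policy, LATIN SMALL LETTER A followed by COMBINING ACUTE ACCENT is two Chars to the application but one column to the layout rule, which is the behaviour an editor showing the sequence in a single cell would need.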

On Wed, 2002-08-21 at 23:55, Glynn Clements wrote:
The other interpretation is that all glyphs have widths which are an integral number of "columns". Western (latin, cyrillic, Greek) characters are a single column wide, while CJK characters are typically two columns wide. The (Unix98) wcwidth() function can be used to obtain the width (in columns) of a given wide character (wchar_t) in the current locale.
... (wchar_t) in the current locale.
              ^^^^^^^^^^^^^^^^^^^^^
Can you clarify whether this is because the mapping of wchar_ts onto Unicode code points depends on the locale or whether the width of a Unicode code point depends on the locale?

[I'm trying to understand whether we can be sure that the number of columns is portable so that different machines will apply the layout rule consistently (assuming the translation into UTF8 is consistent).]

--
Alastair Reid alastair@reid-consulting-uk.ltd.uk
Reid Consulting (UK) Limited
http://www.reid-consulting-uk.ltd.uk/alastair/

Alastair Reid wrote:
The other interpretation is that all glyphs have widths which are an integral number of "columns". Western (latin, cyrillic, Greek) characters are a single column wide, while CJK characters are typically two columns wide. The (Unix98) wcwidth() function can be used to obtain the width (in columns) of a given wide character (wchar_t) in the current locale.
... (wchar_t) in the current locale.
              ^^^^^^^^^^^^^^^^^^^^^
Can you clarify whether this is because the mapping of wchar_ts onto Unicode code points depends on the locale or whether the width of a Unicode code point depends on the locale?
It's basically the former, although Unicode doesn't come into it
directly. The locale (specifically, the LC_CTYPE category) determines
the character encoding, and hence the "meaning" of any given wchar_t
(or char) value.
--
Glynn Clements

Sven Moritz Hallberg wrote:
The other interpretation is that all glyphs have widths which are an integral number of "columns". Western (latin, cyrillic, Greek) characters are a single column wide, while CJK characters are typically two columns wide. The (Unix98) wcwidth() function can be used to obtain the width (in columns) of a given wide character (wchar_t) in the current locale.
I see, I wasn't aware of this, thanks for pointing it out. In this case, we should get some way of obtaining the width in columns of a Char in Haskell and let the layout rule talk about columns, correct?
I would think so. Although it might be preferable to simply require
line breaks, so that you only need to deal with spaces.
My suspicion is that the existing layout rules were decided with an
implicit assumption of "one character equals one column". If that
ceases to be the case, maybe the decision should be revisited.
--
Glynn Clements

The other interpretation is that all glyphs have widths which are an integral number of "columns". Western (latin, cyrillic, Greek) characters are a single column wide, while CJK characters are typically two columns wide. The (Unix98) wcwidth() function can be used to obtain the width (in columns) of a given wide character (wchar_t) in the current locale.
I see, I wasn't aware of this, thanks for pointing it out. In this case, we should get some way of obtaining the width in columns of a Char in Haskell and let the layout rule talk about columns, correct?
I would think so. Although it might be preferable to simply require line breaks, so that you only need to deal with spaces.
My suspicion is that the existing layout rules were decided with an implicit assumption of "one character equals one column". If that ceases to be the case, maybe the decision should be revisited.
Allowing characters to span more than one column wouldn't break the layout rule, as long as the character to column mapping is generally agreed upon across editors and locales. (I think we established that this is not necessarily always the case, although in practice it should be).

Requiring a newline before a new layout context would break *a lot* of code. You can't write 'let x = 42 in x + 1' for example. Sure, a refinement could be made to allow these kind of things, but this will serve to make the layout rule more complex, rather than less.

So to extend gracefully while keeping backwards compatibility, I propose:

- There be a fixed character->column mapping
- Tab stops are every 8 columns
- We recommend that programmers avoid using indentation levels which depend on the widths of non-space characters.

Obeying the third requirement means that your code will look fine in a proportional font. The compiler could warn about violations quite easily. Note that it is ok to write 'let x = 42 in x + 1', because the meaning of the code doesn't depend on the actual indentation level of the first 'x'.

Cheers,
Simon
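The first two points of the proposal can be sketched as a small column calculation. This is only an illustration under stated assumptions: `tabWidth`, `advance`, `columnAfter` and `oneColumnEach` are hypothetical names, columns are counted from zero here for simplicity, and the character->column mapping is passed in as a parameter because the thread does not fix one.

```haskell
-- Tab stops every 8 columns, as in the proposal above.
tabWidth :: Int
tabWidth = 8

-- Advance a zero-based column past one character. A tab moves to the next
-- tab stop; any other character moves by whatever the supplied mapping says.
advance :: (Char -> Int) -> Int -> Char -> Int
advance _ col '\t' = (col `div` tabWidth + 1) * tabWidth
advance width col c = col + width c

-- Zero-based column reached after the whole string, starting at column 0.
columnAfter :: (Char -> Int) -> String -> Int
columnAfter width = foldl (advance width) 0

-- Placeholder mapping for the traditional "one character, one column" case.
oneColumnEach :: Char -> Int
oneColumnEach = const 1
```

For instance, `columnAfter oneColumnEach "ab\tc"` gives 9: two columns for "ab", a jump to the tab stop at 8, then one more for "c". Supplying a different mapping changes the result without touching the tab logic, which is the point of keeping the mapping fixed and separate.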

Simon Marlow wrote:
Requiring a newline before a new layout context would break *a lot* of code. You can't write 'let x = 42 in x + 1' for example.
(I'm not sure this is a good example, as 'in' is (usually?) followed by a single expression. But one could write 'x + 1 where x = 42', and the point is still there.)
Sure, a refinement could be made to allow these kind of things,
Couldn't it, though? This looks very similar to Python's

    if x==y: foo x

or, the alternative multi-statement case

    if x==y:
        foo x
        bar y

AFAIK, there's no column-counting middle ground akin to

    if x==y: foo x
             bar y

I think it would be a good idea to *allow* the current rules, but *recommend* that blocks (of more than one line) are made up by lines indented with whitespace, and whitespace only. Ideally, I'd like a compiler warning for this, but at least we should warn when multi-(or unknown-)column characters may affect indentation levels?

-kzm
--
If I haven't seen further, it is by standing in the footprints of giants
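A rough sketch of the kind of warning discussed above follows. It is a heuristic under loud assumptions, not how any compiler does it: it splits lines on whitespace instead of lexing Haskell, counts one column per Char, ignores tabs, comments and string literals, and only compares each line with the one directly before it. All names (`tokensWithOffsets`, `layoutKeywords`, `blockOpeners`, `indentOf`, `widthDependentLines`) are invented for the example.

```haskell
import Data.Char (isSpace)

-- Split a line into whitespace-separated tokens, each paired with its
-- zero-based offset (counted in Chars, so one column per Char is assumed).
tokensWithOffsets :: String -> [(Int, String)]
tokensWithOffsets = go 0
  where
    go _ [] = []
    go n s@(c : cs)
      | isSpace c = go (n + 1) cs
      | otherwise =
          let (tok, rest) = break isSpace s
           in (n, tok) : go (n + length tok) rest

-- Keywords that open an implicit layout block.
layoutKeywords :: [String]
layoutKeywords = ["of", "let", "where", "do"]

-- Offsets at which a block opens in the middle of a line: the token that
-- directly follows a layout keyword, unless it is an explicit brace.
blockOpeners :: String -> [Int]
blockOpeners line =
  [ off
  | ((_, kw), (off, tok)) <- zip toks (drop 1 toks)
  , kw `elem` layoutKeywords
  , tok /= "{"
  ]
  where
    toks = tokensWithOffsets line

-- Number of leading spaces (tabs are deliberately not handled).
indentOf :: String -> Int
indentOf = length . takeWhile (== ' ')

-- 1-based numbers of lines whose indentation lines up with a block opened
-- mid-line on the preceding line, i.e. whose meaning depends on the width
-- of the non-space characters in front of that opener.
widthDependentLines :: [String] -> [Int]
widthDependentLines ls =
  [ n
  | (n, prev, cur) <- zip3 [2 ..] ls (drop 1 ls)
  , not (all isSpace cur)
  , indentOf cur `elem` blockOpeners prev
  ]
```

A line is reported only when its indentation matches the column of a token following `of`, `let`, `where` or `do` on the previous line; a block whose first item starts on a fresh, whitespace-indented line is never reported. Blocks longer than two lines and alternatives prefixed by comments would need a proper lexer to catch.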

I think it would be a good idea to *allow* the current rules, but *recommend* that blocks (of more than one line) are made up by lines indented with whitespace, and whitespace only. Ideally, I'd like a compiler warning for this, but at least we should warn when multi-(or unknown-)column characters may affect indentation levels?
erm... isn't that exactly what I suggested?

- We recommend that programmers avoid using indentation levels which depend on the widths of non-space characters.

Ok, I'll re-word it to be a bit more precise:

- We recommend that programmers avoid writing code whose syntactic interpretation depends on the widths of non-space characters.

which amounts to not writing code like:

    case x of True  -> e1
              False -> e2

or

    case x of
      {- case 1 -} True  -> e1
      {- case 2 -} False -> e2

(note that simply requiring a newline before the first declaration wouldn't outlaw the second case, you would also have to require that the line begins with whitespace only).

Cheers,
Simon
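To make the recommendation concrete, here is a small compilable sketch with the discouraged and the recommended layout side by side. The function names `aligned` and `indented` and the string results are placeholders chosen for the example; both definitions are meant to behave identically.

```haskell
-- Discouraged: the second alternative must line up with "True", so its
-- indentation depends on the width of the text "case x of " before it.
aligned :: Bool -> String
aligned x =
  case x of True  -> "e1"
            False -> "e2"

-- Recommended: the block starts on a fresh line and every alternative is
-- preceded by whitespace only, so no non-space character width matters.
indented :: Bool -> String
indented x =
  case x of
    True  -> "e1"
    False -> "e2"
```

Only the first form can change meaning, or stop parsing, if an editor and a compiler disagree about how many columns the characters before `True` occupy.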
participants (5)
- Alastair Reid
- Glynn Clements
- ketil@ii.uib.no
- Simon Marlow
- Sven Moritz Hallberg