Re: [Haskell-i18n] unicode notation \uhhhh implementation

At 2002-08-16 00:46, Ketil Z. Malde wrote:
Personally, I think I'd prefer the identifiers to be renamed to something more sensible than \uXXXX. Would it be possible to use an escape character and the *name* of the glyph/symbol instead?
Sure, but bear in mind Unicode names for characters are quite long, for instance GREEK SMALL LETTER THETA
I'm also not very happy with backslash for escaping (outside of strings/char constants), since it's used for lambda (somebody pointed out this and other problems in a previous mail).
Right, but whatever it is it really should be an ASCII character: the point is to allow representation of all identifiers from 7-bit ASCII.

--
Ashley Yakeley, Seattle WA

Ashley Yakeley
Sure, but bear in mind Unicode names for characters are quite long, for instance
GREEK SMALL LETTER THETA
Hmm...yes. My personal preference would be something close to (La)TeX. Although it is perhaps a bit niche, it *is* a standard lhs style (and one which I quite like, too). Would we need to maintain the list manually, then? Perhaps we could standardise Unicode names, but additionally maintain short synonyms for Greek letters and similar mathematical symbols, which I suspect are rather commonly used?
Right, but whatever it is it really should be an ASCII character: the point is to allow representation of all identifiers from 7-bit ASCII.
What's available, really? "~!?$%.,^:;" are taken, along with quotes, numerical symbols and parens. Are '#' and '&' still free? Candidates I can think of might be:

1 &alpha -> similar to HTML entities
2 #alpha -> possible problems with C preprocessor?
3 _alpha -> this is an existing identifier, but would be consistent
4 {alpha} -> also has a meaning already, even if problems should be rather rare
5 {\alpha} -> TeX'y - is it meaningful H98?

Okay, shoot 'em down!

-kzm
--
If I haven't seen further, it is by standing in the footprints of giants
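As a rough illustration of candidate 1 combined with a manually maintained list of short synonyms, here is a minimal sketch of a preprocessing pass. Nothing here comes from the thread beyond the `&alpha` notation itself: the table `synonyms`, its four entries, and the function name `expandSynonyms` are all hypothetical, and the sketch makes no attempt to skip string literals or comments.

```haskell
import Data.Char (isAsciiLower)

-- Hypothetical, hand-maintained table of short (TeX-like) synonyms.
-- A real list would be much longer; these entries are only examples.
synonyms :: [(String, Char)]
synonyms =
  [ ("alpha",  '\x03B1')
  , ("beta",   '\x03B2')
  , ("theta",  '\x03B8')
  , ("lambda", '\x03BB')
  ]

-- Replace every "&name" whose name is in the table by the character it
-- stands for. Unknown names and bare '&' characters are left untouched,
-- so an operator such as (&&) survives the pass.
expandSynonyms :: String -> String
expandSynonyms ('&' : rest) =
  let (name, after) = span isAsciiLower rest
  in case lookup name synonyms of
       Just c  -> c : expandSynonyms after
       Nothing -> '&' : expandSynonyms rest
expandSynonyms (c : cs) = c : expandSynonyms cs
expandSynonyms []       = []
```

With this table, `expandSynonyms "case &theta of"` yields the same text with `&theta` replaced by the single character U+03B8, while `"a && b"` passes through unchanged. The sketch also makes the maintenance question concrete: every symbol someone wants needs an entry in the table.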

On Fri, 2002-08-16 at 10:26, Ketil Z. Malde wrote:
Ashley Yakeley writes:
Sure, but bear in mind Unicode names for characters are quite long, for instance
GREEK SMALL LETTER THETA
Hmm...yes. My personal preference would be something close to (La)TeX. Although it is perhaps a bit niche, it *is* a standard lhs style (and one which I quite like, too).
I think there would be no harm in having the TeX names for convenience.
Would we need to maintain the list manually, then? Perhaps we could standardise Unicode names, but additionally maintain short synonyms for Greek letters and similar mathematical symbols, which I suspect are rather commonly used?
Would allowing the full Unicode names give an advantage? Something like GREEK_SMALL_LETTER_THETA is almost half a line and might do more harm to the code readability than \uhhhh.
Right, but whatever it is it really should be an ASCII character: the point is to allow representation of all identifiers from 7-bit ASCII.
What's available, really? "~!?$%.,^:;" are taken, along with quotes, numerical symbols and parens. Are '#' and '&' still free?
Candidates I can think of might be:
1 &alpha -> similar to HTML entities
2 #alpha -> possible problems with C preprocessor?
How about #uhhhh? There is no C preprocessor directive like that, so it should be safe to run the unicode-preproc before cpp. The only thing is that GHC uses # in identifiers and pragmas, as far as I can see. Can someone comment?

Sven Moritz

Sven Moritz Hallberg
Would allowing the full Unicode names give an advantage? Something like GREEK_SMALL_LETTER_THETA is almost half a line and might do more harm to the code readability than \uhhhh.
Well, it depends, I suppose. I'm more likely to be able to remember that '#GREEK_SMALL_LETTER_THETA' represents an angle than \uXXXX. Although I, of course, would prefer '&theta' or '{\theta}' or something like that. It is possible that a shortish list of TeX symbols or HTML entities, or both, would suffice.

Readability is one thing; however, I'm not quite sure how layout would be affected by this. I'm often surprised to hear about the problems people experience with layout; it just seems to work for me. (Using Emacs and auto-indent; there's rarely any problem pressing TAB until the right indentation is reached.)

However, now it appears that indentation might change according to the encoding used. How do we solve that?

The simple solution is to count one Unicode character as one indentation character, but that would mean having alignments visually distorted if we are using other notations. Emacs could probably handle this and display things correctly, but do we want that extra complexity?

case t of Rad _ -> foo
          Deg _ -> bar
--        ^visual alignment

case &theta of Rad _ -> foo
          Deg _ -> bar
--        ^aligned, but only by counting

(Ditto for \uXXXXXXXX, of course)

After all, isn't layout intended to make things *easier* to read?

I think I'm in favor of requiring a line break when starting a layout block, but I suppose that will break a lot of existing code. (e.g.

case &theta of
  Rad _ -> foo
  Deg _ -> bar

and only require more indentation (i.e. leading whitespace) than the preceding 'case' opening the block.)

-kzm
--
If I haven't seen further, it is by standing in the footprints of giants
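To make the alignment problem concrete, here is a minimal sketch that measures the same line prefix two ways: as typed in ASCII, and with each `&name` escape counted as one character (the "simple solution" described above). The function names `logicalWidth` and `visualWidth` are made up for this illustration, and the `&name` notation is only one of the candidates discussed, not an agreed syntax.

```haskell
import Data.Char (isAsciiLower)

-- Width of a line when every "&name" escape counts as ONE character,
-- i.e. the column the layout rule would see under the simple solution.
logicalWidth :: String -> Int
logicalWidth ('&' : rest) =
  case span isAsciiLower rest of
    ([], _)     -> 1 + logicalWidth rest   -- bare '&': an ordinary character
    (_, after)  -> 1 + logicalWidth after  -- whole escape counts as one
logicalWidth (_ : cs) = 1 + logicalWidth cs
logicalWidth []       = 0

-- Width of the line as it appears in an ASCII-only editor.
visualWidth :: String -> Int
visualWidth = length
```

For the prefix `"case t of "` both functions give 10, so `Rad` and `Deg` line up on screen. For `"case &theta of "` the logical width is still 10 but the visual width is 15, which is exactly the five-column offset between `Rad` and `Deg` in the second example.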

On Fri, 2002-08-16 at 14:04, Ketil Z. Malde wrote:
Readability is one thing, however, I'm not quite sure how layout would be affected with this. I'm often surprised to hear about the problems people experience with layout, it just seems to work for me. (Using Emacs and auto-indent; there's rarely any problem pressing TAB until the right indentation is reached.)
However, now it appears that indentation might change, according to encoding used. How do we solve that?
The simple solution is to count one Unicode character as one indentation character, but that would mean having alignments visually distorted if we are using other notations. Emacs could probably handle this and display things correctly, but do we want that extra complexity?
case t of Rad _ -> foo
          Deg _ -> bar
--        ^visual alignment
case &theta of Rad _ -> foo
          Deg _ -> bar
--        ^aligned, but only by counting
(Ditto for \uXXXXXXXX, of course)
After all, isn't layout intended to make things *easier* to read?
Oh you're right, I hadn't even thought of that. This would be a pain to use, I suppose. I'm starting to feel this whole unicode-preproc thing brings more problems than it solves. I'll try to summarize:

The problem we're trying to address is this: Alice develops in Unicode. Bob's system doesn't support Unicode, yet. He wants to help Alice. Unicode escapes allow Bob to recode Alice's code before using it. When he sends it back to her, she'll recode it back again to her Unicode encoding.

The problems with the current approach are:

- Ambiguous-looking escape pitfalls.
- Tiresome recoding between Alice and Bob.
- Badly-readable indentation if Alice uses layout as you describe.
- Badly-readable code if Alice uses characters for which no shorthands exist.

I think it will be better if we drop \uhhhh escapes as Simon suggested in the first place. That would force Alice and Bob to agree on a common source format but save us from all the problems. If Alice wants Bob to help her, she won't have a big problem with dropping Unicode for this particular program (also, there exist no \uhhhh escapes in the real world to date, so we can assume the ability to think about it beforehand). In the long run, Bob might even manage to get a Unicode-capable system, solving everyone's problem.

Sven Moritz
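For reference, the Alice/Bob recoding step summarized above could look roughly like the following sketch. It is not the unicode-preproc discussed in the thread; the names `escapeUnicode` and `unescapeUnicode` are hypothetical, and the sketch assumes that only code points up to U+FFFF occur, so four hex digits always suffice.

```haskell
import Data.Char (chr, isHexDigit, ord)
import Numeric (readHex, showHex)

-- Bob's direction: replace every non-ASCII character by \uhhhh.
-- Assumption: no code point above U+FFFF appears in the input.
escapeUnicode :: String -> String
escapeUnicode = concatMap esc
  where
    esc c
      | ord c < 128 = [c]
      | otherwise   = "\\u" ++ pad (showHex (ord c) "")
    pad s = replicate (4 - length s) '0' ++ s

-- Alice's direction: turn every \u followed by four hex digits back
-- into the character it names. Anything else is copied unchanged.
unescapeUnicode :: String -> String
unescapeUnicode ('\\' : 'u' : rest)
  | (hex, after) <- splitAt 4 rest
  , length hex == 4
  , all isHexDigit hex
  , [(n, "")] <- readHex hex
  = chr n : unescapeUnicode after
unescapeUnicode (c : cs) = c : unescapeUnicode cs
unescapeUnicode []       = []
```

Text containing non-ASCII characters round-trips through `escapeUnicode` and then `unescapeUnicode`. The "ambiguous-looking escape" pitfall shows up immediately, though: a lambda such as `\ubeef -> ubeef` binds a variable named `ubeef`, yet `unescapeUnicode` rewrites its first six characters into the single character U+BEEF, because this naive pass cannot tell a lambda from an escape.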
participants (3)
- Ashley Yakeley
- ketil@ii.uib.no
- Sven Moritz Hallberg