
I wrote
If we turn to Unicode, how should we read
a ⊕ b ⊗ c
Maybe someone has a principled way to tell. I don't.
Rustom Mody wrote:
Without claiming to cover all cases, this is a 'principle'. If we have

    (⊕) :: a -> a -> b
    (⊗) :: b -> b -> c

then ⊕'s precedence should be higher than ⊗'s.
I always have trouble with "higher" and "lower" precedence, because I've used languages where the operator with the bigger number binds tighter and languages where the operator with the bigger number gets to dominate the other. Both are natural enough, but with opposite meanings for "higher".

This principle does not explain why * binds tighter than +, which means we need more than one principle. It also means that if OP1 :: a -> a -> b and OP2 :: b -> b -> a, then OP1 should be higher than OP2 and OP2 should be higher than OP1, which is a bit of a puzzler, unless perhaps you are advocating a vaguely CGOL-ish asymmetric precedence scheme where the precedence on the left and the precedence on the right can be different.

For the record, let me stipulate that I had in mind a situation where OP1, OP2 :: a -> a -> a. For example, APL uses the floor and ceiling operators infix to stand for max and min. This principle offers us no help in ordering max and min.

Or consider APL again, whence I'll borrow (using ASCII because this is webmail tonight)

    take, rotate :: Int -> Vector t -> Vector t

Haskell applies operator precedence before it does type checking, so how would it know to parse n `take` m `rotate` v as n `take` (m `rotate` v)? I don't believe there was anything in my original example to suggest that either operator had two operands of the same type, so I must conclude that this principle fails to provide any guidance in cases like that one.
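To make that concrete, here is a minimal Haskell sketch (the names, types, and fixities below are my own invention, loosely borrowed from APL, not from any real library). Whatever fixities we declare, the parser commits to a grouping before the type checker ever runs:

    -- Sketch only: APL-inspired operators with invented names and fixities.
    -- take' is primed to avoid clashing with Prelude.take.
    infixr 5 `take'`, `rotate`

    take' :: Int -> [t] -> [t]
    take' = Prelude.take

    rotate :: Int -> [t] -> [t]
    rotate n xs = drop k xs ++ Prelude.take k xs
      where k = n `mod` max 1 (length xs)

    -- With both operators at infixr 5, the expression below parses as
    -- 2 `take'` (1 `rotate` [1,2,3,4]), which happens to be the only
    -- type-correct grouping -- but the parser chose it from the fixity
    -- declarations alone, knowing nothing about the types.
    example :: [Int]
    example = 2 `take'` 1 `rotate` [1,2,3,4]   -- [2,3]

Had we declared them infixl instead, the same expression would parse as (2 `take'` 1) `rotate` [1,2,3,4] and be rejected by the type checker; no principle about types can tell the parser which grouping to choose.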
This is what makes it natural to have the precedences of (+), (<), (&&) in decreasing order.
This is also why the bitwise operators in C have the wrong precedence:
Oh, I agree with that!
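For the record, the Prelude's own fixity declarations follow exactly the ordering Rustom describes:

    -- Fixity declarations from the Haskell 2010 Prelude:
    infixl 6 +     -- arithmetic binds tightest of the three
    infix  4 <     -- then comparison
    infixr 3 &&    -- conjunction binds loosest
    -- so a + b < c && d means ((a + b) < c) && d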
The error comes (probably) from treating & as close to the logical operators like && whereas in fact it is more akin to arithmetic operators like +.
The error comes from BCPL, where & and && were the same operator (similarly | and ||). At some point in the evolution of C from BCPL the operators were split apart, but the bitwise ones were left in the wrong place.
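Haskell's Data.Bits, by contrast, puts the bitwise operators back where the arithmetic kinship says they belong; GHC's Data.Bits declares infixl 7 .&. (like *) and infixl 5 .|. A small illustration:

    import Data.Bits ((.&.))

    -- Because .&. is infixl 7 and (/=) is infix 4, a mask test
    -- groups the way C programmers wish  x & mask != 0  did:
    isSet :: Int -> Int -> Bool
    isSet x mask = x .&. mask /= 0   -- parses as (x .&. mask) /= 0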
There are of course other principles: Dijkstra argued vigorously that, boolean algebra being completely symmetric in (∨, True) and (∧, False), ∧ and ∨ should have the same precedence.
Evidently not too many people agree with him!
Sadly, I am reading this in a web browser where the Unicode symbols are completely garbled. (More precisely, I think it's WebMail doing it.) Maybe Unicode isn't ready for prime time yet?

You might be interested to hear that in the Ada programming language, you are not allowed to mix 'and' with 'or' (or 'and then' with 'or else') without using parentheses. The rationale is that the designers did not believe that enough programmers understood the precedence of and/or. The GNU C compiler kvetches when you have p && q || r without otiose parentheses.

It seems that there are plenty of designers out there who agree with Dijkstra, not out of a taste for well-engineered notation, but out of contempt for the Average Programmer.
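Haskell itself, for what it's worth, sides with C's grouping rather than Dijkstra's: the Prelude declares

    infixr 3 &&   -- conjunction binds tighter
    infixr 2 ||   -- disjunction binds looser

so p && q || r means (p && q) || r, and GHC accepts it without comment.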
When I studied C (nearly 30 years ago now!) we used gets as a matter of course. Today we don't.
Hmm. I started with C in late 1979. Ouch. That's 34 and a half years ago. This was under Unix version 6+, with a slightly "pre-classic" C. A little later we got EUC Unix version 7, and a 'classic' C compiler that, oh joy, supported /\ (min) and \/ (max) operators. [With a bug in the code generator that I patched.]
Are Kernighan and Ritchie wrong in teaching it? Are today's teachers wrong in proscribing it?
I believe the only reasonable outlook is that truth changes with time: it was OK then; it's not today.
In this case, bull-dust! gets() is rejected today because a botch in its design makes it bug-prone. Nothing has changed. It was bug-prone 34 years ago. It has ALWAYS been a bad idea to use gets().

Amongst other things, the Unix manuals have always presented the difference between gets() -- discards the terminator -- and fgets() -- annoyingly retains the terminator -- as a bug which they thought it was too late to fix; after all, C had hundreds of users!

No, it was obvious way back then: you want to read a line? Fine, WRITE YOUR OWN FUNCTION, because there is NO C library function that does quite what you want. The great thing about C was that you *could* write your own line-reading function without suffering. Not only would your function do the right thing (whatever you conceived that to be), it would be as fast, or nearly as fast, as the built-in one. Try doing *that* in PL/I!

No, in this case, *opinions* may have changed, people's *estimation* of and *tolerance for* the risks may have changed, but the truth has not changed.
Likewise DOCTYPE-missing and charset-other-than-UTF-8. Random example showing how right yesterday becomes wrong today: http://www.sitepoint.com/forums/showthread.php?660779-Content-type-iso-8859-...
Well, "missing" DOCTYPE is where it starts to get a bit technical. An SGML document is basically made up of three parts: - an SGML declaration (meta-meta-data) that tells the parser, amongst other things, what characters to use for delimiters, whether various things are case sensitive, what the numeric limits are, and whether various features are enabled. - a Document Type Declaration (meta-data) that conforms to the lexical rules set up by the SGML declaration and defines (a) the grammar rules and (b) a bunch of macros. - a document (data). The SGML declaration can be supplied to a parser as data (and yes, I've done that), or it can be stipulated by convention (as the HTML standards do). In the same way, the DTD can be - completely declared in-line - defined by reference with local amendments - defined solely by reference - known by convention. If there is a convention that a document without a DTD uses a particular DTD, SGML is fine with that. (It's all part of "entity management", one of the minor arcana of SGML.) As for the link in question, it doesn't show right turning into wrong. A quick summary of the sensible part of that thread: - If you use a <meta> tag to specify the encoding of your file, it had better be *right*. This has been true ever since <meta> tags first existed. - If you have a document in Latin 1 and any characters outside that range are written as character entity references or numeric character references, there is no need to change. No change of right to wrong here! - If you want to use English punctuation marks like dashes and curly quotes, using UTF-8 will let you write these characters without character entities or NCRs. This is only half true. It will let you do this conveniently IF your local environment has fonts that include the characters. (Annoyingly, in Mac OS 10.6, which I'm typing on, Edit|Special characters is not only geographically confused, listing Coptic as a *European* script -- last type I checked Egypt was still in Africa -- but it doesn't display any Coptic characters. In the Mac OS 10.7 system I normally use, Edit|Special characters got dramatically worse as an interface, but no more competent with Coptic characters. Just because a character is in Unicode doesn't mean it can be *used*, practically speaking.) Instead of saying that what is wrong has become or is becoming right, I'd prefer to say that what was impossible is becoming possible and what was broken (Unicode font support) is gradually getting fixed. - Some Unicode characters, indeed, some Latin 1 characters, are so easy to confuse with other characters that it is advisable to use character entities. Again, nothing about wrong turning into right. This was good advice as soon as Latin 1 came out.
Unicode vs ASCII in program source is similar (I believe).
Well, not really. People using specification languages like Z routinely used characters way outside the ASCII range; one way was to use LaTeX. Another way was to have GUI systems that let you key in using LaTeX character names or menus but see the intended characters.

Back in about 1984 I was able to use a 16-bit character set on the Xerox Lisp Machines. I've still got a manual for the XNS character set somewhere. In one of the founding documents for the ISO Prolog standard, I recommended, in 1984, that the Prolog standard allow a wide range of characters. That's THREE YEARS before Unicode was a gleam in its founders' eyes. This is NOT new. As soon as there were bit-mapped displays and laser printers, there was pressure to allow a wider range of characters in programs. Let me repeat that: 30 years ago I was able to use non-ASCII characters in computer programs. *Easily*, via virtual keyboards.

In 1987, the company I was working at in California revamped their system to handle 16-bit characters and we bought a terminal that could handle Japanese characters. Of course this was because we wanted to sell our system in Japan. But this was shortly before X11 came out; the MIT window system of the day was X10, and the operating system we were using the 16-bit characters on was VMS. That's 27 years ago. This is not new.

So what _is_ new?

* A single standard. Wait, we DON'T have a single standard. We have a single standard *provider* issuing a rapid series of revisions of an increasingly complex standard, where entire features are first rejected outright, then introduced, and then deprecated again. Unicode 6.3 came out last year with five new characters (bringing the total to 110,122), over a thousand new character *variants*, two new normative properties, and a new BIDI algorithm which I don't yet understand. And Unicode 7.0 is due out in 3 months. Because of this:
  - different people WILL have tools that understand different versions of Unicode; in fact, different tools in the same environment may do this;
  - your beautiful character WILL show up as garbage or even blank on someone's screen UNLESS it is an old or extremely popular (can you say Emoji? I knew you could. Can you teach me how to say it?) one;
  - when proposing to exploit Unicode characters, it is VITAL to understand what the Unicode "stability" rules are and which characters have what stable properties.

* With large cheap discs, large fonts are looking like a lot less of a problem. (I failed to learn to read the Armenian letters, but do have those. I succeeded in learning to read the Coptic letters -- but not the language(s)! -- but don't have those. Life is not fair.)

* We now have (a series of versions of) a standard character set containing a vast number of characters. I very much doubt whether there is any one person who knows all the Unicode characters.

* Many of these characters are very similar. I counted 64 "right arrow" characters before I gave up; this didn't include harpoons. Some of these are _very_ similar. Some characters are visibly distinct, but normally regarded as mere stylistic differences. For example, <= has at least three variations (one bar, slanted; one bar, flat; two bars, flat) which people familiar with less than or equal have learned *not* to tell apart. But they are three different Unicode characters, from which we could make three different operators with different precedence or associativity, and of course type.
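To see how sharp that last point is, here is a small sketch (entirely mine, not a proposal): GHC will happily accept look-alike Unicode characters as distinct operators, with whatever fixities we care to give them.

    -- Three distinct Unicode "less than or equal" characters, defined
    -- as three distinct operators with deliberately different fixities.
    -- A reader trained not to tell the glyphs apart would never notice.
    infix  4 ≤    -- U+2264 LESS-THAN OR EQUAL TO
    infix  4 ⩽    -- U+2A7D LESS-THAN OR SLANTED EQUAL TO
    infixr 5 ≦    -- U+2266 LESS-THAN OVER EQUAL TO

    (≤), (⩽), (≦) :: Ord a => a -> a -> Bool
    (≤) = (<=)
    (⩽) = (<=)
    (≦) = (<=)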
My thoughts on this (of a philosophical nature) are: http://blog.languager.org/2014/04/unicode-and-unix-assumption.html
If we can get the broader agreements (disagreements!) out of the way to start with, we may then look at the details.
I think Haskell can tolerate an experimental phase where people try out a lot of things, as long as everyone understands that it *IS* an experimental phase, and as long as experimental operators are kept out of Hackage, certainly out of the Platform, or at least segregated into areas with big flashing "danger" signs.

I think a *small* number of "pretty" operators can be added to Haskell without the sky falling, and I'll probably quite like the result. (Does anyone know how to get a copy of the collected The Squiggolist?) Let's face it, if a program is full of Armenian identifiers or Ogham ones I'm not going to have a clue what it's about anyway. But keeping the "standard" -- as in used in core modules -- letter and operator sets smallish is probably a good idea.