Re: [Haskell'-private] pragmas and annotations (RE: the record system)

"Simon Marlow"
How does ENCODING work for a UTF-16 file, for example? We don't know the file is UTF-16 until we read the ENCODING pragma, and we can't read the ENCODING pragma because it's in UTF-16.
Use the same type of heuristic as XML uses (for instance).

* If the first three bytes of the file are "{-#", then keep reading in ASCII/Latin-1/whatever until you discover an ENCODING decl (or not).
* If the first six bytes of the file are one of the two possible UTF-16 representations of "{-#", then assume UTF-16 with that byte-encoding until we find the ENCODING decl. (A missing decl in this case would be an error.)
* If the first twelve bytes of the file are a UCS-4 representation of "{-#" then ... you get the picture.
* For UTF-16 and UCS-4 variations, you must also permit the file to begin with an optional byte-order mark (two or four bytes).
* Otherwise, there is no ENCODING pragma, so assume the implementation default of {ASCII, Latin-1, UTF-8, ...}.

I know it's pretty horrible, but it seems to work in practice for the XML people.

In practice, the ENCODING decl is most needed for those that have ASCII as a subset - one could argue that the heuristic tells you the UTF-16 and UCS-4 variations without needing a pragma. (But then, how would you guarantee that the first three characters in the file must be "{-#" ?)

Regards,
Malcolm
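For readers who want to see the heuristic above spelled out, here is a minimal sketch in Python. It is only an illustration of the steps listed, not anything proposed in the thread: the function name `sniff_encoding` and the labels it returns are made up for the example, and UCS-4 is approximated by Python's UTF-32 codecs, which is an assumption.

```python
import codecs

PRAGMA_START = "{-#"

# Wide encodings to try, four-byte forms first. The codec names are
# Python's; treating UCS-4 as UTF-32 is an assumption of this sketch.
CANDIDATES = [
    ("utf-32-be", codecs.BOM_UTF32_BE),
    ("utf-32-le", codecs.BOM_UTF32_LE),
    ("utf-16-be", codecs.BOM_UTF16_BE),
    ("utf-16-le", codecs.BOM_UTF16_LE),
]

def sniff_encoding(data: bytes):
    """Guess an encoding family from the first bytes of a source file.

    Returns (label, pragma_required). pragma_required is True for the
    wide encodings, where a missing ENCODING decl would be an error.
    """
    for name, bom in CANDIDATES:
        # The byte-order mark is optional, so skip it if present.
        body = data[len(bom):] if data.startswith(bom) else data
        # "{-#" is 6 bytes in UTF-16 and 12 bytes in UCS-4.
        if body.startswith(PRAGMA_START.encode(name)):
            return name, True
    if data.startswith(PRAGMA_START.encode("ascii")):
        # Keep reading as ASCII/Latin-1/whatever; the decl may or may not follow.
        return "ascii-compatible", False
    # No leading pragma at all: fall back to the implementation default.
    return "default", False
```

The caller would still have to go on and parse the ENCODING decl itself; the sketch only covers the "which byte layout am I looking at" step.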

Malcolm Wallace wrote:
(But then, how would you guarantee that the first three characters in the file must be "{-#" ?)
In particular, what do you propose for literate source? (I hardly have any .hs files.)

As far as I can see, it seems to be possible to get LaTeX to work with UTF-8; the (apparently not extremely active) ``Unicode TeX project'' Omega apparently started out with ``16-bit Unicode'' (http://omega.enstb.org/) and has now turned to 31-bit characters (http://omega.cse.unsw.edu.au/omega/), and the future may of course bring us other variants...

(Isn't it great that we can add a new dimension to Wadler's law by discussing character encodings? ;-)

Wolfram

Malcolm Wallace wrote:
* If the first three bytes of the file are "{-#", then keep reading in ASCII/Latin-1/whatever until you discover an ENCODING decl (or not).
* If the first six bytes of the file are one of the two possible UTF-16 representations of "{-#", then assume UTF-16 with that byte-encoding until we find the ENCODING decl. (A missing decl in this case would be an error.)
* If the first twelve bytes of the file are a UCS-4 representation of "{-#" then ... you get the picture.
* For UTF-16 and UCS-4 variations, you must also permit the file to begin with an optional byte-order mark (two or four bytes).
You'd also want to look for the UTF-8 BOM, which is very common on Windows.

As for literate source, I suppose you could forbid .lhs files from using UTF-16 or UCS-4 unless there's a BOM. Then unlit wouldn't need to know the encoding (I think), and the .hs heuristics would work on the output.

-- Ben
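The BOM check suggested here could look roughly like the following Python sketch. The names `detect_bom` and `lhs_needs_transcoding` are hypothetical, UCS-4 is again approximated by UTF-32, and the second function encodes just one reading of the .lhs rule: a literate file with no wide BOM is taken to be ASCII-compatible, so the unlit markers can be found at the byte level.

```python
import codecs

# Four-byte marks come first: the UTF-32 LE mark (FF FE 00 00) begins
# with the UTF-16 LE mark (FF FE), so testing the short one first
# would misclassify the file.
BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

def detect_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) if no BOM is present."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0

def lhs_needs_transcoding(data: bytes) -> bool:
    """True if a literate file announces UTF-16 or UCS-4 via its BOM.

    Under the rule sketched above, any .lhs file without such a mark is
    assumed ASCII-compatible and can be unlit byte by byte.
    """
    name, _ = detect_bom(data)
    return name is not None and name != "utf-8"
```

One caveat with any scheme like this: a UTF-16 LE file whose first character after the BOM is U+0000 has the same leading bytes as a UTF-32 LE BOM, so the two cannot be told apart from the mark alone.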
participants (3): Ben Rudiak-Gould, kahl@cas.mcmaster.ca, Malcolm Wallace