Unicode support in Hugs - alpha-patch available
Hi,

Anyone interested in Unicode support in Hugs (which it lacks so far), please check out this URL:

http://www.golubovsky.org/software/hugs-patch/article.html

I have written a patch for the November 2002 release of Hugs that enables internal handling of Unicode characters by Hugs. The URL above points to the article I wrote to explain the details. The article also contains links to download the patch itself and the demonstration/testing program.

Any feedback is welcome. I am especially interested to hear from the core developers of Hugs whether they are interested in incorporating these changes into the current CVS version and into future releases.

PS Hopefully this livens up the traffic on the list - for several months there have been too few messages.

--
Dmitry M. Golubovsky
South Lyon, MI
Hi Dimitry,

| Anyone interested in Unicode support in Hugs (what it lacks so far)
| please check out this URL:
|
| http://www.golubovsky.org/software/hugs-patch/article.html

Thanks for posting this. I notice from your comments that you were trying to figure out how Hugs works from the C code alone. For many purposes, you and others trying to do similar things might find that the report I wrote about the implementation of Gofer (from which Hugs was derived) still contains many relevant tidbits of information that might make some things easier to understand. You can find it online at:

http://www.cse.ogi.edu/~mpj/pubs/goferimp.html

You might also find it interesting for its historical perspective. For example, it might explain some of the design decisions that made sense back then but seem harder to justify today (such as the lack of support for Unicode, which barely existed back then ...)

All the best,
Mark
On Sun, Aug 17, 2003 at 11:35:31PM -0400, Dimitry Golubovsky wrote:
Anyone interested in Unicode support in Hugs (what it lacks so far) please check out this URL:
http://www.golubovsky.org/software/hugs-patch/article.html
I have written a patch for the November 2002 release of Hugs that enables internal handling of Unicode characters by Hugs. The URL above points to the article I wrote to explain the details. The article also contains links to download the patch itself and the demonstration/testing program.
As a general comment: your patch converts the Unicode Database into an internal table in Hugs for use by primitives. An alternative approach is used by a recent addition of Unicode support to GHC: use the native wide character functions iswupper(), towupper(), etc where these are available. The current CVS version of Hugs also includes an optimization of the whatis() code, which may clash with your changes. However the speed gains from that change are modest -- increased functionality may be more important.
[The] number of distinct characters defined by the Unicode Database (UnicodeData.txt, available from www.unicode.org) is 15100 for the most recent version (4.0), with Unicode character values ranging from 0x0000 to 0x10FFFD. So, the position of a character in the Unicode character table may be used by Hugs as the internal character code.
UnicodeData.txt may contain that many character lines, but it includes pairs of lines like

4E00;<CJK Ideograph, First>;...
9FA5;<CJK Ideograph, Last>;...

each of which stands for a whole range of characters, so counting lines greatly understates the number of assigned characters.
[returning to the list after some discussion with Dimitry]

Just to be clear, the Unicode support under discussion comprises only:

- making ord(maxBound::Char) a lot bigger, say 0x10FFFD.
- making the character classification and case conversion functions in the Char module work on the expanded range.

Dimitry's approach is based on mapping Unicode code points to a smaller set of internal codes, both to conserve cell tags (which currently include all character codes) and to reduce the size of the mapping tables. One problem there is that there are an order of magnitude more characters than he thought: 235963 out of 1114110 possible codes are allocated in Unicode 4.0.1. But also, the Haskell Char type belongs to the Bounded, Enum and Ix classes, and so should be a contiguous range of values with no gaps. So I think the Unicode code points should be used internally as well.

How should these be stored on the heap? We could use a scheme similar to the representation of Int: values smaller than some threshold map to cell tags, while larger values are stored in a pair.

How do we implement the conversion functions? The approach recently added to the CVS version of GHC is to use the native libraries, which requires the user to set the locale appropriately. More generally, should these functions be locale-dependent at all? I appreciate that case conversion varies between languages (e.g. i's in Turkish), but I think that requires the IO monad. The Haskell Report is vague on the issue, but I think the functions should just do the default mappings, just as Ord compares character codes rather than using locales.

The alternative is to include tables in the Hugs binary. These could be quite compact: fewer than 800 Unicode characters have upper case mappings, and similarly for lower case mappings.
Also, adjacent Unicode characters often belong to the same category: the full range can be divided into fewer than 2000 contiguous subranges of characters belonging to the same category, including gaps of unallocated characters. Applying binary search to these tables would take a little time, but probably not much more than the current in-Haskell implementations. Less than 30k of data would be required.

Another problem is that Hugs contains an array indexed by character values (consCharArray) containing (: c) for each character c. With a much larger range of character codes, one approach would be a hash table holding a limited number of such values, built lazily and bypassed when full. This change could be confined to the consChar() function in builtin.c.

Finally, most of this could be #ifdef UNICODE_CHARS, so the overhead could be turned off if required.
OK, I see where I was wrong - thanks for pointing it out.

I would agree with Ross that it is better to embed some minimal information about characters (case conversion, category) into the Hugs core, because almost every program needs it, and the rest (like directionality, composition, etc.) might be implemented later for programs that really need it in some special way.

Ross Paterson wrote:
[returning to the list after some discussion with Dimitry]
Just to be clear, the Unicode support under discussion comprises only:
- making ord(maxBound::Char) a lot bigger, say 0x10FFFD.
- making the character classification and case conversion functions in the Char module work on the expanded range.
[skip]

--
Dmitry M. Golubovsky
South Lyon, MI
On Mon, Aug 25, 2003 at 12:18:24PM +0100, Ross Paterson wrote:
Just to be clear, the Unicode support under discussion comprises only:
- making ord(maxBound::Char) a lot bigger, say 0x10FFFD.
- making the character classification and case conversion functions in the Char module work on the expanded range.
I forgot string literals, which are essential for a basic implementation. They could be handled by re-interpreting the internal String type (not the Haskell type) as UTF-encoded strings. It's all feasible, but it's a bit more complicated than I thought at first.
participants (3)
- Dimitry Golubovsky
- Mark P Jones
- Ross Paterson