[returning to the list after some discussion with Dimitry]

Just to be clear, the Unicode support under discussion comprises only:

- making ord(maxBound::Char) a lot bigger, say 0x10FFFD.
- making the character classification and case conversion functions in the Char module work on the expanded range.

Dimitry's approach is based on mapping Unicode code points to a smaller set of internal codes, both to conserve cell tags (which currently include all character codes) and to reduce the size of the mapping tables. One problem there is that there are an order of magnitude more characters than he thought: 235963 of the 1114110 possible codes are allocated in Unicode 4.0.1. But also, the Haskell Char type belongs to the Bounded, Enum and Ix classes, and so should be a contiguous range of values with no gaps. So I think the Unicode code points should be used internally as well.

How should these be stored on the heap? We could use a scheme similar to the representation of Int: values smaller than some threshold map to cell tags, while larger values are stored in a pair.

How do we implement the conversion functions? The approach recently added to the CVS version of GHC is to use the native libraries, which requires the user to set the locale appropriately. More generally, should these functions be locale-dependent at all? I appreciate that case conversion varies between languages (e.g. i's in Turkish), but I think that requires the IO monad. The Haskell Report is vague on the issue, but I think the functions should just do the default mappings, just as Ord compares character codes rather than using locales.

The alternative is to include tables in the Hugs binary. These could be quite compact: fewer than 800 Unicode characters have upper case mappings, and similarly for lower case mappings. Also, adjacent Unicode characters often belong to the same category: the full range can be divided into fewer than 2000 contiguous subranges of characters belonging to the same category, including gaps of unallocated characters. Applying binary search to these tables would take a little time, but probably not much more than the current in-Haskell implementations. Less than 30k of data would be required.

Another problem is that Hugs contains an array indexed by character values (consCharArray) containing (: c) for each character c. With a much larger range of character codes, one approach would be a hash table holding a limited number of such values, built lazily and bypassed when full. This change could be confined to the consChar() function in builtin.c.

Finally, most of this could be #ifdef UNICODE_CHARS, so the overhead could be turned off if required.
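
For the Int-like heap representation mentioned above, roughly this is what I mean. The names (SMALL_CHAR_MAX, CHAR_TAG_BASE, CHARCELL) and the struct are made up for the sketch, not the real Hugs cell macros:

    /* Sketch of the Int-like scheme: codes below a threshold are encoded
       directly in the cell tag, larger ones are boxed in a tagged pair. */

    #define SMALL_CHAR_MAX 0xFF     /* assumed threshold for tag-encoded chars */
    #define CHAR_TAG_BASE  0x1000   /* assumed start of the character tag range */
    #define CHARCELL       1        /* assumed tag for a boxed character */

    typedef struct {
        int tag;                    /* small chars: CHAR_TAG_BASE + code */
        int value;                  /* boxed chars: the full code point */
    } CharCell;

    static CharCell mkCharCell(int code) {
        CharCell c;
        if (code <= SMALL_CHAR_MAX) {   /* common case: encode in the tag */
            c.tag = CHAR_TAG_BASE + code;
            c.value = 0;
        } else {                        /* rare case: box the code point */
            c.tag = CHARCELL;
            c.value = code;
        }
        return c;
    }

    static int charOf(CharCell c) {
        return c.tag == CHARCELL ? c.value : c.tag - CHAR_TAG_BASE;
    }

The threshold would be chosen to keep the commonly used characters (certainly Latin-1) in the cheap tag-encoded form.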
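
To make the table idea concrete, here is a sketch of the classification lookup: the code space is divided into contiguous subranges sharing a general category, and lookup is a binary search on the start codes. The table below is a tiny made-up fragment, not real Unicode data; the case-mapping tables would be searched the same way:

    #include <stddef.h>

    enum Category { Cc, Zs, Po, Lu, Ll, Cn /* ... */ };

    struct Range { int start; enum Category cat; };

    /* illustrative entries only; the real table would have
       fewer than 2000 of these */
    static const struct Range ranges[] = {
        { 0x0000, Cc }, { 0x0020, Zs }, { 0x0021, Po },
        { 0x0041, Lu }, { 0x005B, Po }, { 0x0061, Ll }, { 0x007B, Po },
    };
    static const size_t nranges = sizeof(ranges) / sizeof(ranges[0]);

    enum Category charCategory(int c) {
        size_t lo = 0, hi = nranges;   /* find the last range with start <= c */
        while (hi - lo > 1) {
            size_t mid = lo + (hi - lo) / 2;
            if (ranges[mid].start <= c)
                lo = mid;
            else
                hi = mid;
        }
        return ranges[lo].cat;
    }

With fewer than 2000 ranges that is at most 11 probes per classification.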
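
For consChar(), one way to do the bounded cache is sketched below. Cell, ap(), nameCons and mkChar() are placeholders for the real Hugs internals, and in this version a slot already taken by a different character is simply bypassed, rather than tracking global fullness; in real Hugs the cached cells would also have to be kept alive across garbage collections:

    typedef int Cell;                   /* placeholder for the Hugs Cell type */
    extern Cell nameCons;               /* the (:) constructor (placeholder) */
    extern Cell ap(Cell fun, Cell arg); /* build an application node (placeholder) */
    extern Cell mkChar(int code);       /* build a character cell (placeholder) */

    #define CONS_CACHE_SIZE 256

    static int  cacheUsed[CONS_CACHE_SIZE];   /* 0 until the slot is filled */
    static int  cachedCode[CONS_CACHE_SIZE];
    static Cell cachedCell[CONS_CACHE_SIZE];

    Cell consChar(int c) {
        int slot = c % CONS_CACHE_SIZE;
        if (cacheUsed[slot]) {
            if (cachedCode[slot] == c)
                return cachedCell[slot];    /* hit: reuse the shared cell */
            return ap(nameCons, mkChar(c)); /* slot taken: bypass the cache */
        }
        cacheUsed[slot] = 1;                /* empty slot: fill it lazily */
        cachedCode[slot] = c;
        cachedCell[slot] = ap(nameCons, mkChar(c));
        return cachedCell[slot];
    }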