
On 04 May 2005 15:57, Bulat Ziganshin wrote:
Is it true that, to support Unicode source files, only the StringBuffer implementation must be changed?
It depends on whether you want to support several different encodings, or just UTF-8. If we only want to support UTF-8, then we can keep the StringBuffer in UTF-8, and the FastStrings too. (Or you could re-encode the other encodings into UTF-8.)
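For illustration only, here is a minimal sketch of what decoding the next character out of a UTF-8 byte buffer looks like, modelling the buffer as a plain [Word8] rather than GHC's actual StringBuffer type; the function name and the crude error handling are my own, not anything from the compiler:

    import Data.Bits ((.&.), (.|.), shiftL)
    import Data.Char (chr)
    import Data.Word (Word8)

    -- Decode one character from a UTF-8 byte stream, returning the
    -- character and the remaining bytes.  Malformed input yields Nothing.
    nextChar :: [Word8] -> Maybe (Char, [Word8])
    nextChar []     = Nothing
    nextChar (b:bs)
      | b < 0x80              = Just (chr (fromIntegral b), bs)        -- ASCII
      | b >= 0xC0 && b < 0xE0 = multi 1 (fromIntegral b .&. 0x1F) bs   -- 2-byte sequence
      | b >= 0xE0 && b < 0xF0 = multi 2 (fromIntegral b .&. 0x0F) bs   -- 3-byte sequence
      | b >= 0xF0 && b < 0xF8 = multi 3 (fromIntegral b .&. 0x07) bs   -- 4-byte sequence
      | otherwise             = Nothing                                -- bad lead byte
      where
        multi 0 acc rest = Just (chr acc, rest)
        multi n acc (c:rest)
          | c .&. 0xC0 == 0x80 =
              multi (n - 1) ((acc `shiftL` 6) .|. (fromIntegral c .&. 0x3F)) rest
        multi _ _ _      = Nothing   -- truncated or malformed continuation

The point is just that the buffer can stay as bytes; the lexer asks for one decoded Char at a time, and pure ASCII input takes the cheap first branch.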
If so, then the task could be simplified by converting any file read by hGetStringBuffer to a UTF-32 (PackedString) representation and keeping the in-memory array in that form. After this, we would change the indexing of the ByteArray to indexing of an Array Int Char, and replace the call to mkFastSubStringBA# accordingly.
This is the other alternative. It uses rather more memory, but that might not be an issue. The other thing that needs to be changed is the lexer, so that it can recognise classes of Unicode characters (i.e. upper/lower case for identifiers, symbol characters, etc.). The code recently added to the libraries can be used for this, I believe. The question is what Alex should see for a Unicode character: Alex currently assumes that characters are in the range 0-255 (you need a fixed range in order to generate the lexer tables). One possibility is to map all Unicode upper-case characters to a single character code for Alex, and similarly for the other classes of character.
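As a rough sketch of that last idea (a hypothetical helper, not GHC's actual lexer code), one could collapse every non-ASCII character into one of a handful of reserved byte codes according to its Unicode class, using the classification functions in Data.Char; the particular codes 0xF0-0xF5 below are made up for illustration and would have to agree with whatever the Alex specification actually uses:

    import Data.Char (isDigit, isLower, isPunctuation, isSpace, isSymbol, isUpper, ord)
    import Data.Word (Word8)

    -- Map a Char to the byte that the table-driven lexer sees.
    classifyForAlex :: Char -> Word8
    classifyForAlex c
      | ord c < 0x80 = fromIntegral (ord c)   -- ASCII passes through unchanged
      | isUpper c    = 0xF0                   -- any non-ASCII upper-case letter
      | isLower c    = 0xF1                   -- any non-ASCII lower-case letter
      | isDigit c    = 0xF2                   -- any non-ASCII digit
      | isSpace c    = 0xF3                   -- any non-ASCII white space
      | isSymbol c || isPunctuation c = 0xF4  -- any non-ASCII symbol/punctuation
      | otherwise    = 0xF5                   -- everything else

The lexer tables then only ever deal with a 0-255 alphabet, while the token text itself is still taken from the original buffer, so no information is lost.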
BTW, why does the FastString module store Unicode strings as [Int] rather than as a String?
Probably for reasons that are no longer relevant. When we changed Char from 8 to 32 bits, we still had to compile GHC with older versions of itself that only supported 8-bit Chars.

Cheers,
Simon