
On Sun, Jul 08, 2007 at 04:38:19PM +0200, Malte Milatz wrote:
Tillmann Rendel:
As I understand it (which may or may not be correct):
A normal Haskell string is basically [Word8]
Hm, let's see whether I understand it better or worse. Actually it is [Char], and Char is a Unicode code point in the range 0..1114111 (at least in GHC). Compare:
Prelude Data.Word> fromEnum (maxBound :: Char)
1114111
Prelude Data.Word> fromEnum (maxBound :: Word8)
255
So it seems that the Char type abstracts the encoding away. I'm actually a little confused by this, because I haven't found any means to make the I/O functions of the Prelude (getContents etc.) encoding-aware: The string "ä", when read from a UTF-8-encoded file via readFile, has a length of 2. Anyone with a URI to enlighten me?
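To see where the length of 2 comes from, here is a minimal sketch of a hand-written UTF-8 encoder. The names encodeChar, encodeString and bytesAsChars are made up for this illustration and are not part of the Prelude; bytesAsChars only models the byte-per-Char behaviour described in this thread, and the encoder makes no attempt to reject surrogate code points.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (chr, ord)
import Data.Word (Word8)

-- | Encode one code point as UTF-8 bytes (illustrative, hand-written).
encodeChar :: Char -> [Word8]
encodeChar c
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [lead 0xC0 6, cont 0]
  | n < 0x10000 = [lead 0xE0 12, cont 6, cont 0]
  | otherwise   = [lead 0xF0 18, cont 12, cont 6, cont 0]
  where
    n = ord c
    -- Leading byte: length tag combined with the top bits of the code point.
    lead tag s = fromIntegral (tag .|. (n `shiftR` s))
    -- Continuation byte: 10xxxxxx carrying six bits of the code point.
    cont s = fromIntegral (0x80 .|. ((n `shiftR` s) .&. 0x3F))

-- | Encode a whole String.
encodeString :: String -> [Word8]
encodeString = concatMap encodeChar

-- | Assumption: model an I/O layer that turns every byte into one Char.
bytesAsChars :: [Word8] -> String
bytesAsChars = map (chr . fromIntegral)
```

With these definitions, encodeString "\228" (the single code point U+00E4, "ä") gives the two bytes 0xC3 and 0xA4, so length (bytesAsChars (encodeString "\228")) is 2 even though the original string has length 1.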
Not sure of any URIs.
Char is just a code point. It's a 32-bit integer (64 on 64-bit platforms due to infelicities in the GHC backend) with a code point. It is not bytes. A Char in the heap also has a tag-pointer, bringing the total to 8 (16) bytes. (However, GHC uses shared Char objects for Latin-1 characters, so a "fresh" Char in that range uses 0 bytes).
[a] is polymorphic. It is a linked list, it consumes 12 (24) bytes per element. It just stores pointers to its elements, and has no hope of packing anything.
[Char] is a linked list of pointers to heap-allocated fullword integers, 20 (40) bytes per character (assuming non-Latin-1).
The GHC IO functions truncate down to 8 bits. There is no way in GHC to read or write full UTF-8, short of doing the encoding yourself (google for UTF8.lhs).
Stefan
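As a rough idea of what "doing the encoding yourself" on the reading side could look like, here is a minimal sketch of a UTF-8 decoder. It is not the code from UTF8.lhs; the name decodeUtf8 is hypothetical, and the input is assumed to be a String in which every Char holds one raw byte (0..255), as described above. Malformed sequences become U+FFFD; overlong encodings and surrogates are not rejected, to keep the sketch short.

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Char (chr, ord)

-- | Decode a String whose Chars each carry one UTF-8 byte.
decodeUtf8 :: String -> String
decodeUtf8 = go . map ord
  where
    go [] = []
    go (b : bs)
      | b < 0x80           = chr b : go bs              -- ASCII
      | b .&. 0xE0 == 0xC0 = multi 1 (b .&. 0x1F) bs    -- 2-byte sequence
      | b .&. 0xF0 == 0xE0 = multi 2 (b .&. 0x0F) bs    -- 3-byte sequence
      | b .&. 0xF8 == 0xF0 = multi 3 (b .&. 0x07) bs    -- 4-byte sequence
      | otherwise          = replacement : go bs        -- stray byte

    -- Consume k continuation bytes, folding six bits at a time into acc.
    multi k acc bs =
      let (cs, rest) = splitAt k bs
      in if length cs == k && all isCont cs
           then safeChr (foldl step acc cs) : go rest
           else replacement : go bs

    step acc c = (acc `shiftL` 6) .|. (c .&. 0x3F)
    isCont c = c .&. 0xC0 == 0x80

    -- chr fails above U+10FFFF, so guard against it.
    safeChr n
      | n > 0x10FFFF = replacement
      | otherwise    = chr n

    replacement = '\xFFFD'
```

For example, decodeUtf8 "\195\164" yields the one-character string "\228" ("ä"), while plain ASCII input comes back unchanged.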