
On Tue, Jun 05, 2007 at 10:13:17PM +0100, Alistair Bayley wrote:
In Foreign.C.String, the Haddock comment for CStringLen states: "A string with explicit length information in bytes..." and for CWString something similar: "A wide character string with explicit length information in bytes..."
I know this is a blatant lie, though, because the functions newCStringLen, withCStringLen, newCWStringLen, and withCWStringLen (unless I've grossly misunderstood the code) all return the number of Haskell Chars in the String, i.e. the number of Unicode characters, NOT the number of bytes.
The comment on CWString was incorrect, and has now been corrected to say CWchars, as per the FFI addendum. newCAStringLen and withCAStringLen are specified as working only for Chars in the range 0..255, and as yielding the number of bytes (which, in that range, is the same as the number of characters), and this they do.

The CString functions are specified as performing a locale-based conversion (with the Len counted in bytes), but this has not been implemented: all the CString functions are currently just aliases for their CAString counterparts. (This is now bug #1414.)

newCWStringLen and withCWStringLen are supposed to yield the number of CWchars, and this they do. It just happens that under GNU libc, wchar_t is defined to be UCS-4 (matching Char). Under Windows, wchar_t is a 16-bit quantity and Unicode is encoded as UTF-16, so the counts will differ for characters '\x10000' and higher.
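
To make that concrete, here is a small sketch (my own illustration, not code from the libraries) of what the Len variants report; the wide-string result for a character above '\xFFFF' depends on the platform's wchar_t, as described above:

    import Foreign.C.String (withCAStringLen, withCWStringLen)

    main :: IO ()
    main = do
      -- Every Char here is in the range 0..255, so the CAString length
      -- is both the byte count and the Char count: 5.
      withCAStringLen "hello" $ \(_, len) -> print len
      -- '\x1D11E' (MUSICAL SYMBOL G CLEF) is a single Haskell Char.
      -- With a UCS-4 wchar_t (GNU libc) the reported length is 1;
      -- with a UTF-16 wchar_t (Windows) it is 2, because the character
      -- needs a surrogate pair.
      withCWStringLen "\x1D11E" $ \(_, len) -> print len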
Presumably the length is going to be passed to a foreign function that expects it either in bytes or in Word16/Word32 units. Does anyone have any evidence or opinion as to which case is more common: bytes, or encoding units?
The foreign function should use char* for CString and wchar_t* for CWString, so lengths in those respective units are most appropriate.
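
For instance, a binding to a hypothetical C function size_t count_wide(const wchar_t *s, size_t len) (the name and signature are made up purely for illustration) can pass the CWStringLen pair straight through, with the length counted in CWchars:

    {-# LANGUAGE ForeignFunctionInterface #-}
    import Foreign.C.String (withCWStringLen)
    import Foreign.C.Types (CWchar, CSize)
    import Foreign.Ptr (Ptr)

    -- Hypothetical C function: size_t count_wide(const wchar_t *s, size_t len);
    foreign import ccall unsafe "count_wide"
      c_count_wide :: Ptr CWchar -> CSize -> IO CSize

    countWide :: String -> IO CSize
    countWide s =
      withCWStringLen s $ \(ptr, len) ->
        -- len is already the number of CWchars (wchar_t units),
        -- which is what a wchar_t*-based C API expects.
        c_count_wide ptr (fromIntegral len)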