
On Tue, Jun 05, 2007 at 10:13:17PM +0100, Alistair Bayley wrote:
In Foreign.C.String, the Haddock comment for CStringLen states: "A string with explicit length information in bytes..." and for CWString something similar: "A wide character string with explicit length information in bytes..."
I know this is a blatant lie, though, because the functions newCStringLen, withCStringLen, newCWStringLen, and withCWStringLen (unless I've grossly misunderstood the code) all return the number of Haskell Chars in the String, i.e. the number of Unicode characters, NOT the number of bytes.
The comment on CWString was incorrect, and has now been corrected to say CWchars, as per the FFI addendum. newCAStringLen and withCAStringLen are specified as working only for Chars in the range 0..255, and as yielding the number of bytes (which, in that range, is the same as the number of characters), and this they do.

The CString functions are specified as performing a locale-based conversion (with the Len counted in bytes), but this has not been implemented: all the CString functions are currently just aliases for their CAString counterparts. (This is now bug #1414.)

newCWStringLen and withCWStringLen are supposed to yield the number of CWchars, and this they do. It just happens that under GNU libc, wchar_t is defined to be UCS-4 (matching Char). Under Windows, wchar_t is a 16-bit quantity and Unicode is encoded as UTF-16, so the counts will differ for characters '\x10000' and higher.
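
To make that concrete, here is a small sketch (my own illustration, not code from the libraries) of what the Len variants report; the wide-string result for a character above '\xFFFF' depends on the platform's wchar_t, as described above:

    import Foreign.C.String (withCAStringLen, withCWStringLen)

    main :: IO ()
    main = do
      -- Every Char here is in the range 0..255, so the CAString length
      -- is both the byte count and the Char count: 5.
      withCAStringLen "hello" $ \(_, len) -> print len
      -- '\x1D11E' (MUSICAL SYMBOL G CLEF) is a single Haskell Char.
      -- With a UCS-4 wchar_t (GNU libc) the reported length is 1;
      -- with a UTF-16 wchar_t (Windows) it is 2, because the character
      -- needs a surrogate pair.
      withCWStringLen "\x1D11E" $ \(_, len) -> print len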
Presumably the length is going to be passed to a foreign function that expects it either in bytes or in Word16/Word32 units. Does anyone have any evidence or opinion as to which case is more common: bytes, or encoding units?
The foreign function should use char* for CString and wchar_t* for CWString, so lengths in those respective units are most appropriate.
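
For instance, a binding to a hypothetical C function size_t count_wide(const wchar_t *s, size_t len) (the name and signature are made up purely for illustration) can pass the CWStringLen pair straight through, with the length counted in CWchars:

    {-# LANGUAGE ForeignFunctionInterface #-}
    import Foreign.C.String (withCWStringLen)
    import Foreign.C.Types (CWchar, CSize)
    import Foreign.Ptr (Ptr)

    -- Hypothetical C function: size_t count_wide(const wchar_t *s, size_t len);
    foreign import ccall unsafe "count_wide"
      c_count_wide :: Ptr CWchar -> CSize -> IO CSize

    countWide :: String -> IO CSize
    countWide s =
      withCWStringLen s $ \(ptr, len) ->
        -- len is already the number of CWchars (wchar_t units),
        -- which is what a wchar_t*-based C API expects.
        c_count_wide ptr (fromIntegral len)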