
Hello,

[I sent this to the cafe list a week or two ago, but it was buried at the end of a message about optimisation, and got no responses.]

In Foreign.C.String, the Haddock comment for CStringLen states: "A string with explicit length information in bytes..." and the one for CWStringLen says something similar: "A wide character string with explicit length information in bytes..."

I know this is a blatant lie, though, because newCStringLen, withCStringLen, newCWStringLen, and withCWStringLen (unless I've grossly misunderstood the code) all return the number of Haskell Chars in the String, i.e. the number of Unicode chars, NOT the number of bytes.

However, for the sake of inconsistency, the peekC{W}StringLen functions take, respectively, the number of Word8 or Word16/Word32 elements in the C{W}String array/buffer (whether CWString is Word16 or Word32 depends on your platform, apparently). So the outputs from newCStringLen etc. are not reliably usable as inputs to their duals (peekCStringLen etc.). The only cases in which they do work this way are where the CStrings use a fixed-width encoding, i.e. there are no surrogate units in the encoding.

So we have three different approaches:

 1. the Haddock comments say bytes
 2. with/newC{W}StringLen return the Unicode char count
 3. peekC{W}StringLen expects a Word8 or Word16/Word32 count

(1) and (3) can be considered equivalent, in the sense that if you know the number of Word16/Word32 units then you know the number of bytes (Word8 units), and vice versa.

It'd be nice if we could have one consistent approach. For a start, I think we should eliminate (2), because it's dead easy to get the number of Unicode chars from a String.

So whether to settle on (1) or (3) depends on what the most likely use case is for the length information. Presumably it's going to be passed to a foreign function which expects the length either in bytes or in Word16/Word32 units. Does anyone have any evidence or opinion as to which case is the most common: bytes, or encoding units?

Alistair
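P.S. Here's a minimal sketch of the mismatch I'm describing, assuming a platform where CWString is UTF-16 (e.g. Windows) and assuming newCWStringLen really does return the Char count as I claim above. In that case the round trip below should drop the trailing half of the surrogate pair:

  import Foreign.C.String (newCWStringLen, peekCWStringLen)
  import Foreign.Marshal.Alloc (free)

  main :: IO ()
  main = do
    -- '\x1D11E' (MUSICAL SYMBOL G CLEF) is outside the BMP, so in a
    -- UTF-16 encoding it needs a surrogate pair: 4 Chars, 5 Word16 units.
    let s = "abc\x1D11E"
    (buf, len) <- newCWStringLen s
    putStrLn ("length returned by newCWStringLen: " ++ show len)
    -- Feed the same length straight back to the dual; if len is a Char
    -- count rather than a Word16 count, the trailing surrogate is lost.
    s' <- peekCWStringLen (buf, len)
    putStrLn ("round trip: " ++ show s')
    free buf

On a platform where CWString is Word32 the Char count and the unit count happen to coincide for this string, so the discrepancy there only shows up against the "bytes" wording in the Haddock comments.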