
Hello cafe,

D'ya fancy an optimisation exercise? In Takusen we currently marshal UTF8-encoded CStrings by first turning the CString into [Word8], and then running this through a [Word8] -> String UTF8 decoder. We thought it would be more space-efficient (and hopefully faster) to marshal directly from the CString buffer, rather than go via an intermediate list. We assumed the most space-efficient way would be to work backwards from the end of the CString buffer, like the peekArray/peekArray0 functions in Foreign.Marshal.Array, so that each decoded Char is consed straight onto the String built so far.

So I implemented it and benchmarked against the original UTF8 marshaling function, which simply converts CString -> [Word8] -> String. And to my surprise, the [Word8] -> String solution seems to be faster, and uses less memory, than the function which creates the String directly from the CString buffer. Now I appreciate that GHC's optimisations are quite effective (presumably deforestation should take credit here), but I thought I'd ask the haskell-cafe optimiser whether we could do better with the direct-from-buffer function. I'm loath to start eyeballing GHC Core, but if needs must... The code is attached (and a simplified sketch of both approaches follows below my sign-off).

I also have some comments/questions about the various CStringLen functions in Foreign.C.String. The Haddock comment for CStringLen states: "A string with explicit length information in bytes..." and for CWString something similar: "A wide character string with explicit length information in bytes..." I know this is a blatant lie, though, because newCStringLen, withCStringLen, newCWStringLen, and withCWStringLen (unless I've grossly misunderstood the code) all return the number of Haskell Chars in the String, i.e. the number of Unicode chars, NOT the number of bytes. However, for the sake of inconsistency, the peekC{W}StringLen functions take, respectively, the number of Word8 or Word16/Word32 elements in the C{W}String array/buffer (whether CWString is Word16 or Word32 depends on your platform, apparently).

So the outputs from newCStringLen etc. are not reliably usable as inputs to their duals (peekCStringLen etc.). The only cases where they do work this way are where the CStrings use a fixed-width encoding, i.e. there are no surrogate or multi-byte units in the encoding, so one Char is always one element.

So we have three different approaches:

 1. the Haddock comments say bytes;
 2. with/newC{W}StringLen return the Unicode char count;
 3. peekC{W}StringLen expect the Word8 or Word16/Word32 element count.

(1) and (3) can be considered equivalent, in the sense that if you know the number of Word16 units then you know the number of Word8 units, and vice versa: just multiply or divide by the unit size (see the helpers below).

It'd be nice if we could have one consistent approach. For a start, I think we should eliminate (2), because it's dead easy to get the number of Unicode chars from a String (length, if you didn't know), whereas the byte or element count is not recoverable without re-encoding. So whether to settle on (1) or (3) depends on what the most likely use case is for the length information. Presumably it's going to be passed to a foreign function which expects the length either in bytes or in Word16/Word32 units. Does anyone have any evidence or opinion as to which case is the most common: bytes, or encoding units?

Alistair
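
P.S. Since the actual code travels as an attachment, here's a rough sketch of the two strategies for anyone reading the archives. The names are made up, the decoder only handles 1- to 3-byte sequences, and there's no validation; it's meant to show the shape of the two approaches, not Takusen's real implementation.

import Data.Bits ((.&.), (.|.), shiftL)
import Data.Char (chr)
import Data.Word (Word8)
import Foreign.C.String (CString)
import Foreign.Marshal.Array (lengthArray0, peekArray0)
import Foreign.Ptr (Ptr, castPtr)
import Foreign.Storable (peekElemOff)

-- Approach 1: CString -> [Word8] -> String, via an intermediate list.
peekUTF8StringViaList :: CString -> IO String
peekUTF8StringViaList cs =
  fmap decodeUTF8 (peekArray0 0 (castPtr cs :: Ptr Word8))

-- Simplified [Word8] -> String decoder: 1- to 3-byte sequences only,
-- no validation, assumes well-formed input.
decodeUTF8 :: [Word8] -> String
decodeUTF8 [] = []
decodeUTF8 (w:ws)
  | w < 0x80 = chr (fromIntegral w) : decodeUTF8 ws
  | w < 0xE0 = case ws of
      w1:rest -> chr (((fromIntegral w .&. 0x1F) `shiftL` 6)
                      .|. (fromIntegral w1 .&. 0x3F)) : decodeUTF8 rest
      []      -> []
  | otherwise = case ws of
      w1:w2:rest -> chr (((fromIntegral w .&. 0x0F) `shiftL` 12)
                         .|. ((fromIntegral w1 .&. 0x3F) `shiftL` 6)
                         .|. (fromIntegral w2 .&. 0x3F)) : decodeUTF8 rest
      _          -> []

-- Decode a single UTF8 sequence, given as a list of bytes.
decode1 :: [Word8] -> Char
decode1 ws = case decodeUTF8 ws of
  c:_ -> c
  []  -> '\xFFFD'  -- replacement char for malformed input

-- Approach 2: build the String directly from the buffer, walking
-- backwards from the end (peekArray0-style) so each Char is consed
-- onto the already-built tail and no intermediate list is needed.
peekUTF8StringDirect :: CString -> IO String
peekUTF8StringDirect cs = do
  let p = castPtr cs :: Ptr Word8
  n <- lengthArray0 0 p
  let go i pend acc
        | i < 0     = return acc
        | otherwise = do
            w <- peekElemOff p i
            if w >= 0x80 && w < 0xC0
              then go (i - 1) (w : pend) acc                -- continuation byte: defer
              else go (i - 1) [] (decode1 (w : pend) : acc) -- lead byte: decode sequence
  go (n - 1) [] []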
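
P.P.S. On the (1)/(3) equivalence above: converting between bytes and wide-char units is just a multiply or divide by the platform's CWchar size. Hypothetical helpers, nothing more:

import Foreign.C.Types (CWchar)
import Foreign.Storable (sizeOf)

-- Number of bytes occupied by n CWchar units, and back again.
wcharUnitsToBytes :: Int -> Int
wcharUnitsToBytes n = n * sizeOf (undefined :: CWchar)

bytesToWcharUnits :: Int -> Int
bytesToWcharUnits b = b `div` sizeOf (undefined :: CWchar)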
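
And on eliminating (2): length recovers the char count from any String for free, but going the other way means walking the string and re-encoding. A hypothetical helper for the UTF8 case, to show that the byte count really is the harder-won figure:

import Data.Char (ord)

-- UTF8 byte length of a String, computed without building the encoded buffer.
utf8ByteLength :: String -> Int
utf8ByteLength = sum . map bytesFor
  where
    bytesFor c
      | ord c < 0x80    = 1
      | ord c < 0x800   = 2
      | ord c < 0x10000 = 3
      | otherwise       = 4

-- e.g. utf8ByteLength "a\xE9\&b" == 4, but length "a\xE9\&b" == 3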