
On Wed, 2007-05-23 at 10:45 +0100, Alistair Bayley wrote:
Hello cafe,
D'ya fancy an optimisation exercise?
In Takusen we currently marshal UTF8-encoded CStrings by first turning the CString into [Word8], and then running that through a [Word8] -> String UTF8 decoder. We thought it would be more space-efficient (and hopefully faster) to marshal directly from the CString buffer, rather than via an intermediate list. We assumed the most space-efficient approach would be to work backwards from the end of the CString buffer, like the peekArray/peekArray0 functions in Foreign.Marshal.Array.
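For concreteness, here is a minimal sketch of the list-based pipeline described above (CString -> [Word8] -> String). The names are illustrative, not Takusen's actual code, and the decoder does no validation of malformed or truncated sequences:

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Char (chr)
import Data.Word (Word8)
import Foreign.C.String (CString)
import Foreign.Marshal.Array (peekArray0)
import Foreign.Ptr (castPtr)

-- Marshal a NUL-terminated UTF8 CString via an intermediate [Word8].
peekUTF8String :: CString -> IO String
peekUTF8String cs = do
  ws <- peekArray0 0 (castPtr cs)  -- NUL-terminated buffer -> [Word8]
  return (decodeUTF8 ws)

-- A minimal [Word8] -> String UTF8 decoder (illustration only).
decodeUTF8 :: [Word8] -> String
decodeUTF8 [] = []
decodeUTF8 (w:ws)
  | w < 0x80  = chr (fromIntegral w) : decodeUTF8 ws
  | w < 0xE0  = multi 1 (w .&. 0x1F) ws  -- 2-byte sequence
  | w < 0xF0  = multi 2 (w .&. 0x0F) ws  -- 3-byte sequence
  | otherwise = multi 3 (w .&. 0x07) ws  -- 4-byte sequence
  where
    -- Fold n continuation bytes (6 payload bits each) onto the lead bits.
    multi n lead bs =
      let (cont, rest) = splitAt n bs
          code = foldl (\acc b -> (acc `shiftL` 6) .|. fromIntegral (b .&. 0x3F))
                       (fromIntegral lead) cont
      in chr code : decodeUTF8 rest
```

It is this intermediate `[Word8]` that the direct-from-buffer version was intended to eliminate.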
So I implemented it and benchmarked it against the original UTF8 marshaling function, which simply converts CString -> [Word8] -> String. To my surprise, the [Word8] -> String solution seems to be faster, and to use less memory, than the function that creates the String directly from the CString buffer.
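A sketch of what a direct-from-buffer decoder along these lines might look like (hypothetical names, not the actual Takusen implementation): walk backwards from the end of the buffer so each Char is consed straight onto the accumulator, with no intermediate [Word8]. Again no validation of malformed input:

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Char (chr)
import Data.Word (Word8)
import Foreign.Ptr (Ptr)
import Foreign.Storable (peekElemOff)

-- Decode a UTF8 buffer of known byte length, back to front.
fromUTF8Buf :: Ptr Word8 -> Int -> IO String
fromUTF8Buf p len = go (len - 1) []
  where
    go i acc
      | i < 0     = return acc
      | otherwise = do
          j <- leadIndex i         -- index of this sequence's lead byte
          c <- decodeAt j (i - j)  -- lead byte + (i - j) continuations
          go (j - 1) (c : acc)

    -- Skip backwards over 10xxxxxx continuation bytes.
    leadIndex i = do
      w <- peekElemOff p i
      if w .&. 0xC0 == 0x80 && i > 0 then leadIndex (i - 1) else return i

    -- Decode the sequence whose lead byte is at index j,
    -- followed by n continuation bytes.
    decodeAt j n = do
      w0   <- peekElemOff p j
      cont <- mapM (peekElemOff p) [j + 1 .. j + n]
      let lead = case n of
                   0 -> fromIntegral w0
                   1 -> fromIntegral (w0 .&. 0x1F)
                   2 -> fromIntegral (w0 .&. 0x0F)
                   _ -> fromIntegral (w0 .&. 0x07)
          step acc b = (acc `shiftL` 6) .|. fromIntegral (b .&. 0x3F)
      return (chr (foldl step lead cont))
```

You would call this with a buffer and its byte length, e.g. obtained from a CStringLen.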
Now I appreciate that GHC's optimisations are quite effective (presumably deforestation should take credit here), but I thought I'd ask the haskell-cafe optimisers whether we could do better with the direct-from-buffer function. I'm loath to start eyeballing GHC Core, but if needs must...
If you want to look at some existing optimised UTF8 encoding/decoding code, take a look at the code used in GHC: http://darcs.haskell.org/ghc/compiler/utils/Encoding.hs, specifically utf8DecodeString and utf8EncodeString. They're both fairly low-level, dealing with pointers to existing buffers.

In particular, for utf8EncodeString you need to have allocated a buffer of the right size already; you can use utf8EncodedLength to find that if you don't already know it. Also, utf8DecodeString assumes that the end of the string has sentinel bytes, so it might not be directly suitable for your example.

Duncan
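The "allocate a buffer of the right size" step boils down to summing per-character byte counts. Here is a self-contained sketch of that sizing idea (this is not GHC's code, just an illustration of what a function like utf8EncodedLength computes):

```haskell
-- Byte length of a String once UTF8-encoded: 1 byte for ASCII,
-- 2 up to U+07FF, 3 up to U+FFFF, and 4 beyond the BMP.
encodedLength :: String -> Int
encodedLength = sum . map bytes
  where
    bytes c
      | c <= '\x7F'   = 1
      | c <= '\x7FF'  = 2
      | c <= '\xFFFF' = 3
      | otherwise     = 4
```

With the length known up front, the encoder can fill a single exactly-sized buffer instead of growing or copying.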