Optimising UTF8-CString -> String marshaling, plus comments on withCStringLen/peekCStringLen

Hello cafe, D'ya fancy an optimisation exercise? In Takusen we currently marshal UTF8-encoded CStrings by first turning the CString into [word8], and then running this through a [Word8] -> String UTF8 decoder. We thought it would be more space-efficient (and hopefully faster) to marshal directly from the CString buffer, rather than use an intermediate list. We assumed it would be most space-efficient by working backwards from the end of the CString buffer, like the peekArray/peekArray0 functions in Foreign.Marshal.Array. So I implemented it and benchmarked against the original UTF8 marshaling function, which simply converts CString -> [Word] -> String. And to my surprise, the [Word8] -> String solution seems to be faster, and uses less memory, than the function which creates the String directly from the CString buffer. Now I appreciate that GHC's optimisations are quite effective (presumably deforestation should take credit here), but I thought I'd ask the haskell-cafe optimiser if we could do better with the direct-from-buffer function. I'm loath to start eyeballing GHC Core, but if needs must... The code is attached. I also have some comments/questions about the various CStringLen functions in Foreign.C.String. The Haddock comment for CStringLen states: "A string with explicit length information in bytes..." and for CWString something similar: "A wide character string with explicit length information in bytes..." I know this is a blatant lie, though, because the code (unless I've grossly misunderstood it) for newCStringLen, withStringLen, newCWStringLen, and withCWStringLen all return the number of Haskell Chars in the String i.e. the number of Unicode chars, NOT the number of bytes. However, for the sake of inconsistency, the peekC{W}StringLen functions take, respectively, the number of Word8 or Word16/Word32 elements (whether CWString is Word16 or Word32 depends on your plaform, apparently) in the C{W}String array/buffer. So the outputs from newCStringLen etc are not reliably usable as inputs to their duals (peekCStringLen etc.) The only cases that they do work in this way is where the CStrings are encoded with fixed-width encodings i.e. there are no surrogate units in the encoding. So we have three different approaches: 1. Haddock comments say bytes 2. with/newC{W}StringLen returns unicode char count 3. peekC{W}StringLen expects Word8 or Word16 count (1) and (3) can be considered equivalent, in the sense that if you know the number of Word16 units then you know the number of Word8 units, and vice versa. It'd be nice if we could have one consistent aproach. For a start, I think we should eliminate (2), because it's dead easy the get the number of unicode chars from a String (length, if you didn't know). So whether to settle on (1) or (3) depends on what the most likely use case is for the length information. Presumably it's going to be passed to a foreign function which expects the length either in bytes or in Word16/Word32 units. Does anyone have any evidence or opinion as to which case is the most common: bytes, or encoding units? Alistair

On Wed, 2007-05-23 at 10:45 +0100, Alistair Bayley wrote:
Hello cafe,
D'ya fancy an optimisation exercise?
In Takusen we currently marshal UTF8-encoded CStrings by first turning the CString into [word8], and then running this through a [Word8] -> String UTF8 decoder. We thought it would be more space-efficient (and hopefully faster) to marshal directly from the CString buffer, rather than use an intermediate list. We assumed it would be most space-efficient by working backwards from the end of the CString buffer, like the peekArray/peekArray0 functions in Foreign.Marshal.Array.
So I implemented it and benchmarked against the original UTF8 marshaling function, which simply converts CString -> [Word] -> String. And to my surprise, the [Word8] -> String solution seems to be faster, and uses less memory, than the function which creates the String directly from the CString buffer.
Now I appreciate that GHC's optimisations are quite effective (presumably deforestation should take credit here), but I thought I'd ask the haskell-cafe optimiser if we could do better with the direct-from-buffer function. I'm loath to start eyeballing GHC Core, but if needs must...
If you want to look at some existing optimised UTF8 encoding/decoding code then take a look at the code used in GHC: http://darcs.haskell.org/ghc/compiler/utils/Encoding.hs utf8DecodeString and utf8EncodeString. They're both fairly low level, dealing with pointers to existing buffers. In particular for the utf8EncodeString you need to have allocated a buffer of the right size already. You can use utf8EncodedLength to find that if you don't already know it. Also, utf8DecodeString assumes that the end of the string has sentinel bytes so it might not be directly suitable for your example. Duncan

Hello cafe, (Following up on my own optimisation question, and Duncan's advice to look at http://darcs.haskell.org/ghc/compiler/utils/Encoding.hs)
If you want to look at some existing optimised UTF8 encoding/decoding code then take a look at the code used in GHC:
http://darcs.haskell.org/ghc/compiler/utils/Encoding.hs
Duncan
I took a look at the UTF8 decoder in GHC. This inspired me to write one that also used unboxed types directly. Pleasingly, it goes like a cut cat, and uses far less space than the naive version, but it's not portable, which is a bummer. (The docs tell me that using GHC.Exts is the "approved" way of accessing GHC-specific extensions, but all of the useful stuff seems to be in GHC.Prim.) After some expriments with the simplifier, I think I have a portable version of a direct-from-buffer decoder which seems to perform nearly as well as one written directly against GHC primitive unboxed functions. I'm wondering if there's anything further I can do to improve performance. The "portable" unboxed version is within about 15% of the unboxed version in terms of time and allocation. Changes I made: - added strictness "annotations" in the form of a strictness guards that are always False - unrolled loops (they were always short loops anyway, with a maximum of 3 or 4 iterations) - replaced shiftL with multiplication, because multiplication unboxes, while shiftL doesn't. Some things I've noticed in the simplifier output: - the shiftL call hasn't unboxed or inlined into a call to uncheckedShiftL#, which I would prefer. Would this be possible if we added unchecked versions of the shiftL/R functions to Data.Bits? - Ptrs don't get unboxed. Why is this? Some IO monad thing? - the chr function tests that its Int argument is less than 1114111, before constructing the Char. It'd be nice to avoid this test. - why does this code: | x <= 0xF7 = remaining 3 (bAND x 0x07) xs | otherwise = err x turn into this i.e. the <= turns into two identical case-branches, using eqword# and ltword#, rather than one case-branch using leword# ? case GHC.Prim.eqWord# a11_a2PJ __word 247 of wild25_X2SU { GHC.Base.False -> case GHC.Prim.ltWord# a11_a2PJ __word 247 of wild6_Xcw { GHC.Base.False -> <error call> GHC.Base.True -> $wremaining_r3dD 3 (__scc {fromUTF8 main:Foreign.C.UTF8 !} GHC.Base.I# (GHC.Prim.word2Int# (GHC.Prim.and# a11_a2PJ __word 7))) xs_aVm }; GHC.Base.True -> $wremaining_r3dD 3 (__scc {fromUTF8 main:Foreign.C.UTF8 !} GHC.Base.I# (GHC.Prim.word2Int# (GHC.Prim.and# a11_a2PJ __word 7))) xs_aVm }; BTW, what's the difference between the indexXxxxOffAddr# and readXxxxOffAddr# functions in GHC.Prim? AFAICT they are equivalent, except that the read* functions take an extra State# s parameter. Presumably this is to thread the IO monad's RealWorld value through, to create some sort of data dependency between the functions (and so to ensure ordered evaluation?) Alistair

On Mon, 2007-06-04 at 09:43 +0100, Alistair Bayley wrote:
After some expriments with the simplifier, I think I have a portable version of a direct-from-buffer decoder which seems to perform nearly as well as one written directly against GHC primitive unboxed functions. I'm wondering if there's anything further I can do to improve performance. The "portable" unboxed version is within about 15% of the unboxed version in terms of time and allocation.
Well done.
Changes I made: - added strictness "annotations" in the form of a strictness guards that are always False - unrolled loops (they were always short loops anyway, with a maximum of 3 or 4 iterations) - replaced shiftL with multiplication, because multiplication unboxes, while shiftL doesn't.
Some things I've noticed in the simplifier output: - the shiftL call hasn't unboxed or inlined into a call to uncheckedShiftL#, which I would prefer. Would this be possible if we added unchecked versions of the shiftL/R functions to Data.Bits?
I think this was discussed before. Check the archives. I don't think there was any resolution however. As I recall there was some attempt to get the ordinary shiftL/R functions to inline and eliminate the bounds checks when the shifts were constants less than the bounds.
- Ptrs don't get unboxed. Why is this? Some IO monad thing?
Got any more detail?
- the chr function tests that its Int argument is less than 1114111, before constructing the Char. It'd be nice to avoid this test.
Yeah. In Data.ByteString.Char8 we invent this w2c & c2w functions to avoid the test. There should probably be a standard version of this unchecked conversion.
- why does this code:
| x <= 0xF7 = remaining 3 (bAND x 0x07) xs | otherwise = err x
turn into this i.e. the <= turns into two identical case-branches, using eqword# and ltword#, rather than one case-branch using leword# ?
I thought it might be to do with the instance declaration not giving a method for <= but letting it default to being defined in terms of < and ==, however it's hard to say what's really going on, since the instance declaration is just: data Word = W# Word# deriving (Eq, Ord) I have no idea how ghc derives Eq and Ord instances for a type when it contains unboxed components, which cannot themselves be members of a type class.
BTW, what's the difference between the indexXxxxOffAddr# and readXxxxOffAddr# functions in GHC.Prim? AFAICT they are equivalent, except that the read* functions take an extra State# s parameter. Presumably this is to thread the IO monad's RealWorld value through, to create some sort of data dependency between the functions (and so to ensure ordered evaluation?)
Right. So it'd only be safe to use the index ones on immutable arrays because there's no way to enforce sequencing with respect to array writes when using the index version. Duncan

Hello Duncan, Monday, June 4, 2007, 2:25:10 PM, you wrote:
- the chr function tests that its Int argument is less than 1114111, before constructing the Char. It'd be nice to avoid this test.
use unsafeChr or, for portability, smth like this: #ifdef GHC import GHC.Exts (unsafeChr) #else unsafeChr = chr #endif -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

On 04/06/07, Duncan Coutts
On Mon, 2007-06-04 at 09:43 +0100, Alistair Bayley wrote:
After some experiments with the simplifier, ... The "portable" unboxed version is within about 15% of the unboxed version in terms of time and allocation.
Well done.
Of course, that might be saying more about the performance of the unboxed version...
Yeah. In Data.ByteString.Char8 we invent this w2c & c2w functions to avoid the test. There should probably be a standard version of this unchecked conversion.
Bulat suggested unsafeChr from GHC.Exts, but I can't see this. I guess I could roll my own; after all it's just (C# (chr# x)).
BTW, what's the difference between the indexXxxxOffAddr# and readXxxxOffAddr# functions in GHC.Prim?
Right. So it'd only be safe to use the index ones on immutable arrays because there's no way to enforce sequencing with respect to array writes when using the index version.
In this case I'm reading from a CString buffer, which is (hopefully) not changing during the function invocation, and never written to by my code. So presumably it'd be pretty safe to use the index- functions.
- Ptrs don't get unboxed. Why is this? Some IO monad thing?
Got any more detail?
OK. readUTF8Char's transformation starts with this: $wreadUTF8Char_r3de = \ (ww_s33v :: GHC.Prim.Int#) (w_s33x :: GHC.Ptr.Ptr GHC.Word.Word8) -> If we expect it to unbox, I'd expect the Ptr to become Addr#. Later, this (w_s33x) gets unboxed just before it's used: case w_s33x of wild6_a2JM { GHC.Ptr.Ptr a_a2JO -> case GHC.Prim.readWord8OffAddr# @ GHC.Prim.RealWorld a_a2JO 1 s_a2Jf readUTF8Char is called by fromUTF8Ptr, where there's a little Ptr arithmetic. The Ptr argument to fromUTF8Ptr is unboxed, offset is added, and the result is reboxed so that it can be consumed by readUTF8Char. All a bit unnecessary, I think e.g. Foreign.C.UTF8.$wfromUTF8Ptr = ... let { p'_s38N [Just D(T)] :: GHC.Ptr.Ptr GHC.Word.Word8 [Str: DmdType] p'_s38N = __scc {fromUTF8Ptr main:Foreign.C.UTF8 !} case w_s33J of wild11_a2DW { GHC.Ptr.Ptr addr_a2DY -> GHC.Ptr.Ptr @ GHC.Word.Word8 (GHC.Prim.plusAddr# addr_a2DY ww_s33H) } } in ... I'd prefer the Ptr arg to fromUTF8Ptr to also be unboxed, so that the primitive plusAddr# can be used directly on it before it's passed to readUTF8Char. Perhaps instead I could push this Ptr arithmetic down to readUTF8Char, and pass it the constant Ptr to the start of the buffer, and the offset into it, rather than a Ptr to the current position. Alistair

On Mon, 2007-06-04 at 13:12 +0100, Alistair Bayley wrote:
BTW, what's the difference between the indexXxxxOffAddr# and readXxxxOffAddr# functions in GHC.Prim?
Right. So it'd only be safe to use the index ones on immutable arrays because there's no way to enforce sequencing with respect to array writes when using the index version.
In this case I'm reading from a CString buffer, which is (hopefully) not changing during the function invocation, and never written to by my code. So presumably it'd be pretty safe to use the index- functions.
Yes.
- Ptrs don't get unboxed. Why is this? Some IO monad thing?
Got any more detail?
OK. readUTF8Char's transformation starts with this:
$wreadUTF8Char_r3de = \ (ww_s33v :: GHC.Prim.Int#) (w_s33x :: GHC.Ptr.Ptr GHC.Word.Word8) ->
If we expect it to unbox, I'd expect the Ptr to become Addr#. Later, this (w_s33x) gets unboxed just before it's used:
case w_s33x of wild6_a2JM { GHC.Ptr.Ptr a_a2JO -> case GHC.Prim.readWord8OffAddr# @ GHC.Prim.RealWorld a_a2JO 1 s_a2Jf
readUTF8Char is called by fromUTF8Ptr, where there's a little Ptr arithmetic. The Ptr argument to fromUTF8Ptr is unboxed, offset is added, and the result is reboxed so that it can be consumed by readUTF8Char. All a bit unnecessary, I think e.g.
Are you sure fromUTF8Ptr is strict in its ptr arg? Try with a ! pattern on that arg. You'll need -fbang-patterns. That translates into the seq False trick that oy're already using elsewhere. Experimenting by adding ! patterns is much quicker and easier however. Once you've got the right set of strictness annotations you can go back to using the more portable, but ugly seq False trick. You can also get ghc to tell you what strictness it inferred for your functions. It's shown in the .hi file. Use ghc --show-iface UTF8.hi. I think the "UL" syntax for describing the strictness is described in the GHC manual somewhere (or perhaps it's on the GHC wiki). Duncan

On Mon, Jun 04, 2007 at 09:43:32AM +0100, Alistair Bayley wrote:
(The docs tell me that using GHC.Exts is the "approved" way of accessing GHC-specific extensions, but all of the useful stuff seems to be in GHC.Prim.)
All of the useful stuff *is* exported from GHC.Exts, it even says so in the haddock: Synopsis ... module GHC.Prim That is, GHC.Exts exports everything GHC.Prim does. Standard H98 re-export syntax. Besides, user code can't import GHC.Prim at all in GHC HEADs newer than a couple months (arguably a bug, but it only breaks bad code, so...)
Some things I've noticed in the simplifier output: - the shiftL call hasn't unboxed or inlined into a call to uncheckedShiftL#, which I would prefer. Would this be possible if we added unchecked versions of the shiftL/R functions to Data.Bits? - Ptrs don't get unboxed. Why is this? Some IO monad thing?
fromUTF8Ptr unboxes fine for me with HEAD and 6.6.1.
- the chr function tests that its Int argument is less than 1114111, before constructing the Char. It'd be nice to avoid this test.
You want unsafeChr from the (undocumented) GHC.Base module. http://darcs.haskell.org/ghc-6.6/packages/base/GHC/Base.lhs for reference (but don't copy the file, it's already an importable module).
- why does this code:
| x <= 0xF7 = remaining 3 (bAND x 0x07) xs | otherwise = err x
turn into this i.e. the <= turns into two identical case-branches, using eqword# and ltword#, rather than one case-branch using leword# ?
case GHC.Prim.eqWord# a11_a2PJ __word 247 of wild25_X2SU { GHC.Base.False -> case GHC.Prim.ltWord# a11_a2PJ __word 247 of wild6_Xcw { GHC.Base.False -> <error call> GHC.Base.True -> $wremaining_r3dD 3 (__scc {fromUTF8 main:Foreign.C.UTF8 !} GHC.Base.I# (GHC.Prim.word2Int# (GHC.Prim.and# a11_a2PJ __word 7))) xs_aVm }; GHC.Base.True -> $wremaining_r3dD 3 (__scc {fromUTF8 main:Foreign.C.UTF8 !} GHC.Base.I# (GHC.Prim.word2Int# (GHC.Prim.and# a11_a2PJ __word 7))) xs_aVm };
ISTR seeing a bug report about this a while back, we know it's dumb. You could probably use x < 0xF8 instead.
BTW, what's the difference between the indexXxxxOffAddr# and readXxxxOffAddr# functions in GHC.Prim? AFAICT they are equivalent, except that the read* functions take an extra State# s parameter. Presumably this is to thread the IO monad's RealWorld value through, to create some sort of data dependency between the functions (and so to ensure ordered evaluation?)
Exactly. readFoo won't be reordered, indexFoo will - which matters when doing reads and writes at addresses that might alias. Stefan

From: haskell-cafe-bounces@haskell.org [mailto:haskell-cafe-bounces@haskell.org] On Behalf Of Stefan O'Rear
fromUTF8Ptr unboxes fine for me with HEAD and 6.6.1.
- the chr function tests that its Int argument is less than 1114111, before constructing the Char. It'd be nice to avoid this test.
You want unsafeChr from the (undocumented) GHC.Base module. http://darcs.haskell.org/ghc-6.6/packages/base/GHC/Base.lhs for reference (but don't copy the file, it's already an importable module).
FWIW, I've optimised this to a point where I'm happy with it, and you can see the results here: http://darcs.haskell.org/takusen/Foreign/C/UTF8.hs I was using ghc-6.6 back in June, and an upgrade to 6.6.1 fixed some of the issues for me (e.g. unboxing Ptrs, bang-patterns differ from seq). I sent a test case to Simon PJ about the duplicated code in the simplifier output, but I can't tell if it's been added as a trac ticket. Alistair ***************************************************************** Confidentiality Note: The information contained in this message, and any attachments, may contain confidential and/or privileged material. It is intended solely for the person(s) or entity to which it is addressed. Any review, retransmission, dissemination, or taking of any action in reliance upon this information by persons or entities other than the intended recipient(s) is prohibited. If you received this in error, please contact the sender and delete the material from any computer. *****************************************************************

Bayley, Alistair wrote:
From: haskell-cafe-bounces@haskell.org [mailto:haskell-cafe-bounces@haskell.org] On Behalf Of Stefan O'Rear
fromUTF8Ptr unboxes fine for me with HEAD and 6.6.1.
- the chr function tests that its Int argument is less than 1114111, before constructing the Char. It'd be nice to avoid this test. You want unsafeChr from the (undocumented) GHC.Base module. http://darcs.haskell.org/ghc-6.6/packages/base/GHC/Base.lhs for reference (but don't copy the file, it's already an importable module).
FWIW,
I've optimised this to a point where I'm happy with it, and you can see the results here: http://darcs.haskell.org/takusen/Foreign/C/UTF8.hs
In that code you have: | x <= 0x0010FFFF -- should be 0x001FFFFF I wasn't aware that the largest unicode code point had changed. Do you have a reference? Should we change the range of Char in GHC? Cheers, Simon

From: Simon Marlow [mailto:simonmarhaskell@gmail.com]
In that code you have:
| x <= 0x0010FFFF -- should be 0x001FFFFF
I wasn't aware that the largest unicode code point had changed. Do you have a reference? Should we change the range of Char in GHC?
No, that's merely a bad comment. I think it was meant to refer to that fact that the UTF8 encoding will permit codepoints up to 0x001FFFFF with 4 bytes, so if the decoder was to handle the full UTF8 range (up to 6 bytes) then this test would read:
| x <= 0x001FFFFF
Alistair ***************************************************************** Confidentiality Note: The information contained in this message, and any attachments, may contain confidential and/or privileged material. It is intended solely for the person(s) or entity to which it is addressed. Any review, retransmission, dissemination, or taking of any action in reliance upon this information by persons or entities other than the intended recipient(s) is prohibited. If you received this in error, please contact the sender and delete the material from any computer. *****************************************************************

Weird... I sent this over a month ago, and was a bit puzzled as to why it didn't appear on the list. Has it been waiting for a moderator to release?
-----Original Message----- From: haskell-cafe-bounces@haskell.org [mailto:haskell-cafe-bounces@haskell.org] On Behalf Of Alistair Bayley Sent: 04 June 2007 09:44 To: haskell-cafe Cc: Duncan Coutts Subject: Re: [Haskell-cafe] Optimising UTF8-CString -> String marshaling,plus comments on withCStringLen/peekCStringLen
Hello cafe,
(Following up on my own optimisation question, and Duncan's advice to look at http://darcs.haskell.org/ghc/compiler/utils/Encoding.hs)
If you want to look at some existing optimised UTF8 encoding/decoding code then take a look at the code used in GHC:
http://darcs.haskell.org/ghc/compiler/utils/Encoding.hs
Duncan
I took a look at the UTF8 decoder in GHC. This inspired me to write one that also used unboxed types directly. Pleasingly, it goes like a cut cat, and uses far less space than the naive version, but it's not portable, which is a bummer. ...
Confidentiality Note: The information contained in this message, and any attachments, may contain confidential and/or privileged material. It is intended solely for the person(s) or entity to which it is addressed. Any review, retransmission, dissemination, or taking of any action in reliance upon this information by persons or entities other than the intended recipient(s) is prohibited. If you received this in error, please contact the sender and delete the material from any computer. *****************************************************************
participants (6)
-
Alistair Bayley
-
Bayley, Alistair
-
Bulat Ziganshin
-
Duncan Coutts
-
Simon Marlow
-
Stefan O'Rear