
Hello cafe, (Following up on my own optimisation question, and Duncan's advice to look at http://darcs.haskell.org/ghc/compiler/utils/Encoding.hs)
If you want to look at some existing optimised UTF8 encoding/decoding code then take a look at the code used in GHC:
http://darcs.haskell.org/ghc/compiler/utils/Encoding.hs
Duncan
I took a look at the UTF8 decoder in GHC. This inspired me to write one that also used unboxed types directly. Pleasingly, it goes like a cut cat, and uses far less space than the naive version, but it's not portable, which is a bummer. (The docs tell me that using GHC.Exts is the "approved" way of accessing GHC-specific extensions, but all of the useful stuff seems to be in GHC.Prim.) After some expriments with the simplifier, I think I have a portable version of a direct-from-buffer decoder which seems to perform nearly as well as one written directly against GHC primitive unboxed functions. I'm wondering if there's anything further I can do to improve performance. The "portable" unboxed version is within about 15% of the unboxed version in terms of time and allocation. Changes I made: - added strictness "annotations" in the form of a strictness guards that are always False - unrolled loops (they were always short loops anyway, with a maximum of 3 or 4 iterations) - replaced shiftL with multiplication, because multiplication unboxes, while shiftL doesn't. Some things I've noticed in the simplifier output: - the shiftL call hasn't unboxed or inlined into a call to uncheckedShiftL#, which I would prefer. Would this be possible if we added unchecked versions of the shiftL/R functions to Data.Bits? - Ptrs don't get unboxed. Why is this? Some IO monad thing? - the chr function tests that its Int argument is less than 1114111, before constructing the Char. It'd be nice to avoid this test. - why does this code: | x <= 0xF7 = remaining 3 (bAND x 0x07) xs | otherwise = err x turn into this i.e. the <= turns into two identical case-branches, using eqword# and ltword#, rather than one case-branch using leword# ? case GHC.Prim.eqWord# a11_a2PJ __word 247 of wild25_X2SU { GHC.Base.False -> case GHC.Prim.ltWord# a11_a2PJ __word 247 of wild6_Xcw { GHC.Base.False -> <error call> GHC.Base.True -> $wremaining_r3dD 3 (__scc {fromUTF8 main:Foreign.C.UTF8 !} GHC.Base.I# (GHC.Prim.word2Int# (GHC.Prim.and# a11_a2PJ __word 7))) xs_aVm }; GHC.Base.True -> $wremaining_r3dD 3 (__scc {fromUTF8 main:Foreign.C.UTF8 !} GHC.Base.I# (GHC.Prim.word2Int# (GHC.Prim.and# a11_a2PJ __word 7))) xs_aVm }; BTW, what's the difference between the indexXxxxOffAddr# and readXxxxOffAddr# functions in GHC.Prim? AFAICT they are equivalent, except that the read* functions take an extra State# s parameter. Presumably this is to thread the IO monad's RealWorld value through, to create some sort of data dependency between the functions (and so to ensure ordered evaluation?) Alistair