Re: [Haskell-cafe] Optimising UTF8-CString -> String marshaling, plus comments on withCStringLen/peekCStringLen

5 Jun 2007

      ...
...
{- Arity: 4 Strictness: LSSL -}
Right. Unboxed args are always given the annotation L. So that function
is strict in that pointer arg, but GHC is choosing not to unbox it. I'm
not sure why that's the case.
I thought maybe it was because I hadn't said -funbox-strict-fields,
but it didn't change when I did.
...
...
Is there some semantic advantage to bang-patterns, or is it just a
syntactic convenience?
It's syntactic convenience.
I've noticed differences between strictness guards and bang-patterns
with GHC. This is a problem, I think, because bang-patterns are GHC
only, and I wanted to keep the code portable. This strictness guard:

readUTF8Char :: Int -> Int -> Ptr Word8 -> IO Char
readUTF8Char x offset p
  | () !x !offset !p !False = undefined
  | otherwise =
  ...
  where
    x ! y = seq x y

results in this simplifier output:

[Arity 3
 Str: DmdType LSS]
$wreadUTF8Char_r38I =
  \ (ww_s30B :: GHC.Prim.Int#)
    (w_s30D :: GHC.Base.Int)
    (w1_s30E :: GHC.Ptr.Ptr GHC.Word.Word8) ->

However, if I change this to:

{-# OPTIONS_GHC -fbang-patterns #-}

...

readUTF8Char :: Int -> Int -> Ptr Word8 -> IO Char
readUTF8Char !x !offset !p
  | otherwise =

then I get this simpifier output:

[Arity 3
 Str: DmdType LLL]
$wreadUTF8Char_r38n =
  \ (ww_s2Zk :: GHC.Prim.Int#)
    (ww1_s2Zo :: GHC.Prim.Int#)
    (ww2_s2Zs :: GHC.Prim.Addr#) ->

Also, with bang-patterns I've noticed that fromUTF8Ptr transforms
into two functions, which contain very similar code:

[Arity 5]
Foreign.C.UTF8.$s$wfromUTF8Ptr =
  \ (acc_X151 :: GHC.Base.String)
    (new_s_a2GQ :: GHC.Prim.State# GHC.Prim.RealWorld)
    (a87_a2GR :: GHC.Base.Char)
    (ww_s2ZH :: GHC.Prim.Addr#)
    (sc_s36j :: GHC.Prim.Int#) ->

[Arity 4
 Str: DmdType LLSL]
Foreign.C.UTF8.$wfromUTF8Ptr =
  \ (ww_s2ZD :: GHC.Prim.Int#)
    (ww1_s2ZH :: GHC.Prim.Addr#)
    (w_s2ZJ :: GHC.Base.String)
    (w1_s2ZK :: GHC.Prim.State# GHC.Prim.RealWorld) ->

$wfromUTF8Ptr calls itself and $s$wfromUTF8Ptr, but $s$wfromUTF8Ptr
only every calls itself, so $wfromUTF8Ptr could be considered the
wrapper, and $s$wfromUTF8Ptr the worker, I guess.

AFAICT it's a transformation of the various cases in fromUTF8Ptr:

      | x <= 0x7F -> fromUTF8Ptr (bytes-1) p (chr x:acc)
      | x <= 0xBF && bytes == 0 -> error "fromUTF8Ptr: ..."
      | x <= 0xBF -> fromUTF8Ptr (bytes-1) p acc
      | otherwise -> do
          c <- readUTF8Char x bytes p
          fromUTF8Ptr (bytes-1) p (c:acc)

The first case, x <= 0x7F, results in a call to $s$wfromUTF8Ptr.
The third case, x <= 0xBF, results in a call to $wfromUTF8Ptr.
The last case, otherwise, results in a call to $s$wfromUTF8Ptr.

The calls to $s$wfromUTF8Ptr pass the newly constructed Char and the
rest of the String separately, and they are cons'ed in $s$wfromUTF8Ptr.
Not sure what benefit this gives...

I don't know what transformation causes this, but it was a bit of a surprise.

Alistair

Re: [Haskell-cafe] Optimising UTF8-CString -> String marshaling, plus comments on withCStringLen/peekCStringLen

Alistair Bayley