
If you read the source code, length does not read the data; that's why it is so fast. The same cannot be done for UTF-8 strings.
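A minimal sketch of the contrast (byteCount and charCount are my own names, not from this thread): a lazy ByteString's length only walks the chunk spine and sums chunk sizes, while a UTF-8 character count has to look at every byte so it can skip continuation bytes.

import qualified Data.ByteString.Lazy as L
import Data.Bits ((.&.))
import Data.Int (Int64)

-- Byte count: sums the sizes of the chunks, never inspects the bytes themselves.
byteCount :: L.ByteString -> Int64
byteCount = L.length

-- UTF-8 character count: must read every byte, counting only those
-- that start a character (anything that is not a 10xxxxxx continuation byte).
charCount :: L.ByteString -> Int64
charCount = L.foldl' step 0
  where
    step n w
      | w .&. 0xC0 == 0x80 = n       -- continuation byte of a multibyte char
      | otherwise          = n + 1   -- ASCII byte or lead byte of a char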
I think at this point most of the amazement is directed at Data.Text being slower than good old [Char] (at least for this operation; we should probably expand our view to more than one operation).
Hey, a normal String is way faster than GNU wc!
No - you need to perform a fair comparison. Try "wc -c" so it only counts characters (not lines and words too). I'd provide numbers, but my wc doesn't seem to support UTF-8, and I'm not sure what package contains a Unicode-aware wc.
import qualified Data.ByteString.Lazy as L
import Data.Int (Int64)

-- Look only at the first byte and derive the byte length of the leading
-- UTF-8 character from it.
readChar :: L.ByteString -> Maybe Int64
readChar bs = do
  (c, _) <- L.uncons bs
  return (choose (fromEnum c))
  where
    choose :: Int -> Int64
    choose c
      | c < 0xc0  = 1
      | c < 0xe0  = 2
      | c < 0xf0  = 3
      | c < 0xf8  = 4
      | otherwise = 1
Inspired by Data.ByteString.Lazy.UTF8; same performance as GNU wc (it is cheating because it does not check the validity of the multibyte characters).
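A minimal usage sketch (countChars is my name, not from the post): count characters by repeatedly asking readChar for the byte length of the next character and skipping that many bytes.

-- Hypothetical driver for readChar above: walk the ByteString,
-- stepping over whole UTF-8 characters and counting them.
countChars :: L.ByteString -> Int64
countChars = go 0
  where
    go n bs = case readChar bs of
      Nothing  -> n
      Just len -> let n' = n + 1
                  in n' `seq` go n' (L.drop len bs)   -- strict count, skip the char's bytes

On well-formed UTF-8 this gives the same count as a character-aware wc, but, like readChar, it does not validate the input.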
Ah, interesting and a worthwhile cheat. Thomas