
If you read the source code, length does not read the data; that's why it is so fast. The same cannot be done for UTF-8 strings.
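A minimal sketch of the contrast (byteCount and charCount are my own names, not from this thread): a lazy ByteString's length only walks the chunk spine and sums chunk sizes, while a UTF-8 character count has to look at every byte so it can skip continuation bytes.

import qualified Data.ByteString.Lazy as L
import Data.Bits ((.&.))
import Data.Int (Int64)

-- Byte count: sums the sizes of the chunks, never inspects the bytes themselves.
byteCount :: L.ByteString -> Int64
byteCount = L.length

-- UTF-8 character count: must read every byte, counting only those
-- that start a character (anything that is not a 10xxxxxx continuation byte).
charCount :: L.ByteString -> Int64
charCount = L.foldl' step 0
  where
    step n w
      | w .&. 0xC0 == 0x80 = n       -- continuation byte of a multibyte char
      | otherwise          = n + 1   -- ASCII byte or lead byte of a char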
I think at this point most of the amazement is directed at Data.Text being slower than good old [Char] (at least for this operation; we should probably expand our view to more than one operation).
Hey, a normal String is way faster than GNU wc!
No - you need to perform a fair comparison. Try "wc -c" so it only counts characters (not lines and words too). I'd provide numbers, but my wc doesn't seem to support UTF-8, and I'm not sure what package contains a Unicode-aware wc.
import qualified Data.ByteString.Lazy as L
import Data.Int (Int64)

-- Look only at the first byte and derive the byte length of the leading
-- UTF-8 character from it.
readChar :: L.ByteString -> Maybe Int64
readChar bs = do
  (c, _) <- L.uncons bs
  return (choose (fromEnum c))
  where
    choose :: Int -> Int64
    choose c
      | c < 0xc0  = 1
      | c < 0xe0  = 2
      | c < 0xf0  = 3
      | c < 0xf8  = 4
      | otherwise = 1
Inspired by Data.ByteString.Lazy.UTF8; same performance as GNU wc (it is cheating because it does not check the validity of the multibyte characters).
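A minimal usage sketch (countChars is my name, not from the post): count characters by repeatedly asking readChar for the byte length of the next character and skipping that many bytes.

-- Hypothetical driver for readChar above: walk the ByteString,
-- stepping over whole UTF-8 characters and counting them.
countChars :: L.ByteString -> Int64
countChars = go 0
  where
    go n bs = case readChar bs of
      Nothing  -> n
      Just len -> let n' = n + 1
                  in n' `seq` go n' (L.drop len bs)   -- strict count, skip the char's bytes

On well-formed UTF-8 this gives the same count as a character-aware wc, but, like readChar, it does not validate the input.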
Ah, interesting and a worthwhile cheat. Thomas