
Hi!
On Fri, May 30, 2008 at 10:38 AM, Ketil Malde wrote:
"Johan Tibell" writes: The intent of the not-yet-existing Unicode string is to represent text, not bytes.
Right, so this will replace the .Char8 modules as well? What confused me was my misunderstanding Duncan to mean that Unicode text would somehow imply shorter strings than non-Unicode (i.e. 8-bit) text.
Yes.
To give just one example, short (Unicode) strings are common as keys in associative data structures like maps
I guess typically, you'd break things down to words, so strings of length 4-10 or so. BS uses three words and LBS four (IIRC), so the cost of sharing typically outweighs the benefit.
I'm not sure if you would have much sharing in a map as the keys will be unique.
Can I also here insert a plea for keeping lazy I/O out of the new Unicode module?
I use ByteString.Lazy almost exclusively. I realize there's a penalty in time and space, but the ability to write applications that stream over multi-GB files is essential.
Lazy I/O comes with a penalty in terms of correctness! Pretending that I/O and the underlying resource allocations (e.g. file handles) aren't observable is bad. Lazy I/O is kinda, maybe usable for small scripts that read a file or two and spit out a result, but for servers it doesn't work at all. Lazy I/O requires unsafe* functions and is, therefore, unsafe. The finalizers required can be arbitrarily complex depending on what kind of resources need to be allocated. The simple case is a file handle, but there's no reason we might not need sockets, locks, etc. to create the lazy ByteString.

Here are two possible interfaces for safe I/O. One is a stream-based one with explicit close and the other a fold-based one (i.e. inversion of control):
import qualified Data.ByteString as S
import Data.Word (Word8)

-- Stream based I/O.
class InputStream s where
    read  :: s -> IO Word8
    readN :: s -> Int -> IO S.ByteString  -- efficient block reads
    close :: s -> IO ()

openBinaryFile :: InputStream s => FilePath -> IO s
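To make the stream-based interface a bit more concrete, here is a minimal sketch of one possible instance backed by a plain file handle. FileStream and openFileStream are names made up for illustration, and raising an EOF IOError from read is an assumption, since the signature above doesn't say what should happen at end of file. Prelude's read is hidden to avoid the name clash.

```haskell
import Prelude hiding (read)
import qualified Data.ByteString as S
import Data.Word (Word8)
import qualified System.IO as IO
import System.IO.Error (eofErrorType, mkIOError)

-- The class as proposed above.
class InputStream s where
    read  :: s -> IO Word8
    readN :: s -> Int -> IO S.ByteString  -- efficient block reads
    close :: s -> IO ()

-- Hypothetical wrapper around a file handle.
newtype FileStream = FileStream IO.Handle

instance InputStream FileStream where
    -- Assumption: signal end of file with an IOError.
    read (FileStream h) = do
        bs <- S.hGet h 1
        if S.null bs
            then ioError (mkIOError eofErrorType "read" (Just h) Nothing)
            else return (S.head bs)
    -- Returns fewer than n bytes (possibly none) near end of file.
    readN (FileStream h) n = S.hGet h n
    -- The caller decides when the handle is released.
    close (FileStream h) = IO.hClose h

-- Hypothetical monomorphic stand-in for the polymorphic openBinaryFile.
openFileStream :: FilePath -> IO FileStream
openFileStream path = fmap FileStream (IO.openBinaryFile path IO.ReadMode)
```

The point is that the handle's lifetime is explicit: nothing is read after close, and nothing stays open behind the caller's back.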
The alternative is a left fold over the file's content. The 'foldBytes' function can close the file at EOF.
-- Left fold/callback based I/O.
foldBytes :: FilePath -> (seed -> Word8 -> Either seed seed) -> seed -> IO seed

-- Efficient block reads.
foldChunks :: FilePath -> (seed -> S.ByteString -> Either seed seed) -> seed -> IO seed
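As a rough sketch of how these two folds might be implemented on top of strict ByteString reads: the chunk size is an arbitrary value picked for illustration, and reading Left as "stop early" and Right as "continue" is an assumption, since the signatures alone don't fix that convention. The openBinaryFile used here is the one from System.IO.

```haskell
import qualified Data.ByteString as S
import Control.Exception (bracket)
import Data.Word (Word8)
import System.IO (IOMode (ReadMode), hClose, openBinaryFile)

-- Hypothetical chunk size, not part of the proposed interface.
chunkSize :: Int
chunkSize = 32 * 1024

-- Assumption: Left stops the fold early, Right continues with a new seed.
-- bracket closes the handle at EOF, on early exit and on exceptions.
foldChunks :: FilePath -> (seed -> S.ByteString -> Either seed seed) -> seed -> IO seed
foldChunks path step seed0 =
    bracket (openBinaryFile path ReadMode) hClose (go seed0)
  where
    go seed h = do
        chunk <- S.hGet h chunkSize
        if S.null chunk
            then return seed                      -- end of file
            else case step seed chunk of
                Left done  -> return done         -- early exit
                Right next -> next `seq` go next h

-- Byte-wise fold expressed in terms of the chunked one.
foldBytes :: FilePath -> (seed -> Word8 -> Either seed seed) -> seed -> IO seed
foldBytes path step = foldChunks path stepChunk
  where
    stepChunk seed chunk = goBytes seed (S.unpack chunk)
    goBytes seed []       = Right seed
    goBytes seed (w : ws) = case step seed w of
        Left done  -> Left done
        Right next -> next `seq` goBytes next ws
```

Counting the bytes in a file would then be something like foldChunks path (\n c -> Right (n + S.length c)) 0, and the file is closed by the time the result is returned.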
On top of this, you might want monadic versions of the above two functions. The case for a Unicode type is analogous.
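A monadic variant could look roughly like the sketch below, where the step function runs in IO. The names foldChunksM and copyFileCounting are hypothetical, the 4096-byte chunk size is arbitrary, and Left is again assumed to mean "stop early".

```haskell
import qualified Data.ByteString as S
import Control.Exception (bracket)
import System.IO (IOMode (ReadMode, WriteMode), hClose, openBinaryFile, withBinaryFile)

-- Hypothetical monadic fold: the step may perform I/O for each chunk.
foldChunksM :: FilePath -> (seed -> S.ByteString -> IO (Either seed seed)) -> seed -> IO seed
foldChunksM path step seed0 =
    bracket (openBinaryFile path ReadMode) hClose (go seed0)
  where
    go seed h = do
        chunk <- S.hGet h 4096
        if S.null chunk
            then return seed                      -- end of file
            else do
                result <- step seed chunk
                case result of
                    Left done  -> return done     -- early exit
                    Right next -> next `seq` go next h

-- Usage sketch: copy a file chunk by chunk and return the bytes written.
copyFileCounting :: FilePath -> FilePath -> IO Int
copyFileCounting src dst =
    withBinaryFile dst WriteMode $ \out -> foldChunksM src (writeChunk out) 0
  where
    writeChunk out n chunk = do
        S.hPut out chunk
        return (Right (n + S.length chunk))
```

Both handles are released when copyFileCounting returns, and only one chunk is held in memory at a time, so this still streams over large files.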
Of course, these applications couldn't care less about Unicode, so perhaps the usage is different.
The issue of lazy I/O is orthogonal to ByteString vs Unicode(String).

Cheers,
Johan