Re: [Haskell-cafe] Copying Arrays

30 May 2008


      Hi!

On Fri, May 30, 2008 at 10:38 AM, Ketil Malde wrote:
...
"Johan Tibell" writes:
...
The intent of the not-yet-existing Unicode string is to represent
text not bytes.
Right, so this will replace the .Char8 modules as well?  What confused
me was my misunderstanding Duncan to mean that Unicode text would
somehow imply shorter strings than non-Unicode (i.e. 8-bit) text.
Yes.
...
...
To give just one example, short (Unicode) strings are common as keys
in associative data structures like maps
I guess typically, you'd break things down to words, so strings of
length 4-10 or so.  BS uses three words and LBS four (IIRC), so the
cost of sharing typically outweighs the benefit.
I'm not sure if you would have much sharing in a map as the keys will be unique.
...
...
Can I also here insert a plea for keeping lazy I/O out of the new
Unicode module?
I use ByteString.Lazy almost exclusively.  I realize there's a
penalty in time and space, but the ability to write applications that
stream over multi-GB files is essential.
Lazy I/O comes with a penalty in terms of correctness! Pretending that
I/O and the underlying resource allocations (e.g. file handles) aren't
observable is bad. Lazy I/O is kinda, maybe usable for small scripts
that read a file or two and spit out a result, but for servers it
doesn't work at all. Lazy I/O requires unsafe* functions and is
therefore unsafe. The finalizers required can be arbitrarily complex,
depending on what kinds of resources need to be allocated. The simple
case is a file handle, but there's no reason we might not need sockets,
locks, etc. to create the lazy ByteString. Here are two possible
interfaces for safe I/O. One is stream based with an explicit close
and the other is fold based (i.e. inversion of control):
...
import Prelude hiding (read)  -- avoid a clash with the class method below
import qualified Data.ByteString as S
import Data.Word (Word8)
-- Stream based I/O.
class InputStream s where
  read :: s -> IO Word8
  readN :: s -> Int -> IO S.ByteString  -- efficient block reads
  close :: s -> IO ()
openBinaryFile :: InputStream s => FilePath -> IO s
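To make the stream interface a bit more concrete, here is a rough
sketch of one possible instance backed by a file handle. The names
FileStream, openFileStream and withFileStream are made up for this
example and are not part of any library. Raising an IOError from
'read' at end of input is also just an assumption, since the signature
above doesn't say how EOF should be reported.

```haskell
import Prelude hiding (read)
import Control.Exception (bracket)
import qualified Data.ByteString as S
import Data.Word (Word8)
import qualified System.IO as IO

-- The class from the message, repeated so the sketch is self-contained.
class InputStream s where
  read  :: s -> IO Word8
  readN :: s -> Int -> IO S.ByteString  -- efficient block reads
  close :: s -> IO ()

-- Hypothetical wrapper around a plain file handle.
newtype FileStream = FileStream IO.Handle

instance InputStream FileStream where
  -- Assumption: end of input is reported as an IOError.
  read (FileStream h) = do
    bs <- S.hGet h 1
    if S.null bs
      then ioError (userError "read: end of stream")
      else return (S.head bs)
  -- Returns fewer than n bytes (possibly none) near end of file.
  readN (FileStream h) n = S.hGet h n
  close (FileStream h)   = IO.hClose h

-- Hypothetical, monomorphic stand-in for 'openBinaryFile'.
openFileStream :: FilePath -> IO FileStream
openFileStream path = FileStream <$> IO.openBinaryFile path IO.ReadMode

-- The stream is closed when the action finishes, even if it throws.
withFileStream :: FilePath -> (FileStream -> IO a) -> IO a
withFileStream path = bracket (openFileStream path) close
```

The point of the explicit close is that the lifetime of the handle is
visible in the program: with something like withFileStream the file is
released at a known place instead of whenever a finalizer happens to
run.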
The alternative is a left fold over the file's content. The 'foldBytes'
function can close the file at EOF.
...
-- Left fold/callback based I/O.
foldBytes :: FilePath -> (seed -> Word8 -> Either seed seed) -> seed -> IO seed
-- Efficient block reads.
foldChunks :: FilePath -> (seed -> S.ByteString -> Either seed seed) -> seed -> IO seed
On top of this you might want monadic versions of the above two
functions. The case for a Unicode type is analogous.
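For what it's worth, the two fold signatures could be implemented
along the following lines. This is only a sketch under a few
assumptions that the signatures themselves don't state: Left is taken
to mean "stop now with this seed" and Right to mean "keep going", the
32 KB chunk size is arbitrary, and the accumulator is only forced to
weak head normal form between steps.

```haskell
import qualified Data.ByteString as S
import Data.Word (Word8)
import System.IO (IOMode (ReadMode), withBinaryFile)

-- Left fold over the file in blocks. 'withBinaryFile' closes the
-- handle at EOF, on early exit and if an exception is thrown.
foldChunks :: FilePath -> (seed -> S.ByteString -> Either seed seed)
           -> seed -> IO seed
foldChunks path step seed0 = withBinaryFile path ReadMode (go seed0)
  where
    chunkSize = 32 * 1024 :: Int  -- arbitrary block size
    go seed h = do
      chunk <- S.hGetSome h chunkSize
      if S.null chunk
        then return seed                      -- EOF
        else case step seed chunk of
               Left done  -> return done      -- assumed: Left stops the fold
               Right next -> next `seq` go next h

-- Byte-wise fold built on top of the block fold.
foldBytes :: FilePath -> (seed -> Word8 -> Either seed seed)
          -> seed -> IO seed
foldBytes path step = foldChunks path stepChunk
  where
    stepChunk seed chunk = case S.uncons chunk of
      Nothing        -> Right seed            -- chunk used up, ask for more
      Just (w, rest) -> case step seed w of
        Left done  -> Left done               -- propagate the early stop
        Right next -> next `seq` stepChunk next rest
```

Counting the bytes in a file is then something like
foldChunks path (\n c -> Right (n + S.length c)) 0, and the handle
never escapes to the caller, so there is nothing to forget to close.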
...
Of course, these applications couldn't care less about Unicode, so
perhaps the usage is different.
The issue of lazy I/O is orthogonal to ByteString vs Unicode(String).

Cheers,

Johan