
John Goerzen writes:
Given that the block-oriented approach has constant space requirements, I am fairly confident it would save memory.
Perhaps a bit, but not a significant amount.
I see.
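
To make concrete what I mean by the block-oriented approach, here is a minimal sketch that processes its input through one fixed-size buffer, so its memory use stays constant no matter how large the input is. The 4 KB chunk size and the newline-counting task are only illustrative, not taken from the earlier messages.

import System.IO
import Foreign.Marshal.Alloc ( allocaBytes )
import Foreign.Marshal.Array ( peekArray )
import Foreign.Ptr ( Ptr )
import Data.Word ( Word8 )

-- Count newline bytes in fixed-size chunks.  Memory use is bounded by
-- the chunk size, no matter how large the input is.
countNewlines :: Handle -> IO Int
countNewlines h = allocaBytes chunkSize (go 0)
  where
    chunkSize = 4096                      -- illustrative block size

    go :: Int -> Ptr Word8 -> IO Int
    go acc ptr = do
      rc <- hGetBuf h ptr chunkSize       -- returns 0 at end of file
      if rc == 0
        then return acc
        else do
          bytes <- peekArray rc ptr
          let acc' = acc + length (filter (== 10) bytes)
          acc' `seq` go acc' ptr

main :: IO ()
main = do
  h <- openBinaryFile "/etc/profile" ReadMode
  countNewlines h >>= print
  hClose h
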
[Reading/processing blocks] would likely just make the code a lot more complex. [...]
Either your algorithm can process the input in blocks or it cannot. If it can, it doesn't make one bit of difference whether you do the I/O in blocks, because your algorithm processes blocks anyway.
Yes it does. If you don't set block buffering, GHC will call read() separately for *every* single character.
I was referring to the alleged complication of the code, not to whether the handle's 'BufferMode' influences performance.
(I've straced stuff!)
How many read(2) calls does this code need?

import System.IO
import Control.Monad ( when )
import Foreign.Marshal.Array ( allocaArray, peekArray )
import Data.Word ( Word8 )

main :: IO ()
main = do
  h <- openBinaryFile "/etc/profile" ReadMode
  hSetBuffering h NoBuffering
  n <- fmap cast (hFileSize h)
  buf <- allocaArray n $ \ptr -> do
    rc <- hGetBuf h ptr n
    when (rc /= n) (fail "huh?")
    buf' <- peekArray n ptr :: IO [Word8]
    return (map cast buf')
  putStr buf
  hClose h

cast :: (Enum a, Enum b) => a -> b
cast = toEnum . fromEnum

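(If you strace that, you should see the file contents arrive in a single read(2), or at worst a handful of them, because hGetBuf fills the caller's buffer directly; the NoBuffering setting does not turn it into one read per character.)
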
It's a lot more efficient if you set block buffering in your input, even if you are using interact and lines or words to process it.
Of course it is. Which is why an I/O-bound algorithm should process blocks: it's more efficient, and it uses slightly less memory, too. Although I have been told that's not a significant amount.

Peter
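
P.S.: For completeness, the interact-and-words style you mention, with the buffering set explicitly. The word count is just a placeholder body; the single hSetBuffering line is the only thing one would change to compare the two buffering modes under strace.

import System.IO

-- Count the words on stdin.  The processing code is unaffected by the
-- buffering mode; only the number of read(2) calls the runtime makes
-- changes when you swap BlockBuffering for NoBuffering here.
main :: IO ()
main = do
  hSetBuffering stdin (BlockBuffering Nothing)
  interact (\s -> show (length (words s)) ++ "\n")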