
jeff:
Alright, I've been hacking away at what I posted the other day, and I have something that works for files that will fit entirely into memory. And then I figured out why I've been restricted to files that fit entirely into memory... One of my functions is causing the entire thing to be read in, when, in the way I analyze the program, only a very small portion of the file should be read in. Here are the functions I've used to test this problem...
import Data.Bits import qualified Data.ByteString.Lazy as BS import Foreign.C.Types ...
{-# INLINE decodeLengthBits #-} decodeLengthBits :: BS.ByteString -> CInt decodeLengthBits doc = (shift (pieces !! 3) 24) .|. (shift (pieces !! 2) 16) .|. (shift (pieces !! 1) 8) .|. (pieces !! 0) where pieces::[CInt] = map fromIntegral . BS.unpack . BS.take 4 $ doc
breakIntoDocuments :: RawDocument -> [RawDocument] breakIntoDocuments f | BS.length f > 0 = if len > 0
Argh! -- | /O(n\/c)/ 'length' returns the length of a ByteString as an 'Int64' length :: ByteString -> Int64 length (LPS ss) = L.foldl' (\n ps -> n + fromIntegral (P.length ps)) 0 ss that'll force the whole file (it is the sum of the length of each chunk). Try comparing against the null bytestring, -- | /O(1)/ Test whether a ByteString is empty. null :: ByteString -> Bool null (LPS []) = True null (_) = False :-)
then (BS.take bytes f) : (breakIntoDocuments (BS.drop bytes f)) else (breakIntoDocuments (BS.drop bytes f)) | otherwise = [] where len = decodeLengthBits f bytes = fromIntegral (len * 2 + len * 4 + 4)
and a main function of:
main = do f <- B.readFile "Documents.bin" print (take 1 (breakIntoDocuments f))
Shouldn't the program only read in enough of the lazy byte-string to create the first return value of breakIntoDocuments? The return value of decodeLengthBits is only 277. I watched it, and it's reading in my whole 2gb file...
-- Jeff
Got to be more lazy :-) -- Don