Re: [Haskell-cafe] Lazy IO and closing of file handles

15 Mar 2007


      ...
When using readFile to process a large number of files, I am exceeding
the resource limits for the maximum number of open file descriptors on
my system.  How can I enhance my program to deal with this situation
without making significant changes?
..
 1. Read contents of all files returning a list of Strings
 2. Map over the list and parse each String as an Email
 3. Sort the list of Emails
 4. Group the list of Emails by Subject
 5. Map over the grouped list to create a list of Threads
 6. Finally, print the list of Threads
this logical pipeline is actually executed as driven by demand from strict i/o 
operations, such as 6, which opens many files early, processing them late;
this leaves many files opened and partly read, grabbing file resources like mad.
as others have pointed out, readFile's input resources can be freed when 
either the file has been read to the end, or all references to readFile's output
have been dropped.
...
How do I go about fixing this without making a significant number of
changes to my program?  Did I make a mistake in steps 1 and 2 above?
Should I have read and parsed a single file at a time, and then move
on to the next?
in order to keep the overall structure, one could move readFile backwards
and parseEmail forwards in the pipeline, until the two meet. then make sure
that parseEmail completely constructs the internal representation of each
email, thereby keeping no implicit references to the external representation.

hth,
claus
...
module Main where
import Control.Monad (filterM, liftM)
import Data.List
import Data.Maybe
import System.Directory
import System.Environment
import System.IO
type From    = String
type Subject = String
data Email   = Email {from :: From, subject :: Subject} deriving Show
data Thread  = Thread [Email]
instance Show Thread where
    show (Thread emails@(e:es)) = title ++ senders
        where
          title   = newline . bolder . subject $ e
          sender  = newline . indent . from
          senders = concatMap sender emails
          newline = (++ "\n")
          indent  = ("  " ++)
          bolder  = ("\27[0;32;40m" ++) . (++ "\27[0m")
main =
    getArgs                 >>=
    mapM filesOfDirectory   >>=
    mapM (mapM processFile) >>=
    mapM_ print . threadEmails . concat
filesOfDirectory :: FilePath -> IO [String]
filesOfDirectory dir =
    fmap (map ((dir++"/")++)) (getDirectoryContents dir) >>=
    filterM doesFileExist         -- ignore directories
processFile path = do
  text <- readFile path
  return $! (parseEmail text)
parseEmail :: String -> Email
parseEmail text = 
    (Email $! (getHeader "From")) $! (getHeader "Subject")
        where
          strictly s= length s `seq` s
          getHeader = strictly . fromMaybe "N/A" . flip lookup headers
          headers   = concatMap mkassoc . takeWhile (/="") $ lines text
          mkassoc s = case findIndex (==':') s of
                        Just n  -> [(take n s, drop (n+2) s)]
                        Nothing -> []
threadEmails :: [Email] -> [Thread]
threadEmails =
    map Thread . groupBy (fuzzy (==)) . sortBy (fuzzy compare)
        where
          fuzzy fn e e' = stripReFwd (subject e) `fn` stripReFwd (subject e')
          stripReFwd    = stripSpaces . reverse . stripToColon . reverse
          stripSpaces   = dropWhile (==' ')
          stripToColon  = takeWhile (/=':')

Re: [Haskell-cafe] Lazy IO and closing of file handles

Claus Reinke