
When using readFile to process a large number of files, I am exceeding the resource limits for the maximum number of open file descriptors on my system. How can I enhance my program to deal with this situation without making significant changes? .. 1. Read contents of all files returning a list of Strings 2. Map over the list and parse each String as an Email 3. Sort the list of Emails 4. Group the list of Emails by Subject 5. Map over the grouped list to create a list of Threads 6. Finally, print the list of Threads
this logical pipeline is actually executed as driven by demand from strict i/o operations, such as 6, which opens many files early, processing them late; this leaves many files opened and partly read, grabbing file resources like mad. as others have pointed out, readFile's input resources can be freed when either the file has been read to the end, or all references to readFile's output have been dropped.
How do I go about fixing this without making a significant number of changes to my program? Did I make a mistake in steps 1 and 2 above? Should I have read and parsed a single file at a time, and then move on to the next?
in order to keep the overall structure, one could move readFile backwards and parseEmail forwards in the pipeline, until the two meet. then make sure that parseEmail completely constructs the internal representation of each email, thereby keeping no implicit references to the external representation. hth, claus
module Main where
import Control.Monad (filterM, liftM) import Data.List import Data.Maybe import System.Directory import System.Environment import System.IO
type From = String type Subject = String data Email = Email {from :: From, subject :: Subject} deriving Show data Thread = Thread [Email]
instance Show Thread where show (Thread emails@(e:es)) = title ++ senders where title = newline . bolder . subject $ e sender = newline . indent . from senders = concatMap sender emails newline = (++ "\n") indent = (" " ++) bolder = ("\27[0;32;40m" ++) . (++ "\27[0m")
main = getArgs >>= mapM filesOfDirectory >>= mapM (mapM processFile) >>= mapM_ print . threadEmails . concat
filesOfDirectory :: FilePath -> IO [String] filesOfDirectory dir = fmap (map ((dir++"/")++)) (getDirectoryContents dir) >>= filterM doesFileExist -- ignore directories
processFile path = do text <- readFile path return $! (parseEmail text)
parseEmail :: String -> Email parseEmail text = (Email $! (getHeader "From")) $! (getHeader "Subject") where strictly s= length s `seq` s getHeader = strictly . fromMaybe "N/A" . flip lookup headers headers = concatMap mkassoc . takeWhile (/="") $ lines text mkassoc s = case findIndex (==':') s of Just n -> [(take n s, drop (n+2) s)] Nothing -> []
threadEmails :: [Email] -> [Thread] threadEmails = map Thread . groupBy (fuzzy (==)) . sortBy (fuzzy compare) where fuzzy fn e e' = stripReFwd (subject e) `fn` stripReFwd (subject e') stripReFwd = stripSpaces . reverse . stripToColon . reverse stripSpaces = dropWhile (==' ') stripToColon = takeWhile (/=':')