
When using readFile to process a large number of files, I am exceeding the resource limits for the maximum number of open file descriptors on my system. How can I enhance my program to deal with this situation without making significant changes? My program takes one or more directories that contain email messages, stored one per file, and prints a list of all the email threads. Here is a snippet of output: New addition to the Kazmier family Casey Kazmier Memoization in Erlang? Thomas Johnsson Ulf Wiger \(AL/EAB\) As a newcomer to Haskell, I am intrigued by lazy evaluation and how it can influence one's designs. With that said, I wrote the program as a sequence of list manipulations which seemed quite natural to do in Haskell starting with reading the contents of the each file. Here is the algorithm at the high level: 1. Read contents of all files returning a list of Strings 2. Map over the list and parse each String as an Email 3. Sort the list of Emails 4. Group the list of Emails by Subject 5. Map over the grouped list to create a list of Threads 6. Finally, print the list of Threads It is my understanding that, as a result of lazy IO, the entire file does not need to be read into memory because parseEmail only inspects the topmost portion of the email (its headers), which is a key part of my design as some of the files can be quite large. Unfortunately, as soon as I run this program on a directory with more than 1024 files, GHC craps out on me due to resource limits. It seems that the handles opened by readFile remain open. Would this be common across all Haskell implementations? How do I go about fixing this without making a significant number of changes to my program? Did I make a mistake in steps 1 and 2 above? Should I have read and parsed a single file at a time, and then move on to the next? I'd appreciate any other comments on the program as well. I feel this is the best example of Haskell code that I have written. Compared to the first version of this program I wrote a few months ago, this is a hundred times better. Here is the program:
module Main where
import Control.Monad (filterM, liftM) import Data.List import Data.Maybe import System.Directory import System.Environment
type From = String type Subject = String data Email = Email {from :: From, subject :: Subject} deriving Show data Thread = Thread [Email]
instance Show Thread where show (Thread emails@(e:es)) = title ++ senders where title = newline . bolder . subject $ e sender = newline . indent . from senders = concatMap sender emails newline = (++ "\n") indent = (" " ++) bolder = ("\27[0;32;40m" ++) . (++ "\27[0m")
main = getArgs >>= mapM fileContentsOfDirectory >>= mapM_ print . threadEmails . map parseEmail . concat
fileContentsOfDirectory :: FilePath -> IO [String] fileContentsOfDirectory dir = setCurrentDirectory dir >> getDirectoryContents dir >>= filterM doesFileExist >>= -- ignore directories mapM readFile
parseEmail :: String -> Email parseEmail text = Email (getHeader "From") (getHeader "Subject") where getHeader = fromMaybe "N/A" . flip lookup headers headers = concatMap mkassoc . takeWhile (/="") $ lines text mkassoc s = case findIndex (==':') s of Just n -> [(take n s, drop (n+2) s)] Nothing -> []
threadEmails :: [Email] -> [Thread] threadEmails = map Thread . groupBy (fuzzy (==)) . sortBy (fuzzy compare) where fuzzy fn e e' = stripReFwd (subject e) `fn` stripReFwd (subject e') stripReFwd = stripSpaces . reverse . stripToColon . reverse stripSpaces = dropWhile (==' ') stripToColon = takeWhile (/=':')