
Hello I've just found time to finish writing my first "real world" program, so I thought I'd post it here and ask for insight on general issues such as if there's anything that isn't done "the Haskell way", or if there's something that could be done more efficiently. The code is at the bottom of this message and also at http://hpaste.org/4893. I realize it's a bit long, so if anyone could just skim through it and see if there's anything really ugly or stupid and point it out, it would be of great help :) Just to make it easier to follow the code, its idea is simple: - Build a process tree by reading entries from /proc (represented as a map); - Compare each child of the init process against a whitelist (which comes from a configuration file); - For each child not in the whitelist, send it a KILL signal. The idea here is to run this on webservers and try to catch bad customers who try to run daemons from their accounts, the typical script kiddie stuff. Anyway, there's one specific question I'd like to ask. I'm using "StateT PsTree IO" to keep program state (the process tree). IO is necessary because all information is read from /proc files. Now consider the following function: appendChild :: Pid -> Pid -> StateT PsTree IO Bool appendChild ppid pid = do tree <- get let PsInfo psData children = mapLookup ppid tree put $ Map.insert ppid (PsInfo psData (pid:children)) tree return True It changes the program state by modifying a process tree entry, but it does no I/O at all. The return type is there basically to match the return type of the function which calls it (insertParent), which calls functions that do I/O. Is there anyway to avoid the "IO" in appendChild's signature (other than making it a pure function by passing the process tree as a parameter and returning a modified map)? I would also like to try ways to improve efficiency, maybe trying a hash table instead of a map for the state, and also using bytestrings. I guess I could try making it parallel, since each child of init can be checked independently. Anyway, this is already longer than I thought it would be (I hope I'm not abusing too much :) The code follows. Thanks in advance for any comments or suggestions. Andre module Main where import qualified Data.Map as Map import Directory import Control.Monad.State import Maybe import System.Environment import System.IO import System.Posix.Files import System.Posix.Signals import System.Posix.Unistd import System.Posix.User import Text.Printf import Text.Regex import Text.Regex.Posix type Pid = FilePath type Uid = String type PsData = Map.Map String String type PsChildren = [Pid] type KillFunction = PsTree -> Pid -> IO () data PsInfo = PsInfo PsData PsChildren type PsTree = Map.Map Pid PsInfo type Whitelist = Map.Map FilePath String mapLookup :: (Ord a) => a -> Map.Map a b -> b mapLookup k = fromJust . Map.lookup k -- Process Tree construction parentPid :: PsInfo -> Pid parentPid (PsInfo psData _) = mapLookup "PPid" psData getProcInfo :: String -> PsData -> PsData getProcInfo line psData = do case line =~~ "^([A-Za-z]+):[[:space:]]+(.*)$" of Nothing -> psData Just ([_, key, value]:_) -> Map.insert key value psData getIds :: String -> PsData -> (String, String) getIds id psData = (rId, eId) where (rId:eId:_) = words (mapLookup id psData) processData :: String -> PsData processData procData = addIds psData where psData = foldr getProcInfo Map.empty (lines procData) addIds psData = Map.union psData (idMap psData) idMap psData = Map.fromList [("RUid", rUid), ("EUid", eUid), ("RGid", rGid), ("EGid", eGid)] (rUid, eUid) = getIds "Uid" psData (rGid, eGid) = getIds "Gid" psData readLink :: String -> IO String readLink link = catch (readSymbolicLink link) (\e -> return "?") procInfo :: Pid -> IO PsInfo procInfo pid = do let dir = "/proc/" ++ pid ++ "/" procData <- readFile $ dir ++ "status" exe <- readLink $ dir ++ "exe" cwd <- readLink $ dir ++ "cwd" cmd <- readFile $ dir ++ "cmdline" let cmd' = subRegex (mkRegex "[^a-zA-z[:space:]\\/\\.-]") cmd " " info = processData procData adminInfo = Map.fromList [("Exe", exe), ("Cwd", cwd), ("Cmd", cmd')] return $ PsInfo (Map.union info adminInfo) [] addProc :: Pid -> StateT PsTree IO PsInfo addProc pid = do info <- lift $ procInfo pid modify (Map.insert pid info) return info appendChild :: Pid -> Pid -> StateT PsTree IO Bool appendChild ppid pid = do tree <- get let PsInfo psData children = mapLookup ppid tree put $ Map.insert ppid (PsInfo psData (pid:children)) tree return True insertParent :: Pid -> Pid -> StateT PsTree IO Bool insertParent ppid pid = do tree <- get if Map.member ppid tree then appendChild ppid pid else do built <- insertInTree ppid if built then appendChild ppid pid else return False insertPid :: Pid -> StateT PsTree IO Bool insertPid "1" = do info <- addProc "1" return True insertPid pid = do info <- addProc pid let ppid = parentPid info if ppid == "0" then return False else insertParent ppid pid insertInTree :: Pid -> StateT PsTree IO Bool insertInTree pid = do tree <- get if Map.member pid tree then return True else insertPid pid buildTree :: FilePath -> StateT PsTree IO Bool buildTree entry | entry =~ "^[0-9]+$" = insertInTree entry | otherwise = return False createTree :: IO PsTree createTree = do entries <- getDirectoryContents "/proc" execStateT (mapM_ buildTree entries) Map.empty -- Process Tree pretty-printing treeStr :: PsTree -> PsChildren -> Int -> String -> String treeStr tree children level str = foldr append str children where append pid s = treeStr tree children' (level + 1) newstr where PsInfo _ children' = mapLookup pid tree pad = take (4 * level) [' ', ' ' ..] newstr = s ++ "\n" ++ pad ++ pid printTree :: PsTree -> Pid -> IO () printTree tree pid = putStrLn (treeStr tree children 1 pid) where PsInfo _ children = mapLookup pid tree -- Process killing tryToKill :: PsTree -> KillFunction -> Pid -> Uid -> Uid -> Bool -> IO Bool tryToKill tree killFun pid "0" allowedUid killed = do -- The process may be starting, give it a second chance. sleep 10 -- conservative value. let PsInfo psData _ = mapLookup pid tree if allowedUid /= mapLookup "EUid" psData then killFun tree pid >> return True else return (killed || False) tryToKill tree killFun pid _ _ _ = killFun tree pid >> return True buildWhitelist :: String -> Whitelist -> Whitelist buildWhitelist line whitelist = do case line =~~ "^[ \t]*([^: \t]+)[ \t]*:[ \t]*([^ \t]+)[ \t]*$" of Nothing -> error "Invalid configuration file" Just ([_, exe, user]:_) -> Map.insert exe user whitelist readWhiteList :: FilePath -> IO Whitelist readWhiteList file = do contents <- readFile file return $ foldr buildWhitelist Map.empty (lines contents) allowedUidForExecutable :: Whitelist -> FilePath -> IO Uid allowedUidForExecutable whitelist exe = do case Map.lookup exe whitelist of Nothing -> return "0" Just user -> do entry <- getUserEntryForName user return $ show (userID entry) processBastard :: PsTree -> Whitelist -> KillFunction -> Bool -> Pid -> IO Bool processBastard tree whitelist killFun killed pid = do let PsInfo psData _ = mapLookup pid tree euid = mapLookup "EUid" psData exe = mapLookup "Exe" psData allowedUid <- allowedUidForExecutable whitelist exe if euid /= allowedUid then tryToKill tree killFun pid euid allowedUid killed else return killed withEachBastard :: PsTree -> Whitelist -> KillFunction -> IO Bool withEachBastard tree whitelist killFun = foldM (processBastard tree whitelist killFun) False children where (PsInfo _ children) = mapLookup "1" tree printWarnings :: Pid -> PsData -> IO () printWarnings pid psData = do let exe = mapLookup "Exe" psData let cmd = mapLookup "Cmd" psData let cwd = mapLookup "Cwd" psData let ppid = mapLookup "PPid" psData let euid = mapLookup "EUid" psData let ruid = mapLookup "RUid" psData let egid = mapLookup "EGid" psData if ruid /= euid then hPrintf stderr "PID %s: RUID=%s, EUID=%s\n" pid ruid euid else return () hPrintf stderr "Killing proc %s (%s, UID=%s, GID=%s), child of %s\n" pid exe euid egid ppid hPrintf stderr " Process command line: %s\n" cmd hPrintf stderr " Process working directory: %s\n" cwd killTree :: KillFunction killTree tree pid = do let PsInfo psData children = mapLookup pid tree printWarnings pid psData signalProcess sigKILL (read pid) mapM_ (killTree tree) children killBastards :: PsTree -> Whitelist -> Int -> IO () killBastards tree whitelist n = do runAgain <- withEachBastard tree whitelist killTree if runAgain && n > 1 then do sleep 2 killBastards tree whitelist (n - 1) else return () -- Configuration printConfig :: KillFunction printConfig tree pid = do let PsInfo psData _ = mapLookup pid tree exe = mapLookup "Exe" psData euid = mapLookup "EUid" psData entry <- getUserEntryForID (read euid) putStrLn $ exe ++ ": " ++ (userName entry) -- main helpers config :: PsTree -> IO () config tree = do withEachBastard tree Map.empty printConfig return () pstree :: PsTree -> IO () pstree tree = do printTree tree "1" -- In newer kernels, process 2 is kthreadd, which is not a child -- of init. let info = mapLookup "2" tree if parentPid info /= "1" then printTree tree "2" else return () kill :: PsTree -> IO () kill tree = do whitelist <- readWhiteList "killbastards.conf" killBastards tree whitelist 5 main :: IO () main = do args <- getArgs tree <- createTree case args of ["config"] -> config tree ["pstree"] -> pstree tree [] -> kill tree

On 9 Jan 2008, at 7:57 PM, Andre Nathan wrote:
Hello
I've just found time to finish writing my first "real world" program, so I thought I'd post it here and ask for insight on general issues such as if there's anything that isn't done "the Haskell way", or if there's something that could be done more efficiently.
The code is at the bottom of this message and also at http://hpaste.org/4893. I realize it's a bit long, so if anyone could just skim through it and see if there's anything really ugly or stupid and point it out, it would be of great help :)
Just to make it easier to follow the code, its idea is simple:
- Build a process tree by reading entries from /proc (represented as a map); - Compare each child of the init process against a whitelist (which comes from a configuration file); - For each child not in the whitelist, send it a KILL signal.
The idea here is to run this on webservers and try to catch bad customers who try to run daemons from their accounts, the typical script kiddie stuff.
Anyway, there's one specific question I'd like to ask. I'm using "StateT PsTree IO" to keep program state (the process tree). IO is necessary because all information is read from /proc files. Now consider the following function:
appendChild :: Pid -> Pid -> StateT PsTree IO Bool appendChild ppid pid = do tree <- get let PsInfo psData children = mapLookup ppid tree put $ Map.insert ppid (PsInfo psData (pid:children)) tree return True
A return type of Bool suggests the code might fail; a constant function should have return type ().
It changes the program state by modifying a process tree entry, but it does no I/O at all. The return type is there basically to match the return type of the function which calls it (insertParent), which calls functions that do I/O. Is there anyway to avoid the "IO" in appendChild's signature (other than making it a pure function by passing the process tree as a parameter and returning a modified map)?
This is the best solution, as well as the most idiomatic. It's really simple, too: appendChild :: Pid -> Pid -> PsTree -> PsTree appendChild ppid pid tree = Map.insert ppid (PsInfo psData (pid:children)) tree where PsInfo psData children = mapLookup ppid tree Which is two lines shorter than your version, and IMHO just as clear; or, even better appendChild ppid pid = Map.alter (fmap $ \ (PsInfo psData children) -> PsInfo psData (pid:children)) ppid which is a one-liner. Alternatively, you could keep the definition, but change the type to appendChild :: Monad m => Pid -> Pid -> StateT PsTree m Bool or appendChild :: MonadState m PsTree => Pid -> Pid -> m Bool although this is likely to be less efficient.
I would also like to try ways to improve efficiency, maybe trying a hash table instead of a map for the state, and also using bytestrings. I guess I could try making it parallel, since each child of init can be checked independently.
Anyway, this is already longer than I thought it would be (I hope I'm not abusing too much :)
An actual coding question, abuse? We should be so lucky.
The code follows. Thanks in advance for any comments or suggestions.
Andre
module Main where
import qualified Data.Map as Map
Also import Data.Map (Map) (Map.Map looks kind of silly).
import Directory import Control.Monad.State import Maybe import System.Environment import System.IO import System.Posix.Files import System.Posix.Signals import System.Posix.Unistd import System.Posix.User import Text.Printf import Text.Regex import Text.Regex.Posix
type Pid = FilePath type Uid = String
type PsData = Map.Map String String type PsChildren = [Pid] type KillFunction = PsTree -> Pid -> IO ()
data PsInfo = PsInfo PsData PsChildren
This sequence is better written data PsInfo = PsInfo{ psData :: Map String String, psChildren :: [Pid] } If find myself using typedefs relatively infrequently in Haskell.
type PsTree = Map.Map Pid PsInfo
type Whitelist = Map.Map FilePath String
mapLookup :: (Ord a) => a -> Map.Map a b -> b mapLookup k = fromJust . Map.lookup k
-- Process Tree construction
parentPid :: PsInfo -> Pid parentPid (PsInfo psData _) = mapLookup "PPid" psData
getProcInfo :: String -> PsData -> PsData getProcInfo line psData = do case line =~~ "^([A-Za-z]+):[[:space:]]+(.*)$" of Nothing -> psData Just ([_, key, value]:_) -> Map.insert key value psData
getIds :: String -> PsData -> (String, String) getIds id psData = (rId, eId) where (rId:eId:_) = words (mapLookup id psData)
processData :: String -> PsData processData procData = addIds psData where psData = foldr getProcInfo Map.empty (lines procData) addIds psData = Map.union psData (idMap psData) idMap psData = Map.fromList [("RUid", rUid), ("EUid", eUid), ("RGid", rGid), ("EGid", eGid)] (rUid, eUid) = getIds "Uid" psData (rGid, eGid) = getIds "Gid" psData
readLink :: String -> IO String readLink link = catch (readSymbolicLink link) (\e -> return "?")
procInfo :: Pid -> IO PsInfo procInfo pid = do let dir = "/proc/" ++ pid ++ "/" procData <- readFile $ dir ++ "status" exe <- readLink $ dir ++ "exe" cwd <- readLink $ dir ++ "cwd" cmd <- readFile $ dir ++ "cmdline" let cmd' = subRegex (mkRegex "[^a-zA-z[:space:]\\/\\.-]") cmd " " info = processData procData adminInfo = Map.fromList [("Exe", exe), ("Cwd", cwd), ("Cmd", cmd')] return $ PsInfo (Map.union info adminInfo) []
addProc :: Pid -> StateT PsTree IO PsInfo addProc pid = do info <- lift $ procInfo pid modify (Map.insert pid info) return info
appendChild :: Pid -> Pid -> StateT PsTree IO Bool appendChild ppid pid = do tree <- get let PsInfo psData children = mapLookup ppid tree put $ Map.insert ppid (PsInfo psData (pid:children)) tree return True
As above, or appendChild ppid pid = Map.alter (fmap $ \ st -> st {children = pid : children st}) ppid with the record syntax.
insertParent :: Pid -> Pid -> StateT PsTree IO Bool insertParent ppid pid = do tree <- get if Map.member ppid tree then appendChild ppid pid
modify (appendChild ppid pid)
else do built <- insertInTree ppid if built then appendChild ppid pid else return False
insertPid :: Pid -> StateT PsTree IO Bool insertPid "1" = do info <- addProc "1" return True insertPid pid = do info <- addProc pid let ppid = parentPid info if ppid == "0" then return False else insertParent ppid pid
insertInTree :: Pid -> StateT PsTree IO Bool insertInTree pid = do tree <- get if Map.member pid tree then return True else insertPid pid
buildTree :: FilePath -> StateT PsTree IO Bool buildTree entry | entry =~ "^[0-9]+$" = insertInTree entry | otherwise = return False
This function is fairly complicated, simply because of the number of separate definitions involved; I would be looking for opportunities to inline definitions here, so it's clearer what the definitions are. (Also, I would try to build a single, self-recursive function at the top level, put the call to procInfo there, and make everything else pure).
createTree :: IO PsTree createTree = do entries <- getDirectoryContents "/proc" execStateT (mapM_ buildTree entries) Map.empty
-- Process Tree pretty-printing
treeStr :: PsTree -> PsChildren -> Int -> String -> String treeStr tree children level str = foldr append str children where append pid s = treeStr tree children' (level + 1) newstr where PsInfo _ children' = mapLookup pid tree pad = take (4 * level) [' ', ' ' ..] newstr = s ++ "\n" ++ pad ++ pid
printTree :: PsTree -> Pid -> IO () printTree tree pid = putStrLn (treeStr tree children 1 pid) where PsInfo _ children = mapLookup pid tree
-- Process killing
tryToKill :: PsTree -> KillFunction -> Pid -> Uid -> Uid -> Bool -> IO Bool tryToKill tree killFun pid "0" allowedUid killed = do -- The process may be starting, give it a second chance. sleep 10 -- conservative value. let PsInfo psData _ = mapLookup pid tree if allowedUid /= mapLookup "EUid" psData then killFun tree pid >> return True else return (killed || False) tryToKill tree killFun pid _ _ _ = killFun tree pid >> return True
buildWhitelist :: String -> Whitelist -> Whitelist buildWhitelist line whitelist = do case line =~~ "^[ \t]*([^: \t]+)[ \t]*:[ \t]*([^ \t]+)[ \t]*$" of Nothing -> error "Invalid configuration file" Just ([_, exe, user]:_) -> Map.insert exe user whitelist
readWhiteList :: FilePath -> IO Whitelist readWhiteList file = do contents <- readFile file return $ foldr buildWhitelist Map.empty (lines contents)
allowedUidForExecutable :: Whitelist -> FilePath -> IO Uid allowedUidForExecutable whitelist exe = do case Map.lookup exe whitelist of Nothing -> return "0" Just user -> do entry <- getUserEntryForName user return $ show (userID entry)
processBastard :: PsTree -> Whitelist -> KillFunction -> Bool -> Pid -> IO Bool processBastard tree whitelist killFun killed pid = do let PsInfo psData _ = mapLookup pid tree euid = mapLookup "EUid" psData exe = mapLookup "Exe" psData allowedUid <- allowedUidForExecutable whitelist exe if euid /= allowedUid then tryToKill tree killFun pid euid allowedUid killed else return killed
withEachBastard :: PsTree -> Whitelist -> KillFunction -> IO Bool withEachBastard tree whitelist killFun = foldM (processBastard tree whitelist killFun) False children where (PsInfo _ children) = mapLookup "1" tree
printWarnings :: Pid -> PsData -> IO () printWarnings pid psData = do let exe = mapLookup "Exe" psData let cmd = mapLookup "Cmd" psData let cwd = mapLookup "Cwd" psData let ppid = mapLookup "PPid" psData let euid = mapLookup "EUid" psData let ruid = mapLookup "RUid" psData let egid = mapLookup "EGid" psData if ruid /= euid then hPrintf stderr "PID %s: RUID=%s, EUID=%s\n" pid ruid euid else return () hPrintf stderr "Killing proc %s (%s, UID=%s, GID=%s), child of %s\n" pid exe euid egid ppid hPrintf stderr " Process command line: %s\n" cmd hPrintf stderr " Process working directory: %s\n" cwd
killTree :: KillFunction killTree tree pid = do let PsInfo psData children = mapLookup pid tree printWarnings pid psData signalProcess sigKILL (read pid) mapM_ (killTree tree) children
killBastards :: PsTree -> Whitelist -> Int -> IO () killBastards tree whitelist n = do runAgain <- withEachBastard tree whitelist killTree if runAgain && n > 1 then do sleep 2 killBastards tree whitelist (n - 1) else return ()
-- Configuration
printConfig :: KillFunction printConfig tree pid = do let PsInfo psData _ = mapLookup pid tree exe = mapLookup "Exe" psData euid = mapLookup "EUid" psData entry <- getUserEntryForID (read euid) putStrLn $ exe ++ ": " ++ (userName entry)
I wouldn't call this a KillFunction; in fact, I would probably just inline the definition of KillFunction throughout. An expression that has to be decoded is better than a name that is misleading.
-- main helpers
config :: PsTree -> IO () config tree = do withEachBastard tree Map.empty printConfig return ()
pstree :: PsTree -> IO () pstree tree = do printTree tree "1" -- In newer kernels, process 2 is kthreadd, which is not a child -- of init. let info = mapLookup "2" tree if parentPid info /= "1" then printTree tree "2" else return ()
kill :: PsTree -> IO () kill tree = do whitelist <- readWhiteList "killbastards.conf" killBastards tree whitelist 5
main :: IO () main = do args <- getArgs tree <- createTree case args of ["config"] -> config tree ["pstree"] -> pstree tree [] -> kill tree

Hi Jonathan On Wed, 2008-01-09 at 21:32 -0800, Jonathan Cast wrote:
An actual coding question, abuse? We should be so lucky.
:) Your comments are much appreciated.
This function is fairly complicated, simply because of the number of separate definitions involved; I would be looking for opportunities to inline definitions here, so it's clearer what the definitions are. (Also, I would try to build a single, self-recursive function at the top level, put the call to procInfo there, and make everything else pure).
I rewrote insertInTree like below. Now it is the only function that has a StateT return type, and I also got rid of addProc, insertPid and insertParent :) insertInTree :: Pid -> StateT PsTree IO () insertInTree pid = do tree <- get if Map.member pid tree then return () else do info <- lift $ procInfo pid modify (Map.insert pid info) let ppid = parentPid info if ppid /= "0" then do insertInTree ppid modify (appendChild ppid pid) else return () I also rewrote createTree like this: createTree :: IO PsTree createTree = do entries <- getDirectoryContents "/proc" let procs = filter (=~ "^[0-9]+$") entries execStateT (mapM_ insertInTree procs) Map.empty Is that a bad way to do it? If haskell wasn't lazy this would be 3 O(n) operations, and I could write it using readDirStream to process all entries in one pass. I'm not sure if that's really necessary when laziness is present though. Thanks a lot for the other comments. I'll look into using a record for PsInfo now. Best, Andre

On 10 Jan 2008, at 10:21 AM, Andre Nathan wrote:
Hi Jonathan
On Wed, 2008-01-09 at 21:32 -0800, Jonathan Cast wrote:
An actual coding question, abuse? We should be so lucky.
:) Your comments are much appreciated.
You're welcome.
This function is fairly complicated, simply because of the number of separate definitions involved; I would be looking for opportunities to inline definitions here, so it's clearer what the definitions are. (Also, I would try to build a single, self-recursive function at the top level, put the call to procInfo there, and make everything else pure).
I rewrote insertInTree like below. Now it is the only function that has a StateT return type, and I also got rid of addProc, insertPid and insertParent :)
insertInTree :: Pid -> StateT PsTree IO () insertInTree pid = do tree <- get if Map.member pid tree then return () else do info <- lift $ procInfo pid modify (Map.insert pid info) let ppid = parentPid info if ppid /= "0" then do insertInTree ppid modify (appendChild ppid pid) else return ()
I also rewrote createTree like this:
createTree :: IO PsTree createTree = do entries <- getDirectoryContents "/proc" let procs = filter (=~ "^[0-9]+$") entries execStateT (mapM_ insertInTree procs) Map.empty
Is that a bad way to do it? If haskell wasn't lazy this would be 3 O (n) operations, and I could write it using readDirStream to process all entries in one pass. I'm not sure if that's really necessary when laziness is present though.
It might be faster; laziness usually has higher constants than direct implementations. But I doubt the difference is critical in this case, and I would definitely time a re-writing and throw it away unless it was significantly faster. But I don't think this is a case where laziness actually alters either the time or the space asymptotics of the algorithm (you end up creating an ~ O(n) tree anyway, so I'd figure O(n) space was OK for the loop, too). jcc

On Thu, 2008-01-10 at 20:37 -0800, Jonathan Cast wrote:
It might be faster; laziness usually has higher constants than direct implementations. But I doubt the difference is critical in this case, and I would definitely time a re-writing and throw it away unless it was significantly faster. But I don't think this is a case where laziness actually alters either the time or the space asymptotics of the algorithm (you end up creating an ~ O(n) tree anyway, so I'd figure O(n) space was OK for the loop, too).
I was wondering if laziness would make it run as if it was a single O(n) operation (get one directory entry "on demand", pass it to filter and then to insertInTree), because only one entry is used at a time, so that only the "current" entry needs to be evaluated. I still find it hard to evaluate the effects of laziness on the complexity of programs... Andre

On Fri, 2008-01-11 at 13:27 -0200, Andre Nathan wrote:
I was wondering if laziness would make it run as if it was a single O(n) operation (get one directory entry "on demand", pass it to filter and then to insertInTree), because only one entry is used at a time, so that only the "current" entry needs to be evaluated.
I did some experiments. This for 1000 reads of the entries of a directory with 10000 entries, checking if they match ^[0-9]+$ and printing those which do (half of them). getDirectoryEntries, filter, mapM_: 65.52s user 6.01s system 87% cpu 1:22.21 total openDirStream, readDirStream: 39.68s user 5.69s system 95% cpu 47.746 total getDirectoryContents, mapM_: 63.40s user 5.96s system 95% cpu 1:12.70 total Both versions which use getDirectoryContents also use much more memory than the one which uses readDirStream (about 8M vs about 2M). Maybe I'm not exploting getDirectoryContents' laziness correctly? I expected the second and third versions to run in about the same time. Andre

On Fri, 2008-01-11 at 20:20 -0200, Andre Nathan wrote:
Both versions which use getDirectoryContents also use much more memory than the one which uses readDirStream (about 8M vs about 2M). Maybe I'm not exploting getDirectoryContents' laziness correctly? I expected the second and third versions to run in about the same time.
Forgot to paste the code... foo :: IO () foo = do entries <- getDirectoryContents "." let procs = filter (=~ "^[0-9]+$") entries mapM_ putStrLn procs processEntry :: DirStream -> IO () processEntry ds = do entry <- readDirStream ds if entry =~ "^[0-9]+$" then do putStrLn entry processEntry ds else if entry == "" then return () else processEntry ds bar :: IO () bar = do ds <- openDirStream "." processEntry ds closeDirStream ds processEntry' :: FilePath -> IO () processEntry' entry = do if entry =~ "^[0-9]+$" then putStrLn entry else return () baz :: IO () baz = do entries <- getDirectoryContents "." mapM_ processEntry' entries main = forM_ [1..1000] $ \_ -> foo {- bar -} {- baz -}

On 11 Jan 2008, at 2:24 PM, Andre Nathan wrote:
processEntry :: DirStream -> IO () processEntry ds = do entry <- readDirStream ds if entry =~ "^[0-9]+$" then do putStrLn entry processEntry ds else if entry == "" then return () else processEntry ds
bar :: IO () bar = do ds <- openDirStream "." processEntry ds closeDirStream ds
This is a 200% increase in code size for a 75% decrease in execution time. (And, in general, the complexity can be much higher). That's an engineering tradeoff, and one you'll have to decide yourself; there's not much that can be done to make it go away. jcc

On 11 Jan 2008, at 2:20 PM, Andre Nathan wrote:
On Fri, 2008-01-11 at 13:27 -0200, Andre Nathan wrote:
I was wondering if laziness would make it run as if it was a single O(n) operation (get one directory entry "on demand", pass it to filter and then to insertInTree), because only one entry is used at a time, so that only the "current" entry needs to be evaluated.
I did some experiments. This for 1000 reads of the entries of a directory with 10000 entries, checking if they match ^[0-9]+$ and printing those which do (half of them).
getDirectoryEntries, filter, mapM_: 65.52s user 6.01s system 87% cpu 1:22.21 total
openDirStream, readDirStream: 39.68s user 5.69s system 95% cpu 47.746 total
getDirectoryContents, mapM_: 63.40s user 5.96s system 95% cpu 1:12.70 total
These are all known and expected. As I said, you can expect lazy versions to normally be slower than explicit loops. The question is whether 50% more time and 300% more memory has a higher cost in your case than the extra complexity and reduced modularity of the lazy code. jcc

On Fri, 2008-01-11 at 19:14 -0800, Jonathan Cast wrote:
These are all known and expected. As I said, you can expect lazy versions to normally be slower than explicit loops. The question is whether 50% more time and 300% more memory has a higher cost in your case than the extra complexity and reduced modularity of the lazy code.
I think I understand... I expected the getDirectoryContents + mapM_ to have about the same memory usage of the readDirStream version, because getDirectoryContents would lazily give me one entry at a time, but the list of entries returned by it ends up being created anyway, hence the larger memory usage, as the readDirStream version never builds a list. Andre

On 12 Jan 2008, at 7:19 AM, Andre Nathan wrote:
On Fri, 2008-01-11 at 19:14 -0800, Jonathan Cast wrote:
These are all known and expected. As I said, you can expect lazy versions to normally be slower than explicit loops. The question is whether 50% more time and 300% more memory has a higher cost in your case than the extra complexity and reduced modularity of the lazy code.
I think I understand... I expected the getDirectoryContents + mapM_ to have about the same memory usage of the readDirStream version, because getDirectoryContents would lazily give me one entry at a time, but the list of entries returned by it ends up being created anyway,
A nit: the list is almost certainly getting created lazily, or you'd get more than 300% more memory usage. But you still get the list's cons cells as your bookkeeping baggage, and they take up space in exchange for greater flexibility.
hence the larger memory usage, as the readDirStream version never builds a list.
jcc

On Sat, 2008-01-12 at 10:11 -0800, Jonathan Cast wrote:
A nit: the list is almost certainly getting created lazily, or you'd get more than 300% more memory usage. But you still get the list's cons cells as your bookkeeping baggage, and they take up space in exchange for greater flexibility.
But when I'm processing, say, the last directory entry, I have memory allocated for the other 9999 entries that have already been processed, right? I think that's where the 8M vs 2M difference comes from. Andre

On 12 Jan 2008, at 10:26 AM, Andre Nathan wrote:
On Sat, 2008-01-12 at 10:11 -0800, Jonathan Cast wrote:
A nit: the list is almost certainly getting created lazily, or you'd get more than 300% more memory usage. But you still get the list's cons cells as your bookkeeping baggage, and they take up space in exchange for greater flexibility.
But when I'm processing, say, the last directory entry, I have memory allocated for the other 9999 entries that have already been processed, right?
No. That would lead to a much larger difference. What you're seeing is the result of allocating a node in a linked list, not just a single directory entry. getDirectoryContents in this case certainly does not read in the entire directory, nor does it allocate space for it --- either of those would require much more than a 4x increase in the amount of memory.
I think that's where the 8M vs 2M difference comes from.
jcc

On 12 Jan 2008, at 10:26 AM, Andre Nathan wrote:
On Sat, 2008-01-12 at 10:11 -0800, Jonathan Cast wrote:
A nit: the list is almost certainly getting created lazily, or you'd get more than 300% more memory usage. But you still get the list's cons cells as your bookkeeping baggage, and they take up space in exchange for greater flexibility.
But when I'm processing, say, the last directory entry, I have memory allocated for the other 9999 entries that have already been processed,
Wait, the last entry? If you're just printing out the values, then no --- those should have been garbage collected already. jcc

On Sat, 2008-01-12 at 16:00 -0800, Jonathan Cast wrote:
Wait, the last entry? If you're just printing out the values, then no --- those should have been garbage collected already.
Won't they be garbage collected only after the last entry is used, though? Since getDirectoryEntries returns a list, won't its elements have to be kept until the list is not used anymore, i.e., after the last entry is processed? Andre

On Jan 13, 2008 12:42 AM, Andre Nathan
On Sat, 2008-01-12 at 16:00 -0800, Jonathan Cast wrote:
Wait, the last entry? If you're just printing out the values, then no --- those should have been garbage collected already.
Won't they be garbage collected only after the last entry is used, though? Since getDirectoryEntries returns a list, won't its elements have to be kept until the list is not used anymore, i.e., after the last entry is processed?
Well, if you're using the list like this: map (\i -> f (list !! i)) [0..10000] Then yes (it will not be garbage collected), but if you're using the list like this: map f list Then no (depending on the surroundings, of course). Recall what a list is: data List a = Empty | Cons a (List a) So once you process the first element and move to its tail, if there are no references to the original list, only its tail, then the first element will be garbage collected. Which is why you can do things like: filter isPowerOfTwo [1..] And get a list back without running out of memory when you get as high as 2^32. Luke

On 12 Jan 2008, at 4:42 PM, Andre Nathan wrote:
On Sat, 2008-01-12 at 16:00 -0800, Jonathan Cast wrote:
Wait, the last entry? If you're just printing out the values, then no --- those should have been garbage collected already.
Won't they be garbage collected only after the last entry is used, though?
No. A cell is generated once either (a) it, or (b) a cell earlier in the list, is required. A cell is garbage collected once both (a) it, and (b) all cells earlier in the list, are no longer required. So if you process the list strictly sequentially, it will be generated and garbage collected sequentially.
Since getDirectoryEntries returns a list, won't its elements have to be kept until the list is not used anymore, i.e., after the last entry is processed?
jcc

On Sat, 2008-01-12 at 21:39 -0800, Jonathan Cast wrote:
No. A cell is generated once either (a) it, or (b) a cell earlier in the list, is required. A cell is garbage collected once both (a) it, and (b) all cells earlier in the list, are no longer required. So if you process the list strictly sequentially, it will be generated and garbage collected sequentially.
Understood. Thanks Jonathan and Luke for your answers! Best, Andre

On 11 Jan 2008, at 7:27 AM, Andre Nathan wrote:
On Thu, 2008-01-10 at 20:37 -0800, Jonathan Cast wrote:
It might be faster; laziness usually has higher constants than direct implementations. But I doubt the difference is critical in this case, and I would definitely time a re-writing and throw it away unless it was significantly faster. But I don't think this is a case where laziness actually alters either the time or the space asymptotics of the algorithm (you end up creating an ~ O(n) tree anyway, so I'd figure O(n) space was OK for the loop, too).
I was wondering if laziness would make it run as if it was a single O(n) operation (get one directory entry "on demand", pass it to filter and then to insertInTree),
That should be the evaluation order, yes. I think big-O notation is the wrong notation for this; you get a single loop with three sequenced operations, rather than three sequenced loops, but both are O(n).
because only one entry is used at a time, so that only the "current" entry needs to be evaluated.
I still find it hard to evaluate the effects of laziness on the complexity of programs...
jcc
participants (3)
-
Andre Nathan
-
Jonathan Cast
-
Luke Palmer