
Sami,
I'm no profiling expert, but I have a few questions though:
- What is the size (in bytes) of your input file?
- Also, does memory usage improve if you remove the "sort"?
Patrick
On Sun, Mar 21, 2010 at 7:23 PM, Sami Liedes
Hi!
I wrote a simple program to parse Debian package list files (a simple text record format), sort the list of packages and output the list. I first wrote the program in C (which I speak fluently), but wanting to learn Haskell, have been trying to reimplement it in Haskell.
The problem is that my 60-line Haskell program, which doesn't yet do everything the C program does, uses huge amounts of memory, 1.5 gigabytes when parsing a single file. The C implementation can parse and sort more than twice that data in less than 10 megabytes. I can't figure out where the memory goes or how to fix it, but my guess would be it's some lazy computation thunk.
(See below after the code for description of the input and output formats.)
With a package file of 28000 records (packages), the memory usage is roughly 1.5 gigabytes when compiled with ghc - under hugs it just bails out quickly after stack overflow - and I'd appreciate help in figuring out
1) Where the memory is actually spent? 2) How would one usually go about figuring this out?
and
3) How to fix it? [4) I'd also love style hints and other ideas to make my code more idiomatic]
I've tried to randomly add some $! operators to the code (that doesn't feel right :-), but so far to no avail.
Here's the code:
------------------------------------------------------------ module Main where import List (sort) import Maybe (fromJust, mapMaybe)
-- Parse one line of the format "Field: value". -- Ignore those that start with a space. readField :: String -> Maybe (String,String) readField line = if line == "" || head line == ' ' then Nothing else let (name,':':' ':val) = break (== ':') line in Just (name,val)
data Package = Package { name, version :: !String } deriving (Show, Eq, Ord)
-- Use sprintPackage for formatted output, show for unformatted sprintPackage :: Package -> String sprintPackage (Package name ver) = name ++ replicate (50 - length name) ' ' ++ ver
-- Read one record worth of lines (separated by blank lines) getOneRecordLines [] = ([],[]) getOneRecordLines lines = (hd, if tl == [] then [] else tail tl) where (hd,tl) = break (== "") lines
-- Read one record worth of lines and parse, returning (Field, value) -- tuples and the rest of the lines (less the just read record) readRecordFields :: [String] -> ([(String,String)], [String]) readRecordFields lines = (mapMaybe readField rl, rest) where (rl,rest) = getOneRecordLines lines
-- Convert a list of (Field,value) tuples to a Package recordFieldsToPackage :: [(String,String)] -> Package recordFieldsToPackage fields = let names = [y | (x,y) <- fields, x=="Package"] vers = [y | (x,y) <- fields, x=="Version"] in case (names,vers) of ([], _) -> error "Package has no name." (a:b:rest, _) -> error ("Package has two names ("++a++","++b++").") ([n], []) -> error ("Package "++n++" has no version.") ([n], a:b:rest) -> error ("Package "++n++" has multiple versions.") ([n], [v]) -> Package n v
-- Read one record, returning Just Package if it contained a valid -- package and version, Nothing otherwise. readRecord :: [String] -> (Maybe Package,[String]) readRecord lines = if fields == [] then (Nothing, rest) else (Just (recordFieldsToPackage fields), rest) where (fields,rest) = readRecordFields lines
-- Converts the list (stream) of lines to a list of packages readRecords :: [String] -> [Package] readRecords [] = [] readRecords lines = let (rec,rest) = readRecord lines in case rec of Just pkg -> pkg : readRecords rest Nothing -> readRecords rest
processFile = unlines . (map sprintPackage) . sort . readRecords . lines
main :: IO () main = interact processFile ------------------------------------------------------------
The input (stdin) for the program is of the format:
------------------------------------------------------------ Package: somepackagename Version: someversionstring Other-Fields: ... Whatever: ... ... (lines that begin with space are just ignored)
Package: pkg2 Version: otherversion ... ------------------------------------------------------------
That is, the file has records separated by blank lines. Only lines that begin with "Package: " or "Version: " are considered in a record, the others are just ignored.
When finished, it outputs a formatted list of packages to stdout:
------------------------------------------------------------ a2ps 1:4.14-1 a2ps-perl-ja 1.45-5 a56 1.3-5 a7xpg 0.11.dfsg1-4 a7xpg-data 0.11.dfsg1-4 aa3d 1.0-8 aajm 0.4-3 aap 1.091-1 aap-doc 1.091-1 ... ------------------------------------------------------------
GHC profiling hints that the memory is spent in processFile and main, but no amount of adding $! to those functions seems to help:
------------------------------------------------------------ Thu Mar 18 22:37 2010 Time and Allocation Profiling Report (Final)
aptlistsh +RTS -p -RTS
total time = 3.18 secs (159 ticks @ 20 ms) total alloc = 1,873,922,000 bytes (excludes profiling overheads)
COST CENTRE MODULE %time %alloc
processFile Main 44.0 61.9 main Main 32.1 20.7 readField Main 11.9 9.0 recordFieldsToPackage Main 4.4 2.0 newPkg Main 3.1 0.8 sprintPackage Main 1.3 2.1 readRecordFields Main 1.3 1.3 CAF Main 1.3 0.4 getOneRecordLines Main 0.6 1.4 ------------------------------------------------------------
Sami
-----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.10 (GNU/Linux)
iQIcBAEBCAAGBQJLpqppAAoJEKLT589SE0a08FoQALtZf+eO7u+YdVJuvZ6xxg/y DVHTnRdemPtlNuBJazV1RlTQovRNyQpX1CHX70MdnHLP4g9CzQhBsIeGTdgA9am+ pd9NjNFc1+ZDJJWAY24Kyx0pEfvyfLuUvF8KNInZGXeg+BriiFjFFU3RHqY0F3XD LIjAq/TaABmMGgahvG/y0EMRhduC+N9VskA4ivd7OCxvA7bGuzZEFxFbz2qqXrcE VXzE4ioha8C+WeX6djm/+CSoqN8o7bN1DS+fkEYr+jlU2jYG/xvnnfC8785r+9iP wZo1muIP4FVExZ3w5VSjLWkhMjgYDHLiuEs1S5dQdrgAHz+ea+PZqFyOoDEMyYvG vhnaTrPev2O4BZLAky5F13tj7fm1tV5CxJ9oq7fuurZoB7OdEpSguOyR3xYgoIYv spg7aqRlLuruX7VHbSu6pTtqWCEvxHxQbd+R2f5jqaDOsw+3n1enH6PFqTkiLpHc ReTLli4ploUGmiQRmbWpe2urECQ6QAXhqdx93vHwT2TUYVqL9wH84XsNvGBSt8Kq 5OowrJOlneeWre0SS1dn96ASE1oKKiCpc9rkGvD3tMxTONHFTPziyMtiC675ReH0 9XHvRkn9+8RV2gVnKeOMyfPT3ef/q2aP8uESf4nW6AAHNa41QWiB9/OYS32ljx/9 Cv/IvQVVIzSYQjNTO7a2 =yn0D -----END PGP SIGNATURE-----
_______________________________________________ Beginners mailing list Beginners@haskell.org http://www.haskell.org/mailman/listinfo/beginners
-- ===================== Patrick LeBoutillier Rosemère, Québec, Canada