
Hi!

I wrote a simple program to parse Debian package list files (a simple text record format), sort the list of packages and output the list. I first wrote the program in C (which I speak fluently), but wanting to learn Haskell, I have been trying to reimplement it in Haskell.

The problem is that my 60-line Haskell program, which doesn't yet do everything the C program does, uses huge amounts of memory: 1.5 gigabytes when parsing a single file. The C implementation can parse and sort more than twice that data in less than 10 megabytes. I can't figure out where the memory goes or how to fix it, but my guess would be that it's some lazy computation thunk.

(See below, after the code, for a description of the input and output formats.)

With a package file of 28,000 records (packages), the memory usage is roughly 1.5 gigabytes when compiled with ghc - under hugs it just bails out quickly with a stack overflow - and I'd appreciate help in figuring out:

1) Where is the memory actually spent?
2) How would one usually go about figuring this out?
3) How can I fix it?

[4) I'd also love style hints and other ideas to make my code more idiomatic.]

I've tried randomly adding some $! operators to the code (that doesn't feel right :-), but so far to no avail.

Here's the code:

------------------------------------------------------------
module Main where

import List (sort)
import Maybe (fromJust, mapMaybe)

-- Parse one line of the format "Field: value".
-- Ignore those that start with a space.
readField :: String -> Maybe (String,String)
readField line =
    if line == "" || head line == ' '
       then Nothing
       else let (name,':':' ':val) = break (== ':') line
            in Just (name,val)

data Package = Package { name, version :: !String }
               deriving (Show, Eq, Ord)

-- Use sprintPackage for formatted output, show for unformatted
sprintPackage :: Package -> String
sprintPackage (Package name ver) =
    name ++ replicate (50 - length name) ' ' ++ ver

-- Read one record worth of lines (separated by blank lines)
getOneRecordLines [] = ([],[])
getOneRecordLines lines = (hd, if tl == [] then [] else tail tl)
    where (hd,tl) = break (== "") lines

-- Read one record worth of lines and parse, returning (Field, value)
-- tuples and the rest of the lines (less the just read record)
readRecordFields :: [String] -> ([(String,String)], [String])
readRecordFields lines = (mapMaybe readField rl, rest)
    where (rl,rest) = getOneRecordLines lines

-- Convert a list of (Field,value) tuples to a Package
recordFieldsToPackage :: [(String,String)] -> Package
recordFieldsToPackage fields =
    let names = [y | (x,y) <- fields, x=="Package"]
        vers  = [y | (x,y) <- fields, x=="Version"]
    in case (names,vers) of
         ([], _)         -> error "Package has no name."
         (a:b:rest, _)   -> error ("Package has two names ("++a++","++b++").")
         ([n], [])       -> error ("Package "++n++" has no version.")
         ([n], a:b:rest) -> error ("Package "++n++" has multiple versions.")
         ([n], [v])      -> Package n v

-- Read one record, returning Just Package if it contained a valid
-- package and version, Nothing otherwise.
readRecord :: [String] -> (Maybe Package,[String])
readRecord lines =
    if fields == []
       then (Nothing, rest)
       else (Just (recordFieldsToPackage fields), rest)
    where (fields,rest) = readRecordFields lines

-- Convert the list (stream) of lines to a list of packages
readRecords :: [String] -> [Package]
readRecords [] = []
readRecords lines =
    let (rec,rest) = readRecord lines
    in case rec of
         Just pkg -> pkg : readRecords rest
         Nothing  -> readRecords rest

processFile = unlines . map sprintPackage . sort . readRecords . lines

main :: IO ()
main = interact processFile
------------------------------------------------------------

The input (stdin) for the program is of the format:

------------------------------------------------------------
Package: somepackagename
Version: someversionstring
Other-Fields: ...
Whatever: ...
 ... (lines that begin with space are just ignored)

Package: pkg2
Version: otherversion
...
------------------------------------------------------------

That is, the file has records separated by blank lines. Only lines that begin with "Package: " or "Version: " are considered in a record; the others are just ignored.

When finished, it outputs a formatted list of packages to stdout:

------------------------------------------------------------
a2ps                                              1:4.14-1
a2ps-perl-ja                                      1.45-5
a56                                               1.3-5
a7xpg                                             0.11.dfsg1-4
a7xpg-data                                        0.11.dfsg1-4
aa3d                                              1.0-8
aajm                                              0.4-3
aap                                               1.091-1
aap-doc                                           1.091-1
...
------------------------------------------------------------

GHC profiling hints that the memory is spent in processFile and main, but no amount of adding $! to those functions seems to help:

------------------------------------------------------------
	Thu Mar 18 22:37 2010 Time and Allocation Profiling Report  (Final)

	   aptlistsh +RTS -p -RTS

	total time  =        3.18 secs   (159 ticks @ 20 ms)
	total alloc = 1,873,922,000 bytes  (excludes profiling overheads)

COST CENTRE            MODULE  %time %alloc

processFile            Main     44.0   61.9
main                   Main     32.1   20.7
readField              Main     11.9    9.0
recordFieldsToPackage  Main      4.4    2.0
newPkg                 Main      3.1    0.8
sprintPackage          Main      1.3    2.1
readRecordFields       Main      1.3    1.3
CAF                    Main      1.3    0.4
getOneRecordLines      Main      0.6    1.4
------------------------------------------------------------

	Sami

Sami,
I'm no profiling expert, but I have a few questions:
- What is the size (in bytes) of your input file?
- Also, does memory usage improve if you remove the "sort"?
Patrick
--
=====================
Patrick LeBoutillier
Rosemère, Québec, Canada

On Sun, Mar 21, 2010 at 08:28:23PM -0400, Patrick LeBoutillier wrote:
> I'm no profiling expert, but I have a few questions:
> - What is the size (in bytes) of your input file?
The input file contains 31,129,639 bytes, with 710,355 lines and 27,954 individual packages/records.

In fact it seems that even if the input stream repeats only the line " a" (that is, a space and the letter a) infinitely, the program eventually eats all memory. That's interesting. So I guess the lines are read in, but then held in memory and lazily not processed further until the entire file has been read, because it's not necessary? Or something.

I tried using $! in readRecordFields, like

    readRecordFields lines = (mapMaybe (readField $!) rl, rest)
        where (rl,rest) = getOneRecordLines lines

but that didn't help either... I guess I don't fully understand $!. Does it force the entire computation below it to finish?

To achieve sane memory behavior, it would seem necessary to parse the lines before they have all been read; then mapMaybe in readRecordFields would throw out the Nothings. After that, I believe it should all run in a constant amount of memory when every line starts with a space.
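(Regarding $! above: a small ghci experiment, if I'm reading it right, suggests that seq, and therefore $!, only evaluates its argument down to the outermost constructor, so the fields of a pair stay behind as unevaluated thunks:

    ghci> let p = (undefined, undefined) :: (Int, Int)
    ghci> p `seq` "fine"      -- the (,) constructor alone satisfies seq
    "fine"
    ghci> fst p `seq` "boom"  -- forcing a component hits the thunk
    *** Exception: Prelude.undefined

So sprinkling $! around presumably can't force the parsing all the way down.)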
> - Also, does memory usage improve if you remove the "sort"?
Yes. Then it only takes a few megabytes, regardless of how large the file is.

Thanks,

	Sami

On Mon, Mar 22, 2010 at 03:06:08AM +0200, Sami Liedes wrote:
> On Sun, Mar 21, 2010 at 08:28:23PM -0400, Patrick LeBoutillier wrote:
>> - Also, does memory usage improve if you remove the "sort"?
> Yes. Then it only takes a few megabytes, regardless of how large the file is.
This led me to investigate further. The following hack (and nothing less crude that I could invent) seems to significantly lessen the amount of memory needed by the program: now it takes maybe 180 megabytes instead of 1.5 gigabytes. I guess the 180 megabytes consists of the output stream (string), which is now explicitly computed in its entirety before sorting and outputting it.

It's still far from perfect. I think there definitely _should_ be a better way of forcing some computation to be strict so that it takes less memory. Any ideas?

And I especially despise the (sum (map (const 0) b)) thing, but it seems nothing less ugly does the trick. It seems seq doesn't force the entire computation to finish, only the next level. Is there some way to force the entire computation subtree to finish? That way I could say just seq' b (if seq' were the function that does that).

(See my original posting for the entire code.)

------------------------------------------------------------
-- Return the unsorted package list and the sorted package list
processFile :: String -> (String,String)
processFile str =
    let l    = lines str
        pkgs = readRecords l
    in (unlines (map sprintPackage pkgs),
        unlines (map sprintPackage (sort pkgs)))

-- Force the unsorted pkg list to be computed first (outputting it also
-- does the trick, but a mere (map id b) doesn't). This is really ugly.
-- A plain "do putStrLn c" however eats 1.5 GB of memory instead of 180 MB.
main :: IO ()
main = do
    a <- getContents
    let (b,c) = processFile a
     in seq (sum (map (const 0) b))
            (do putStrLn c)
------------------------------------------------------------

	Sami
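P.S. A slightly less ugly spelling of the same trick should be length, since it also just walks every cons cell of b without looking at the characters. A sketch of what I mean (same forcing behavior, as far as I can tell):

    main :: IO ()
    main = do
        a <- getContents
        let (b,c) = processFile a
        -- length forces the spine of b (every cons cell, not the
        -- elements) before c is printed, like sum/map/const did.
        length b `seq` putStrLn c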

On Mon, 22 Mar 2010 20:58:29 +0100, Sami Liedes wrote:
> It seems seq doesn't force the entire computation to finish, only the next level. Is there some way to force the entire computation subtree to finish? That way I could say just seq' b (if seq' were the function that does that).
You can use Control.Parallel.Strategies.rnf (Reduce to Normal Form) for that.

Kind regards,
Henk-Jan van Tuyl

--
http://Van.Tuyl.eu/
http://members.chello.nl/hjgtuyl/tourdemonad.html
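P.S. A quick sketch (untested) of such a seq', assuming the Strategies API where rnf has type NFData a => a -> ():

    import Control.Parallel.Strategies

    -- rnf reduces its argument all the way to normal form, so this
    -- forces the whole structure, not just the outermost constructor.
    seq' :: NFData a => a -> b -> b
    seq' a b = rnf a `seq` b

Note that for your Package type you would still need to write an NFData instance.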

Sami Liedes wrote:
> Patrick LeBoutillier wrote:
>> I'm no profiling expert, but I have a few questions:
>> - What is the size (in bytes) of your input file?
> The input file contains 31,129,639 bytes, with 710,355 lines and 27,954 individual packages/records.
> In fact it seems that even if the input stream repeats only the line " a" (that is, a space and the letter a) infinitely, the program eventually eats all memory. That's interesting.
> So I guess the lines are read in, but then held in memory and lazily not processed further until the entire file has been read, because it's not necessary? Or something.
Yep, pretty much. Basically, readRecords returns a list of unevaluated expressions of type Package. But because they haven't been fully evaluated yet, they still contain a lot of references to the input stream. This wouldn't be a problem if the list itself were unevaluated, but it appears that sort forces the spine of the list much earlier than its elements, which means that you've read the whole file into memory without also parsing each package into a Package record. This way, you retain way too much of the input file.

A quick fix would be to intersperse a function between sort and readRecords that imposes proper evaluation order:

    import Control.Parallel.Strategies -- or Control.DeepSeq

    instance NFData Package where
        rnf (Package n v) = rnf n `seq` rnf v `seq` ()

    processFile = unlines . map sprintPackage . sort
                . strictify . readRecords . lines
        where
        strictify []     = []
        strictify (x:xs) = rnf x `seq` (x : strictify xs)

Here strictify makes sure that each cons cell comes with a fully evaluated element x while the list itself is still lazy. (Since sort has to evaluate the list anyway, the latter doesn't really matter and you could just as well use strictify xs = rnf xs `seq` xs.)

While I expect the above to work, I hesitate to claim that it actually does. The reason is that I don't understand your readRecords function at a glance; unlike the processFile pipeline, it is not apparent how it works. Since finding and fixing space leaks is easier if your code is more obvious, I recommend reformulating it as a function composition as well, for instance something like this:

    import Control.Arrow ((&&&))
    import Data.List (groupBy, stripPrefix)
    import Data.Maybe (catMaybes)

    readRecords = map readRecord . splitByEmptyLines

    splitByEmptyLines = filter (not . null) . groupBy (\x y -> y /= "")

    readRecord = uncurry Package
               . (prefix "Package: " &&& prefix "Version: ")
               . filter (not . beginsWithSpace)
        where
        beginsWithSpace (' ':xs) = True
        beginsWithSpace _        = False

        -- be warned, head may raise an exception
        prefix s = head . catMaybes . map (stripPrefix s)

The rule is that explicit recursion should be replaced by high-level recursion schemes like map, filter, fold, concat etc. whenever applicable. This way, it's much easier to see how much of the input is retained for each Package. For instance, your original readRecord function could read an arbitrary number of lines, while here it's clear that readRecord can't cross the "boundary" imposed by splitByEmptyLines.

Regards,
Heinrich Apfelmus

--
http://apfelmus.nfshost.com
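P.S. If you'd rather use Control.DeepSeq, the same strictify can be written with the deepseq package (a sketch, assuming you have deepseq installed; its deepseq combinator is defined as deepseq a b = rnf a `seq` b):

    import Control.DeepSeq (NFData(..), deepseq)

    instance NFData Package where
        rnf (Package n v) = rnf n `seq` rnf v `seq` ()

    -- Force each element to normal form as its cons cell is produced,
    -- while leaving the spine of the list lazy.
    strictify :: NFData a => [a] -> [a]
    strictify = foldr (\x xs -> x `deepseq` (x : xs)) []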

On Wed, Mar 24, 2010 at 11:23:05AM +0100, Heinrich Apfelmus wrote:
> A quick fix would be to intersperse a function between sort and readRecords that imposes proper evaluation order: [...]
> While I expect the above to work, I hesitate to claim that it actually does. The reason is that I don't understand your readRecords function at a glance; unlike the processFile pipeline, it is not apparent how it works. Since finding and fixing space leaks is easier if your code is more obvious, I recommend reformulating it as a function composition as well, for instance something like this: [...]
Thanks, that was enlightening! With your version of readRecords, the program takes as much memory as with mine (it's still more than 10x what the C version takes, but perhaps that's the cost of using Haskell). But your version taught me a lot. Exactly the kind of feedback I needed :)

	Sami

Sami Liedes wrote:
> On Wed, Mar 24, 2010 at 11:23:05AM +0100, Heinrich Apfelmus wrote:
>> A quick fix would be to intersperse a function between sort and readRecords that imposes proper evaluation order: [...] While I expect the above to work, I hesitate to claim that it actually does. The reason is that I don't understand your readRecords function at a glance; unlike the processFile pipeline, it is not apparent how it works. Since finding and fixing space leaks is easier if your code is more obvious, I recommend reformulating it as a function composition as well, for instance something like this: [...]
> Thanks, that was enlightening! With your version of readRecords, the program takes as much memory as with mine (it's still more than 10x what the C version takes, but perhaps that's the cost of using Haskell). But your version taught me a lot. Exactly the kind of feedback I needed :)
My pleasure. :)

I should point out that my version of readRecords still needs to be composed with the strictify function, for the same reasons as yours. But it shouldn't be very difficult to bake the evaluation order into readRecords now, if desired.

Also note that my readRecord has introduced another potential space leak. But thanks to the clarified structure, it is easy to spot. Namely, the function

    prefix "Package: " &&& prefix "Version: "

is problematic. Expanding the (&&&) combinator, this is equal to

    \xs -> (prefix "Package: " xs, prefix "Version: " xs)

which means that xs is used twice and is thus potentially leaky. After all, the first component might evaluate xs in full, which will then leak around until the second component consumes it. This is exactly what will happen if you feed it a pathological input file consisting of just a single but huge record, along the lines of

    Version: xyz
    Package: haskell
    Foo: bar
    Foo: bar
    ...
    Foo: bar

If this turns out to be a major problem, you can simply rewrite the offending function to evaluate the components "in parallel". The partition function in the Haskell Prelude demonstrates how to do this.

Regards,
Heinrich Apfelmus

--
http://apfelmus.nfshost.com
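P.S. A sketch of what the partition-style rewrite could look like here (untested, and nameAndVersion is just a name I made up):

    import Data.List (stripPrefix)

    -- Extract both fields in a single pass over the record's lines, in
    -- the style of the Prelude's partition: the lazy pattern ~(n, v)
    -- lets foldr produce the pair immediately, and both components
    -- share one traversal instead of each keeping its own reference
    -- to the whole list.
    nameAndVersion :: [String] -> (String, String)
    nameAndVersion = foldr select (missing, missing)
      where
        missing = error "field missing"
        select line ~(n, v)
          | Just n' <- stripPrefix "Package: " line = (n', v)
          | Just v' <- stripPrefix "Version: " line = (n, v')
          | otherwise                               = (n, v)

    readRecord = uncurry Package . nameAndVersion
               . filter (not . beginsWithSpace)
      where
        beginsWithSpace (' ':_) = True
        beginsWithSpace _       = False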