[Haskell-beginners] Re: Memory usage problem

24 Mar 2010

      Sami Liedes wrote:
...
Patrick LeBoutillier wrote:
...
I'm no profiling expert, but I have a few questions though:
- What is the size (in bytes) of your input file?
The input file contains 31,129,639 bytes, with 710,355 lines and 27,954
individual packages/records.
In fact it seems that even if the input stream repeats only the line
" a" (that is, a space and the letter a) infinitely, the program
eventually eats all memory. That's interesting.
So I guess the lines are read in, but then held in memory and lazily
not processed further until the entire file has been read, because
it's not necessary? Or something.
Yep, pretty much. Basically,  readRecords  returns a list of unevaluated
expressions of type  Package . But because they haven't been fully
evaluated yet, they still contain a lot of references to the input stream.

This wouldn't be a problem if the list itself were unevaluated, but it
appears that  sort  is forcing the spine of the list much earlier than
its elements, which means that you've read the whole file into memory
without also parsing each package into a  Package  record. This way, you
retain way too much of the input file.

A quick fix would be to intersperse a function between  sort  and
readRecords  that imposes proper evaluation order:

   import Control.Parallel.Strategies  -- or Control.DeepSeq

   instance NFData Package where
       rnf (Package n v) = rnf n `seq` rnf v `seq` ()

   processFile =
           unlines . map sprintPackage
         . sort . strictify . readRecords . lines
       where
       strictify []     = []
       strictify (x:xs) = rnf x `seq` (x : strictify xs)

Here  strictify  makes sure that each cons cell comes with a fully
evaluated element  x  while the list itself is still lazy. (Since  sort
 has to evaluate the list anyway, the latter doesn't really matter and
you could as well use

       strictify xs = rnf xs `seq` xs

)

While I expect the above to work, I hesitate to claim that it actually
does. The reason is that I don't understand your  readRecords  function
at a glance, unlike the  processFile  pipeline, it is not apparent how
it works. Since finding and fixing space leaks is easier if your code is
more obvious, I recommend to reformulate it as a function composition as
well, for instance something like this:

    import Control.Arrow ((&&&))
    import Data.Maybe (catMaybes)
    import Daya.List (stripPrefix)

    readRecords = map readRecord . splitByEmptyLines

    splitByEmptyLines = filter (not . null) . groupBy (\x y -> y /= "")

    readRecord =
            uncurry Package
          . (prefix "Name: " &&& prefix "Version: ")
          . filter (not . beginsWithSpace)
        where
        beginsWithSpace (' ':xs) = True
        beginsWithSpace _ = False

            -- be warned,  head  may raise an exception
        prefix s = head . catMaybes . map (stripPrefix s)

The rule is that explicit recursion should be replaced by high level
recursion schemes like  map , filter , fold , concat  etc. whenever
applicable.

This way, it's much easier to see how much of the input is retained for
each  Package . For instance, your original  readRecord  function could
read an arbitrary number of lines, while here, it's clear that
readRecord  can't cross the "boundary" imposed by  splitByEmptyLines .

Regards,
Heinrich Apfelmus

--
http://apfelmus.nfshost.com