Sami Liedes wrote:
Patrick LeBoutillier wrote:
I'm no profiling expert, but I have a few questions though:
- What is the size (in bytes) of your input file?
The input file contains 31,129,639 bytes, with 710,355 lines and 27,954 individual packages/records.
In fact it seems that even if the input stream repeats only the line " a" (that is, a space and the letter a) infinitely, the program eventually eats all memory. That's interesting.
So I guess the lines are read in, but then held in memory and lazily not processed further until the entire file has been read, because it's not necessary? Or something.
Yep, pretty much. Basically, readRecords returns a list of unevaluated expressions of type Package . But because they haven't been fully evaluated yet, they still contain a lot of references to the input stream. This wouldn't be a problem if the list itself were unevaluated, but it appears that sort is forcing the spine of the list much earlier than its elements, which means that you've read the whole file into memory without also parsing each package into a Package record. This way, you retain way too much of the input file. A quick fix would be to intersperse a function between sort and readRecords that imposes proper evaluation order: import Control.Parallel.Strategies -- or Control.DeepSeq instance NFData Package where rnf (Package n v) = rnf n `seq` rnf v `seq` () processFile = unlines . map sprintPackage . sort . strictify . readRecords . lines where strictify [] = [] strictify (x:xs) = rnf x `seq` (x : strictify xs) Here strictify makes sure that each cons cell comes with a fully evaluated element x while the list itself is still lazy. (Since sort has to evaluate the list anyway, the latter doesn't really matter and you could as well use strictify xs = rnf xs `seq` xs ) While I expect the above to work, I hesitate to claim that it actually does. The reason is that I don't understand your readRecords function at a glance, unlike the processFile pipeline, it is not apparent how it works. Since finding and fixing space leaks is easier if your code is more obvious, I recommend to reformulate it as a function composition as well, for instance something like this: import Control.Arrow ((&&&)) import Data.Maybe (catMaybes) import Daya.List (stripPrefix) readRecords = map readRecord . splitByEmptyLines splitByEmptyLines = filter (not . null) . groupBy (\x y -> y /= "") readRecord = uncurry Package . (prefix "Name: " &&& prefix "Version: ") . filter (not . beginsWithSpace) where beginsWithSpace (' ':xs) = True beginsWithSpace _ = False -- be warned, head may raise an exception prefix s = head . catMaybes . map (stripPrefix s) The rule is that explicit recursion should be replaced by high level recursion schemes like map , filter , fold , concat etc. whenever applicable. This way, it's much easier to see how much of the input is retained for each Package . For instance, your original readRecord function could read an arbitrary number of lines, while here, it's clear that readRecord can't cross the "boundary" imposed by splitByEmptyLines . Regards, Heinrich Apfelmus -- http://apfelmus.nfshost.com