Excess mem consumption in file IO task

Hi!

I have some resource problems when extracting data from a file. The task is as follows: I have a huge (500MB) binary file, containing some interesting parts and lots of rubbish. Furthermore, there is a directory that tells me the parts of the file (first- and last byte index) that contain the substrings I need.

My approach to do this is to open the file and to pass the list of addresses along with the handle to a function that processes the list step-by-step and calls a subfunction which uses the handle to seek the start position of the interesting block, reads the block into a bytestring (lazy or not, didn't make any difference here) and calls the function that scans this byte string for the interesting part. Using this approach - which results in a data structure with an approximate size of 10 MB - the program uses hundreds of megabytes of RAM, which forces my computer to swap (with the obvious results...).

I have right now two main suspects: The recursive function is tail-recursive, but I don't know whether the usual way to write these functions (with an accumulator etc) works in monadic code (the stage is, of course, the IO monad, and I am using the do-notation as I don't like the only other way I know, writing lambdas and lambdas and lambdas into the function body).

The other problem I can imagine is the passing-around of the file handle, and the subsequent reading of byte strings: Are those strings somehow attached to the handle, and does the handle work in a different way than I expected, i.e. is the handle copied while using it as an argument for another function, and exists something like a register of handles that keeps the connection upright and, therefore, excludes the (handle, string)-chunk from garbage collection?

I have, of course, been experimenting with the "seq" - function, but, honestly, I am not sure whether I got it right. Does a call to "identity $! (function arguments ...)" force the full evaluation of the function?

Greetings!
Moritz

"Moritz Tacke"
> I have some resource problems when extracting data from a file. The task is as follows: I have a huge (500MB) binary file, containing some interesting parts and lots of rubbish. Furthermore, there is a directory that tells me the parts of the file (first- and last byte index) that contain the substrings I need. My approach to do this is to open the file and to pass the list of addresses along with the handle to a function that processes the list step-by-step and calls a subfunction which uses the handle to seek the start position of the interesting block, reads the block into a bytestring (lazy or not, didn't make any difference here) and calls the function that scans this byte string for the interesting part. Using this approach - which results in a data structure with an approximate size of 10 MB - the program uses hundreds of megabytes of RAM, which forces my computer to swap (with the obvious results...).
You may want to post the relevant parts of your source code on hpaste.org for reference.
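Since the original code was not posted, the following is only a minimal sketch of the approach described above, assuming strict ByteStrings. The names readBlock, scanBlock and extractAll are hypothetical, and scanBlock is a stand-in for the real scanner. One point that may be relevant, although the thread does not confirm it as the cause: a substring taken from a strict ByteString shares the buffer of the whole block, so keeping many small substrings can keep every block alive. B.copy detaches the substring.

```haskell
import qualified Data.ByteString as B
import System.IO

-- | Read the inclusive byte range (first, final) as a strict ByteString.
readBlock :: Handle -> (Integer, Integer) -> IO B.ByteString
readBlock h (first, final) = do
  hSeek h AbsoluteSeek first
  B.hGet h (fromIntegral (final - first + 1))

-- | Hypothetical stand-in for the real scanner: keeps the first four bytes.
-- B.take returns a slice that shares the block's buffer; B.copy detaches it
-- so that the large block itself can be garbage collected.
scanBlock :: B.ByteString -> B.ByteString
scanBlock = B.copy . B.take 4

-- | Walk the directory entry by entry, forcing each result before moving on.
extractAll :: Handle -> [(Integer, Integer)] -> IO [B.ByteString]
extractAll h = mapM step
  where
    step range = do
      block <- readBlock h range
      return $! scanBlock block
```

The handle is assumed to be opened in binary mode, for example with withBinaryFile path ReadMode (\h -> extractAll h ranges).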
> I have right now two main suspects: The recursive function is tail-recursive, but I don't know whether the usual way to write these functions (with an accumulator etc) works in monadic code (the stage is, of course, the IO monad, and I am using the do-notation as I don't like the only other way I know, writing lambdas and lambdas and lambdas into the function body). The other problem I can imagine is the passing-around of the file handle, and the subsequent reading of byte strings: Are those strings somehow attached to the handle, and does the handle work in a different way than I expected, i.e. is the handle copied while using it as an argument for another function, and exists something like a register of handles that keeps the connection upright and, therefore, excludes the (handle, string)-chunk from garbage collection?
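Usually no, unless you read the file with a lazy read function like hGetContents, in which case the unread remainder of the string stays tied to the handle. The handle itself is not copied when you pass it to another function. As for the notation: the do-notation and the explicit notation with >>= and lambdas are equivalent. When compiling, the do-notation is simply translated to the explicit notation.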
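To illustrate the equivalence, here is the same small IO action written both ways. The function names and the use of an IORef are assumptions made only for this example.

```haskell
import Data.IORef

-- | do-notation: return the old value and store its successor.
bumpDo :: IORef Int -> IO Int
bumpDo ref = do
  n <- readIORef ref
  writeIORef ref (n + 1)
  return n

-- | The same action written with >>= and >>, which is roughly what the
-- compiler produces from the do block above.
bumpBind :: IORef Int -> IO Int
bumpBind ref =
  readIORef ref >>= \n ->
    writeIORef ref (n + 1) >>
      return n
```

Both versions behave identically, so the choice of notation has no effect on memory use.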
> I have, of course, been experimenting with the "seq" - function, but, honestly, I am not sure whether I got it right. Does a call to "identity $! (function arguments ...)" force the full evaluation of the function?
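No. a `seq` b says that 'a' is evaluated before the result 'b' is returned, and 'a' is evaluated only to weak head normal form, i.e. up to its outermost constructor. Likewise, f $! x forces only the argument x, and only that far, before applying f. The function itself may still treat its arguments lazily, which makes a difference when it's recursive.

Greets,
Ertugrul.

--
nightmare = unsafePerformIO (getWrongWife >>= sex)
http://blog.ertes.de/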
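A small sketch of both points, with hypothetical names: whnfOnly shows that seq stops at the outermost constructor, and sumResults shows one way an accumulator can be kept evaluated in a tail-recursive IO loop.

```haskell
-- | seq forces only the pair constructor, not its fields, so the
-- undefined first component is never touched.
whnfOnly :: Int
whnfOnly =
  let p = (undefined :: Int, 1)
  in p `seq` snd p

-- | Run the actions in order and sum their results. The accumulator is
-- forced on every step, so no chain of unevaluated additions builds up.
sumResults :: [IO Int] -> IO Int
sumResults = go 0
  where
    go acc [] = return acc
    go acc (action : rest) = do
      n <- action
      let acc' = acc + n
      acc' `seq` go acc' rest
```

For an Int accumulator, weak head normal form is already the fully evaluated number. For a structured accumulator such as a list or a record, seq alone would not evaluate the contents.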
Participants (2): Ertugrul Soeylemez, Moritz Tacke