Data.Binary.Get for large files

Hello again folks,

Sorry to keep troubling you - I'm very appreciative of the help you've given so far. I've got one more for you that has got me totally stumped. I'm writing a program which deals with largish files; the one I am using as a test case is not stupidly large, at about 200 MB. After three evenings, I have finally gotten rid of all the stack overflows, but I am unfortunately left with something that is rather infeasibly slow. I was hoping someone with keener skills than I could take a look; I've tried to distil it to the simplest case.

This program just reads in a file, interpreting each value as a double, and does a sort of running average on them. The actual function doesn't matter too much; I think it is the reading in that is the problem. Here's the code:

    import Control.Exception
    import qualified Data.ByteString.Lazy as BL
    import Data.Binary.Get
    import System.IO
    import Data.Binary.IEEE754

    myGetter acc = do
      e <- isEmpty
      if e == True
        then return acc
        else do
          t <- getFloat64le
          myGetter $! ((t + acc) / 2)

    myReader file = do
      h <- openBinaryFile file ReadMode
      bs <- BL.hGetContents h
      return $ runGet (myGetter 0) bs

    main = do
      d <- myReader "data.bin"
      evaluate d

This takes about three minutes to run on my (fairly modern) laptop. The equivalent C program takes about 5 seconds. I'm sure I am doing something daft, but I can't for the life of me see what. Any hints about how to get the profiler to show me useful stuff would be much appreciated!

All the best,

Philip

PS: If, instead of computing a single value, I try to build a list of the values, the program ends up using over 2 GB of memory to read a 200 MB file. Any ideas on that one?

I can't find the error in your code (assuming there is an error), so I'm checking the code you didn't write, and the only thing that set off an alarm was this:

    getFloat64le :: Get Double
    getFloat64le = getFloat (ByteCount 8) $ splitBytes . reverse

    splitBytes :: [Word8] -> RawFloat

Every chunk read in the Get monad is being reversed so that you can take one float in little endian (and you are taking in over 26 million floats). I really don't know if this hurts performance, but I assume the C equivalent would be reading an array in reverse order. I am more than willing to believe this is not the cause of such a performance loss, but I can't find another reason.

PS1: "(e == True)" == "e"

PS2: I know it's not important, but I can't help it: that is not an average you're computing...
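A minimal sketch of one way to test the byte-reversal hypothesis (an assumption, not code from the thread): Data.Binary.IEEE754 also exports getFloat64be, which decodes without the reverse, so a big-endian variant of the getter isolates that cost.

    import Data.Binary.Get
    import Data.Binary.IEEE754

    -- hypothetical variant of myGetter: read big-endian, so the
    -- byte-list reversal in getFloat64le never runs; if this is just
    -- as slow, the reverse is not the bottleneck
    myGetterBE :: Double -> Get Double
    myGetterBE acc = do
      e <- isEmpty
      if e
        then return acc
        else do
          t <- getFloat64be
          myGetterBE $! ((t + acc) / 2)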

On Friday 30 April 2010 02:46:08, MAN wrote:
> I can't find the error in your code (assuming there is an error), so I'm checking the code you didn't write, and the only thing that set off an alarm was this:
>
>     getFloat64le :: Get Double
>     getFloat64le = getFloat (ByteCount 8) $ splitBytes . reverse
>
>     splitBytes :: [Word8] -> RawFloat
>
> Every chunk read in the Get monad is being reversed so that you can take one float in little endian (and you are taking in over 26 million floats). I really don't know if this hurts performance, but I assume the C equivalent would be reading an array in reverse order. I am more than willing to believe this is not the cause of such a performance loss, but I can't find another reason.
The reversing doesn't matter much; using getFloat64be takes very nearly the same time.
> PS1: "(e == True)" == "e"
Yes!
> PS2: I know it's not important, but I can't help it: that is not an average you're computing...
Sort of a weighted average of the n doubles, where double #k is weighted 2^(k-1-n).
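Unfolding the recursion for, say, n = 3 makes that weighting explicit (a worked expansion, assuming the getter starts from acc = 0):

    acc1 = (x1 + 0) / 2    = x1/2
    acc2 = (x2 + acc1) / 2 = x2/2 + x1/4
    acc3 = (x3 + acc2) / 2 = x3/2 + x2/4 + x1/8

so here x3 carries weight 2^(-1), x2 carries 2^(-2), and x1 carries 2^(-3), matching 2^(k-1-n).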

On Friday 30 April 2010 00:37:59, Philip Scott wrote:
> Hello again folks,
>
> Sorry to keep troubling you - I'm very appreciative of the help you've given so far. I've got one more for you that has got me totally stumped. I'm writing a program which deals with largish files; the one I am using as a test case is not stupidly large, at about 200 MB. After three evenings, I have finally gotten rid of all the stack overflows, but I am unfortunately left with something that is rather infeasibly slow. I was hoping someone with keener skills than I could take a look; I've tried to distil it to the simplest case.
>
> This program just reads in a file, interpreting each value as a double, and does a sort of running average on them. The actual function doesn't matter too much; I think it is the reading in that is the problem.
Replace getFloat64le with e.g. getWord64le to confirm. The reading of IEEE754 floating point numbers seems rather complicated. Maybe doing it differently could speed it up, maybe not.
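A minimal sketch of that diagnostic (a hypothetical probe, not Philip's actual code): swap the decoder for getWord64le and fold the raw words, leaving everything else unchanged.

    import Data.Binary.Get
    import Data.Word

    -- hypothetical timing probe: consume the same bytes with
    -- getWord64le; if this runs fast, the IEEE754 decoding is the cost
    myGetterW :: Word64 -> Get Word64
    myGetterW acc = do
      e <- isEmpty
      if e
        then return acc
        else do
          t <- getWord64le
          myGetterW $! (acc + t)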
> This takes about three minutes to run on my (fairly modern) laptop. The equivalent C program takes about 5 seconds.
Are you sure that it's really equivalent?
> I'm sure I am doing something daft, but I can't for the life of me see what. Any hints about how to get the profiler to show me useful stuff would be much appreciated!
>
> All the best,
>
> Philip
>
> PS: If, instead of computing a single value, I try to build a list of the values, the program ends up using over 2 GB of memory to read a 200 MB file. Any ideas on that one?
Hm, a 200 MB file => ~25 million Doubles; such a list needs at least 400 MB. Still a long way to 2 GB. I suspect you construct a list of thunks, not Doubles.
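A minimal sketch of one way around that (hypothetical; myListGetter is an illustrative name, not from the thread): force each Double before consing it, so the list cells hold evaluated values rather than thunks.

    import Data.Binary.Get
    import Data.Binary.IEEE754

    -- hypothetical list builder: seq forces each Double to WHNF (which,
    -- for Double, is fully evaluated) as it is read, so the accumulated
    -- list contains plain Doubles, not closures over the ByteString
    myListGetter :: [Double] -> Get [Double]
    myListGetter acc = do
      e <- isEmpty
      if e
        then return (reverse acc)  -- restore file order
        else do
          t <- getFloat64le
          t `seq` myListGetter (t : acc)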

Hi Daniel
> Replace getFloat64le with e.g. getWord64le to confirm. The reading of IEEE754 floating point numbers seems rather complicated. Maybe doing it differently could speed it up, maybe not.
That speeds things up by a factor of about 100 :) I think there must be some efficiency to be extracted from there somewhere, either in the IEEE754 module or in Data.Binary.Get. Is it possible to get the profiler to look deeper than the top-level module? With all the options I could find, it only ever tells me about things in the file I am dealing with.

> Hm, a 200 MB file => ~25 million Doubles; such a list needs at least 400 MB. Still a long way to 2 GB. I suspect you construct a list of thunks, not Doubles.
I think you are almost certainly right. Is there an easy way to see if/how/where this is happening?

Thanks once again,

Philip

Check out the Real World Haskell chapter on profiling; it should have everything you need to track down where the thunks are sneaking in:

http://book.realworldhaskell.org/read/profiling-and-optimization.html

It's particularly great in this case because the problem being diagnosed in that chapter is most likely the same sort of problem you're seeing.
-R. Kyle Murphy
--
Curiosity was framed, Ignorance killed the cat.

On Friday 30 April 2010 23:06:07, Philip Scott wrote:
> Hi Daniel
>
> > Replace getFloat64le with e.g. getWord64le to confirm. The reading of IEEE754 floating point numbers seems rather complicated. Maybe doing it differently could speed it up, maybe not.
>
> That speeds things up by a factor of about 100 :)
Yes, for me too.
> I think there must be some efficiency to be extracted from there somewhere, either in the IEEE754 module
Look at the code. It does a lot of hard work. That is probably necessary to ensure correctness, but it's slow. If you feel like playing with fire,

    {-# LANGUAGE BangPatterns, MagicHash #-}

    import qualified Data.ByteString.Lazy as BL
    import Data.Binary
    import Data.Binary.Get
    import GHC.Prim
    import Data.Word

    getFloat64le :: Get Double
    getFloat64le = fmap unsafeCoerce# getWord64le

    myGetter !acc = do
      e <- isEmpty
      if e
        then return acc
        else do
          !t <- getFloat64le
          myGetter ((t + acc) / 2)

may work on your system (no warranties; you know what 'unsafe' means, don't you?).
> or in Data.Binary.Get.
Considering that getting Word64s is quick enough, you won't get much improvement there.
> Is it possible to get the profiler to look deeper than the top-level module?
Lots of {-# SCC "foo" #-} pragmas. Or create a local copy of the module and import that, then -prof -auto-all should give more info.
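For instance, a hypothetical annotation (getOne is an illustrative name, not from the thread):

    import Data.Binary.Get
    import Data.Binary.IEEE754

    -- hypothetical cost-centre annotation: when compiled with -prof,
    -- time and allocation inside the decoder show up under "getOne"
    -- in the .prof report instead of being lumped into the caller
    getOne :: Get Double
    getOne = {-# SCC "getOne" #-} getFloat64le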
> With all the options I could find, it only ever tells me about things in the file I am dealing with.
>
> > Hm, a 200 MB file => ~25 million Doubles; such a list needs at least 400 MB. Still a long way to 2 GB. I suspect you construct a list of thunks, not Doubles.
> I think you are almost certainly right. Is there an easy way to see if/how/where this is happening?
Read the Core, profile with all the -hx flags and look at the profiles, and show the code to more experienced Haskellers.
> Thanks once again,
>
> Philip
participants (4):

- Daniel Fischer
- Kyle Murphy
- MAN
- Philip Scott