stack overflow summing numbers read from a big file

When I try to run the following code on a 50M file consisting of one number per line I receive a `Stack space overflow' error and the program seems to consume a lot of memory. Why does that happen? And how can I fix the problem without increasing the Stack with -Ksize hoping that it will be big enough? Any general advice on processing big files with Haskell? -- sumFile.hs -- adapted from Real World Haskell's SumFile.hs at the beginning of -- Chapter 8 main = interact sumFile where sumFile = show . sum . map read . words ghc -o sumFile sumFile.hs ./sumFile < ./numbers

Hi Axel
If you see the type of your function
Prelude> :t ( show . sum . map read . words )
( show . sum . map read . words ) :: String -> String
It takes a string and return string.
On Sat, Mar 23, 2013 at 6:12 PM, Axel Wegen
When I try to run the following code on a 50M file consisting of one number per line I receive a `Stack space overflow' error and the program seems to consume a lot of memory. Why does that happen?
It's already mentioned there "A String is represented as a list of Charvalues; each element of a list is allocated individually, and has some book-keeping overhead. These factors affect the memory consumption and performance of a program that must read or write text or binary data. On simple benchmarks like this, even programs written in interpreted languages such as Python can outperform Haskell code that uses String by an order of magnitude".
And how can I fix the problem without increasing the Stack with -Ksize hoping that it will be big enough? Any general advice on processing big files with Haskell?
Each ByteString type performs better under particular circumstances. For streaming a large quantity (hundreds of megabytes to terabytes) of data, the lazy ByteString type is usually best. Its chunk size is tuned to be friendly to a modern CPU's L1 cache, and a garbage collector can quickly discard chunks of streamed data that are no longer being used.
-- sumFile.hs -- adapted from Real World Haskell's SumFile.hs at the beginning of -- Chapter 8 main = interact sumFile where sumFile = show . sum . map read . words
ghc -o sumFile sumFile.hs ./sumFile < ./numbers
See if this code is working ( I haven't tested it on big file ) import qualified Data.ByteString.Lazy.Char8 as BS import Data.Maybe ( fromJust ) readI :: BS.ByteString -> Integer readI = fst . fromJust . BS.readInteger main = BS.interact sumFile where sumFile = BS.pack . show . sum . map readI . BS.words -Mukesh
_______________________________________________ Beginners mailing list Beginners@haskell.org http://www.haskell.org/mailman/listinfo/beginners

mukesh tiwari
It's already mentioned there "A String is represented as a list of Char values; each element of a list is allocated individually, and has some book-keeping overhead. These factors affect the memory consumption and performance of a program that must read or write text or binary data. On simple benchmarks like this, even programs written in interpreted languages such as Python can outperform Haskell code that uses String by an order of magnitude".
I assumed that just means using plain Strings for that job will take more time.
import qualified Data.ByteString.Lazy.Char8 as BS import Data.Maybe ( fromJust )
readI :: BS.ByteString -> Integer readI = fst . fromJust . BS.readInteger
main = BS.interact sumFile where sumFile = BS.pack . show . sum . map readI . BS.words
I get the same result, the stack overflow. Though I don't have wait as long for it to break. I think that there is something about the summation that makes it impossible for the compiler to do it's magic and optimize the thing to something less stack overflowing. I just don't understand what that is. -- Axel Wegen

Hi Alex
I am no expert but try to profile your code and see the memory usage. It
seems like the sum function is causing the stack overflow[1].
Try this one.
import qualified Data.ByteString.Lazy.Char8 as BS
import Data.Maybe ( fromJust )
import Data.List
readI :: BS.ByteString -> Integer
readI = fst . fromJust . BS.readInteger
main = BS.interact $ sumFile where
sumFile = BS.pack . show . sum' . map readI . BS.words
sum' = foldl' (+) 0
-Mukesh
[1] http://www.haskell.org/haskellwiki/Memory_leak
On Sat, Mar 23, 2013 at 8:59 PM, Axel Wegen
mukesh tiwari
writes: It's already mentioned there "A String is represented as a list of Char values; each element of a list is allocated individually, and has some book-keeping overhead. These factors affect the memory consumption and performance of a program that must read or write text or binary data. On simple benchmarks like this, even programs written in interpreted languages such as Python can outperform Haskell code that uses String by an order of magnitude".
I assumed that just means using plain Strings for that job will take more time.
import qualified Data.ByteString.Lazy.Char8 as BS import Data.Maybe ( fromJust )
readI :: BS.ByteString -> Integer readI = fst . fromJust . BS.readInteger
main = BS.interact sumFile where sumFile = BS.pack . show . sum . map readI . BS.words
I get the same result, the stack overflow. Though I don't have wait as long for it to break.
I think that there is something about the summation that makes it impossible for the compiler to do it's magic and optimize the thing to something less stack overflowing. I just don't understand what that is.
-- Axel Wegen
_______________________________________________ Beginners mailing list Beginners@haskell.org http://www.haskell.org/mailman/listinfo/beginners

mukesh tiwari
It seems like the sum function is causing the stack overflow[1]. Graham Gill
writes: I think the problem is with laziness in the accumulator of "sum". The prelude "sum" is defined as
`Real World Haskell' actually has a warning regarding the use of
Prelude's foldl, the space leaks in causes and how to force evaluation
using `seq' in Chapter 4 under the headings `Left Folds, Laziness, and
Space Leaks' and `Avoiding Space Leaks with seq'. It just slipped my
mind, and I had the wrong assumption of the `sum' function.
mukesh tiwari
import Data.List sum' = foldl' (+) 0
An explicit version:
sum' l = sum'' l 0
where
sum'' [] a = a
sum'' (x:xs) a = let intermediate = a + x
in intermediate `seq` sum'' xs intermediate
Graham Gill
mysum xs = sum' xs 0 where sum' [] a = a sum' (x:xs) !a = sum' xs (x + a)
http://www.haskell.org/haskellwiki/Memory_leak http://www.haskell.org/haskellwiki/Performance/Strictness Anyway I just wanted to thank you for helping me solve my problem, for
All three versions work. foldl' uses `seq' to force evaluation and therefore strictness and the `BangPatterns' are a nice syntactic feature to express strictness, because the excessive use of `seq' can make code unwieldy. At least that is as far as my not fully correct understanding goes for now. the helpful links and slightly improving my understanding of Haskell's laziness. Long way to go. -- Axel Wegen

On Sun, Mar 24, 2013 at 9:04 PM, Axel Wegen
mukesh tiwari
writes: It seems like the sum function is causing the stack overflow[1]. Graham Gill
writes: I think the problem is with laziness in the accumulator of "sum". The prelude "sum" is defined as `Real World Haskell' actually has a warning regarding the use of Prelude's foldl, the space leaks in causes and how to force evaluation using `seq' in Chapter 4 under the headings `Left Folds, Laziness, and Space Leaks' and `Avoiding Space Leaks with seq'. It just slipped my mind, and I had the wrong assumption of the `sum' function.
Well actually, you wouldn't run into this problem if you compiled with optimization options (ghc -O2 -o sumFile sumFile.hs) since the strictness analyzer is pretty good at spotting strict sums. -- Jedaï

Well actually, you wouldn't run into this problem if you compiled with optimization options (ghc -O2 -o sumFile sumFile.hs) since the strictness analyzer is pretty good at spotting strict sums. Yeah you are right. The strictness analyzer was also mentioned in one of
Chaddaï Fouché

I think the problem is with laziness in the accumulator of "sum". The prelude "sum" is defined as sum l = sum' l 0 where sum' [] a = a sum' (x:xs) a = sum' xs (a+x) For example, if in GHCi I execute
sum [1..50000000] <interactive>: out of memory
I get an out of memory exception. On the other hand, if I enable -XBangPatterns and define mysum xs = sum' xs 0 where sum' [] a = a sum' (x:xs) !a = sum' xs (x + a) Then in GHCi
mysum [1..50000000] 1250000025000000
which as you can easily check using n * (n + 1) / 2 with n = 50000000 is correct. I don't know whether I have used the strictness annotation in the best way, but it works. See http://www.haskell.org/haskellwiki/Performance/Strictness, which had some helpful hints. Graham On 23/03/2013 11:29 AM, Axel Wegen wrote:
mukesh tiwari
writes: It's already mentioned there "A String is represented as a list of Char values; each element of a list is allocated individually, and has some book-keeping overhead. These factors affect the memory consumption and performance of a program that must read or write text or binary data. On simple benchmarks like this, even programs written in interpreted languages such as Python can outperform Haskell code that uses String by an order of magnitude". I assumed that just means using plain Strings for that job will take more time.
import qualified Data.ByteString.Lazy.Char8 as BS import Data.Maybe ( fromJust )
readI :: BS.ByteString -> Integer readI = fst . fromJust . BS.readInteger
main = BS.interact sumFile where sumFile = BS.pack . show . sum . map readI . BS.words I get the same result, the stack overflow. Though I don't have wait as long for it to break.
I think that there is something about the summation that makes it impossible for the compiler to do it's magic and optimize the thing to something less stack overflowing. I just don't understand what that is.
-- Axel Wegen
_______________________________________________ Beginners mailing list Beginners@haskell.org http://www.haskell.org/mailman/listinfo/beginners
participants (4)
-
Axel Wegen
-
Chaddaï Fouché
-
Graham Gill
-
mukesh tiwari