Re: [Haskell-cafe] Question about memory usage

[CC-ing café again]

On Aug 16, 2010, at 5:52 PM, Daniel Fischer wrote:
I am a bit concerned about the memory usage.
Of your implementation of the matrix power algorithm?
Yes.
Making the fields of Matrix strict should help:
data Matrix a = Matrix !a !a !a
Making all fields strict makes memory usage worse. When making only the first field strict, memory usage hardly differs from the current implementation without strictness annotations.

Sebastian

--
Underestimating the novelty of the future is a time-honored tradition. (D.G.)

On Mon, Aug 16, 2010 at 9:00 AM, Sebastian Fischer < sebf@informatik.uni-kiel.de> wrote:
[CC-ing café again]
On Aug 16, 2010, at 5:52 PM, Daniel Fischer wrote:
I am a bit concerned about the memory usage.
Of your implementation of the matrix power algorithm?
Yes.
Making the fields of Matrix strict should help:
data Matrix a = Matrix !a !a !a
Making all fields strict makes memory usage worse. When making only the first field strict, memory usage hardly differs (from the current implementation without strictness annotations).
That's interesting. So next I would use heap profiling to find out where and what type of data the calculation is using. I would do heap profiling and look at the types. The strictness above should help when there are lots of thunks in memory for computing the values in the Matrix.

To get started at this task, I recommend the chapter on performance from Real World Haskell: http://book.realworldhaskell.org/read/profiling-and-optimization.html

I'd love to hear what you learn, as I probably won't have time to poke at it.

Jason
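[Editor's note: the heap-profiling workflow Jason suggests might look roughly like this with a recent GHC; the binary name `calcfib` matches the program shown later in the thread, and the exact flag spellings (`-fprof-auto` in particular) are assumptions that differ in older GHC releases such as 6.12, where the flag was `-auto-all`.]

```shell
# Build with profiling support and automatic cost-centre annotations
ghc -O2 -prof -fprof-auto -rtsopts calcfib.hs -o calcfib

# Heap profile broken down by closure type (-hy); writes calcfib.hp
./calcfib 100000000 +RTS -hy -s

# Retainer profiling (-hr) shows which cost centres keep data alive
./calcfib 100000000 +RTS -hr

# Render the heap profile as a PostScript graph
hp2ps -c calcfib.hp
```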

On Aug 17, 2010, at 12:33 AM, Jason Dagit wrote:
So next I would use heap profiling to find out where and what type of data the calculation is using.
I did.
I would do heap profiling and look at the types.
All retained data is of type ARR_WORDS. Retainer profiling shows that the majority is retained by the cost center SYSTEM.

I suspected that a strict, tail-recursive version of `matrixPower` might be more efficient and implemented it: http://github.com/sebfisch/fibonacci/blob/tailrec/Data/Numbers/Fibonacci.hs

But it's worse. Still, the only retained data is of type ARR_WORDS, mostly retained by SYSTEM, but even more of it now. Additional bang patterns on `square` and `times` make no difference.

I wonder whether the numbers in a single step of the computation occupy all the memory or whether old numbers are retained although they shouldn't be. Are (even large) Integers evaluated completely when forcing their head-normal form?

Any insight from a performance guru is welcome ;)

Cheers,
Sebastian

[off-topic post scriptum]

On Aug 16, 2010, at 6:03 PM, Jacques Carette wrote:
Any sequence of numbers given by a linear recurrence equation with constant coefficients can be computed quickly using asymptotically efficient matrix operations. In fact, the code to do this can be derived automatically from the recurrence itself.
This is neat. Is it always M^n for some matrix M? How does it work? -- Underestimating the novelty of the future is a time-honored tradition. (D.G.)

On Tuesday 17 August 2010 08:59:29, Sebastian Fischer wrote:
I wonder whether the numbers in a single step of the computation occupy all the memory or whether old numbers are retained although they shouldn't be.
BTW, what sort of memory usage are we talking about here? I have now tried your code and I didn't find the memory usage too extreme. Be aware that fib (10^8) requires about 70 million bits, and you need several times that for the computation. If you actually print out the entire number, the String will of course require an enormous amount of memory.

Concerning the effect of strictifying: at least part of it is due to the fact that with a strict matrix you need to make a few more multiplications of large Integers; those take time, and the results occupy more space than the thunks. Considering that the recursion isn't deep (you simply don't have the memory to calculate fib (2^64)), laziness doesn't have a chance to cause a space leak here, so bangifying doesn't help.
Are (even large) Integers evaluated completely when forcing their head-normal form?
Yes, when you force the weak head normal form of an Integer, it is completely evaluated.
Any insight of a performance guru is welcome ;)
Cheers, Sebastian
[off topic post scriptum]
On Aug 16, 2010, at 6:03 PM, Jacques Carette wrote:
Any sequence of numbers given by a linear recurrence equation with constant coefficients can be computed quickly using asymptotically efficient matrix operations. In fact, the code to do this can be derived automatically from the recurrence itself.
This is neat. Is it always M^n for some matrix M? How does it work?
Yes, it's always M^n.

If the relation is

  a_n = c_1*a_(n-1) + ... + c_k*a_(n-k)

you have the k×k matrix

  c_1  c_2  c_3  ...  c_(k-1)  c_k
   1    0    0   ...     0      0
   0    1    0   ...     0      0
  ...
   0    0    0   ...     1      0

to raise to the n-th power; then multiply the resulting matrix with the vector of initial values (a_(k-1), a_(k-2), ..., a_0) and take the last component. (A propos of this, your Fibonacci numbers are off by one: fib 0 = 0, fib 9 = 34, fib 10 = 55.)

It works because

  M * (a_(n-1), ..., a_(n-k))
    = (c_1*a_(n-1) + ... + c_k*a_(n-k), a_(n-1), ..., a_(n-(k-1)))
    = (a_n, a_(n-1), ..., a_(n-(k-1)))

However, for large k this isn't particularly efficient, since standard matrix multiplication is O(k^3). These matrices have a special structure that allows doing a multiplication in O(k^2). You might want to look into the Cayley-Hamilton theorem for the latter.
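[Editor's note: for k = 2 the recipe above specializes to the familiar Fibonacci matrix. A minimal Haskell sketch of M^n by binary powering follows; the names are illustrative and this is not the code from Sebastian's package.]

```haskell
-- 2x2 integer matrix, row-major: M2 a b c d means /a b\ \c d/
data M2 = M2 !Integer !Integer !Integer !Integer

mul :: M2 -> M2 -> M2
mul (M2 a b c d) (M2 e f g h) =
  M2 (a*e + b*g) (a*f + b*h) (c*e + d*g) (c*f + d*h)

-- M^n by repeated squaring: O(log n) matrix multiplications
power :: M2 -> Int -> M2
power m 1 = m
power m n
  | even n    = let s = power m (n `div` 2) in mul s s
  | otherwise = mul m (power m (n - 1))

-- (1 1; 1 0)^n = (F(n+1) F(n); F(n) F(n-1)), with fib 0 = 0
fib :: Int -> Integer
fib 0 = 0
fib n = let M2 _ b _ _ = power (M2 1 1 1 0) n in b
```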

BTW, what sort of memory usage are we talking about here?
I was referring to the memory usage of this program:

import System.Environment
import Data.Numbers.Fibonacci

main :: IO ()
main = do
  n <- (read . head) `fmap` getArgs
  (fib n :: Integer) `seq` return ()

compiled with -O2 and run with +RTS -s:

./calcfib 100000000 +RTS -s
451,876,020 bytes allocated in the heap
10,376 bytes copied during GC
19,530,184 bytes maximum residency (9 sample(s))
12,193,760 bytes maximum slop
97 MB total memory in use (6 MB lost due to fragmentation)

Generation 0: 40 collections, 0 parallel, 0.00s, 0.00s elapsed
Generation 1: 9 collections, 0 parallel, 0.00s, 0.00s elapsed

INIT  time  0.00s  (  0.00s elapsed)
MUT   time 12.47s  ( 13.12s elapsed)
GC    time  0.00s  (  0.00s elapsed)
EXIT  time  0.00s  (  0.00s elapsed)
Total time 12.47s  ( 13.13s elapsed)

%GC time 0.0% (0.0% elapsed)
Alloc rate 36,242,279 bytes per MUT second
Productivity 100.0% of total user, 95.0% of total elapsed

I'm not sure how to interpret "bytes allocated" and "maximum residency", especially because the program spends no time during GC. But the 97 MB "total memory" corresponds to what my process monitor shows.
I have now tried your code and I didn't find the memory usage too extreme. Be aware that fib (10^8) requires about 70 million bits and you need several times that for the computation.
Then, roughly ten 70m-bit numbers fit into the total memory used:

ghci> (97 * 1024^2 * 8) / (70 * 10^6)
11.624213942857143

I expected to retain about three such large numbers, not ten, and although the recursion depth is not deep I expected some garbage. Both expectations may be mistaken, of course.
If the relation is
a_n = c_1*a_(n-1) + ... + c_k*a_(n-k)
you have the k×k matrix [...] to raise to the n-th power, multiply the resulting matrix with the vector of initial values (a_(k-1), a_(k-2), ..., a_0) and take the last component
Interesting.
(a propos of this, your Fibonacci numbers are off by one, fib 0 = 0, fib 9 = 34, fib 10 = 55).
Wikipedia also starts with zero but states that "some sources" omit the leading zero. I took the liberty of doing the same (and documented it as such), as it seems to match the algorithm better. It also corresponds to the rabbit analogy.
These matrices have a special structure that allows doing a multiplication in O(k^2). You might want to look into the Cayley-Hamilton theorem for the latter.
I don't see the link to the CH theorem yet -- probably because I didn't know it. I did observe that all matrices during the computation have the form

/a b\
\b c/

which simplifies multiplications. Is this related to CH? Or can I further improve the multiplications using insights from CH?

Sebastian

--
Underestimating the novelty of the future is a time-honored tradition. (D.G.)

Sebastian Fischer wrote:
BTW, what sort of memory usage are we talking about here?
I was referring to the memory usage of this program
import System.Environment
import Data.Numbers.Fibonacci
main :: IO ()
main = do
  n <- (read . head) `fmap` getArgs
  (fib n :: Integer) `seq` return ()
compiled with -O2 and run with +RTS -s:
./calcfib 100000000 +RTS -s
451,876,020 bytes allocated in the heap
10,376 bytes copied during GC
19,530,184 bytes maximum residency (9 sample(s))
12,193,760 bytes maximum slop
97 MB total memory in use (6 MB lost due to fragmentation)

Generation 0: 40 collections, 0 parallel, 0.00s, 0.00s elapsed
Generation 1: 9 collections, 0 parallel, 0.00s, 0.00s elapsed

INIT  time  0.00s  (  0.00s elapsed)
MUT   time 12.47s  ( 13.12s elapsed)
GC    time  0.00s  (  0.00s elapsed)
EXIT  time  0.00s  (  0.00s elapsed)
Total time 12.47s  ( 13.13s elapsed)
%GC time 0.0% (0.0% elapsed)
Alloc rate 36,242,279 bytes per MUT second
Productivity 100.0% of total user, 95.0% of total elapsed ..
You could try:

fibo :: Int -> Integer
fibo n =
  if n <= 1
    then fromIntegral n
    else if even n
      then let (f_nd2m1, f_nd2) = dfibo (n `quot` 2 - 1)
           in f_nd2 * (f_nd2 + (f_nd2m1 + f_nd2m1))
      else let (f_nd2, f_nd2p1) = dfibo (n `quot` 2)
           in f_nd2*f_nd2 + f_nd2p1*f_nd2p1
  where
  dfibo :: Int -> (Integer, Integer)
  dfibo n =
    if n <= 1
      then (fromIntegral n, 1)
      else let (f_nd2, f_nd2p1) = dfibo (n `quot` 2)
           in if even n
                then let f_n   = (f_nd2p1 + (f_nd2p1 - f_nd2)) * f_nd2
                         f_np1 = f_nd2*f_nd2 + f_nd2p1*f_nd2p1
                     in seq f_n (seq f_np1 (f_n, f_np1))
                else let f_n   = f_nd2*f_nd2 + f_nd2p1*f_nd2p1
                         f_np1 = f_nd2p1 * (f_nd2p1 + (f_nd2 + f_nd2))
                     in seq f_n (seq f_np1 (f_n, f_np1))

It allocates less and has a smaller maximum residency (GHC 6.12.2, Windows 7 64-bit):

292,381,520 bytes allocated in the heap
15,200 bytes copied during GC
13,020,308 bytes maximum residency (8 sample(s))
6,423,332 bytes maximum slop
99 MB total memory in use (9 MB lost due to fragmentation)

Generation 0: 29 collections, 0 parallel, 0.00s, 0.00s elapsed
Generation 1: 8 collections, 0 parallel, 0.00s, 0.00s elapsed

INIT  time 0.00s  ( 0.02s elapsed)
MUT   time 5.85s  ( 5.88s elapsed)
GC    time 0.00s  ( 0.00s elapsed)
EXIT  time 0.00s  ( 0.00s elapsed)
Total time 5.85s  ( 5.90s elapsed)

%GC time 0.0% (0.0% elapsed)
Alloc rate 49,979,430 bytes per MUT second
Productivity 100.0% of total user, 99.2% of total elapsed

instead of:

451,864,588 bytes allocated in the heap
8,600 bytes copied during GC
17,362,424 bytes maximum residency (8 sample(s))
11,332,592 bytes maximum slop
99 MB total memory in use (9 MB lost due to fragmentation)

Generation 0: 41 collections, 0 parallel, 0.00s, 0.00s elapsed
Generation 1: 8 collections, 0 parallel, 0.00s, 0.00s elapsed

INIT  time 0.00s  ( 0.02s elapsed)
MUT   time 9.11s  ( 9.14s elapsed)
GC    time 0.00s  ( 0.00s elapsed)
EXIT  time 0.00s  ( 0.00s elapsed)
Total time 9.11s  ( 9.16s elapsed)

%GC time 0.0% (0.0% elapsed)
Alloc rate 49,598,449 bytes per MUT second
Productivity 100.0% of total user, 99.5% of total elapsed

Kind regards,
John van Groningen

Hi John,
You could try: [...]
It allocates less and has a smaller maximum residency: (ghc 6.12.2,windows 7 64)
292,381,520 bytes allocated in the heap 13,020,308 bytes maximum residency (8 sample(s)) 99 MB total memory in use (9 MB lost due to fragmentation)
MUT time 5.85s ( 5.88s elapsed)
instead of:
451,864,588 bytes allocated in the heap 17,362,424 bytes maximum residency (8 sample(s)) 99 MB total memory in use (9 MB lost due to fragmentation)
MUT time 9.11s ( 9.14s elapsed)
Interesting. It uses the same amount of memory but is faster, probably because it allocates less. But I prefer programs for people to read over programs for computers to execute, and I have a hard time verifying that your algorithm computes Fibonacci numbers. I find it much easier to see from my implementation. Is your implementation the same algorithm? If so, the transformations you made by hand should ideally be made by the compiler!

Cheers,
Sebastian

--
Underestimating the novelty of the future is a time-honored tradition. (D.G.)

Sebastian Fischer wrote:
.. Interesting. It uses the same amount of memory but is faster probably because it allocates less.
But I prefer programs for people to read over programs for computers to execute and I have a hard time to verify that your algorithm computes
According to http://en.wikipedia.org/wiki/Fibonacci_number

F(2n-1) = F(n)^2 + F(n-1)^2          -- 1
F(2n)   = (2*F(n-1) + F(n)) * F(n)   -- 2

therefore:

F(2n)   = (2*(F(n+1) - F(n)) + F(n)) * F(n)   -- use F(n-1) = F(n+1) - F(n) in 2
        = (2*F(n+1) - F(n)) * F(n)
F(2n+1) = F(n+1)^2 + F(n)^2                   -- substitute n+1 in 1
F(2n+2) = F(2*(n+1))                          -- use 2
        = (2*F(n) + F(n+1)) * F(n+1)

So if we know F(n) and F(n+1), we can compute F(2n), F(2n+1) and F(2n+2). dfibo computes (F(n), F(n+1)) using these equations.
..
Kind regards,
John van Groningen
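[Editor's note: the doubling identities John derives above can also be rendered as a compact "fast doubling" function. This sketch is in the same spirit as his dfibo but is not his exact code; the names are illustrative.]

```haskell
-- fibPair n = (F(n), F(n+1)), computed by halving n each step
fibPair :: Int -> (Integer, Integer)
fibPair 0 = (0, 1)
fibPair n =
  let (fk, fk1) = fibPair (n `div` 2)   -- k = n `div` 2
      f2k  = fk * (2 * fk1 - fk)        -- F(2k)   = (2*F(k+1) - F(k)) * F(k)
      f2k1 = fk * fk + fk1 * fk1        -- F(2k+1) = F(k)^2 + F(k+1)^2
  in if even n
       then (f2k, f2k1)
       else (f2k1, f2k + f2k1)          -- F(2k+2) = F(2k) + F(2k+1)

fib :: Int -> Integer
fib = fst . fibPair
```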

On Tuesday 17 August 2010 17:33:15, Sebastian Fischer wrote:
BTW, what sort of memory usage are we talking about here?
./calcfib 100000000 +RTS -s 451,876,020 bytes allocated in the heap 10,376 bytes copied during GC 19,530,184 bytes maximum residency (9 sample(s)) 12,193,760 bytes maximum slop 97 MB total memory in use
That's compatible with the figures I got for a similar programme. I just wanted to make sure you weren't seeing much higher usage.
I'm not sure how to interpret "bytes allocated" and "maximum residency" especially because the program spends no time during GC. But the 97 MB "total memory" correspond to what my process monitor shows.
The "bytes allocated" figure says how many bytes the RTS allocated in total during the run of the programme. It's not closely related to memory requirements: for example, if you have an almost tight loop which just has too many variables to keep them all in registers, you'll get high allocation figures for long-running loops even though the loop runs in small constant space (of course, allocating, say, one Int on the heap in each iteration costs performance versus an entirely-in-registers loop).

The maximum residency, iirc, is the maximal size of live heap the garbage collector finds in any of the major GCs (there were 9, but since there were only few items, all 9 major and 40 minor GCs together took too little time to display a positive value). 19.5 MB could be the one final Fibonacci number plus three of about half the size; that seems a reasonable value (and if the GC had run at slightly different times, you might have gotten different values).

The total memory in use also includes garbage waiting to be collected. I don't remember any details of GHC's GC algorithm, but having total memory about five times maximum residency doesn't seem unbelievable.
I have now tried your code and I didn't find the memory usage too extreme. Be aware that fib (10^8) requires about 70 million bits and you need several times that for the computation.
Then, roughly ten 70m bit numbers fit into the total memory used:
ghci> (97 * 1024^2 * 8) / (70 * 10^6)
11.624213942857143
I expected to retain about three of such large numbers, not ten, and although the recursion depth is not deep I expected some garbage. Both expectations may be a mistake, of course.
Yes, the memory usage is higher than I'd expect too, but it's well-behaved (sublinear growth), so I'm not overly concerned.
If the relation is
a_n = c_1*a_(n-1) + ... + c_k*a_(n-k)
you have the k×k matrix [...] to raise to the n-th power, multiply the resulting matrix with the vector of initial values (a_(k-1), a_(k-2), ..., a_0) and take the last component
Interesting.
(a propos of this, your Fibonacci numbers are off by one, fib 0 = 0, fib 9 = 34, fib 10 = 55).
Wikipedia also starts with zero but states that "some sources" omit the leading zero. I took the liberty to do the same (and documented it like this) as it seems to match the algorithm better. It also corresponds to the rabbit analogy.
It breaks, however, the nice properties that

fib n = (phi^n - (1-phi)^n) / sqrt 5
fib n `mod` 5 == 0  <=>  n `mod` 5 == 0
p prime => fib (p - (p|5)) `mod` p == 0
fib n = (-1)^(n+1) * fib (-n)

[where, due to typographical limitations, I wrote the Legendre symbol as (p|q)]

For aesthetic reasons, I continue to regard setting fib 0 = 1 as wrong.
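[Editor's note: under the fib 0 = 0 convention, the divisibility property above is easy to spot-check. The naive Fibonacci list here is for illustration only and is not part of the thread's code.]

```haskell
-- 0, 1, 1, 2, 3, 5, ... under the fib 0 = 0 convention
fibs :: [Integer]
fibs = 0 : 1 : zipWith (+) fibs (tail fibs)

-- check fib n `mod` 5 == 0  <=>  n `mod` 5 == 0 for n = 0..m
divisibilityHolds :: Int -> Bool
divisibilityHolds m =
  and [ ((fibs !! n) `mod` 5 == 0) == (n `mod` 5 == 0) | n <- [0 .. m] ]
```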
These matrices have a special structure that allows doing a multiplication in O(k^2). You might want to look into the Cayley-Hamilton theorem for the latter.
I don't see the link to the CH theorem yet -- probably because I didn't know it.
Cayley-Hamilton allows you to replace the matrix exponentiation by calculating (X^n) modulo the characteristic polynomial of M and thus matrix multiplication by multiplication of polynomials (of degree <= (k-1)). The link is pretty indirect, however, and Cayley-Hamilton/multiplication of polynomials is the endpoint of a twisted path.
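[Editor's note: a sketch of that endpoint for the Fibonacci case. By Cayley-Hamilton one can compute X^n modulo the characteristic polynomial x^2 - x - 1 of the Fibonacci matrix; a residue a*x + b is represented as the pair (a, b), and x^n mod (x^2 - x - 1) = F(n)*x + F(n-1). The names are illustrative, not from the thread's code.]

```haskell
-- multiply two residues, reducing via x^2 = x + 1:
-- (a*x + b)(c*x + d) = ac*x^2 + (ad+bc)*x + bd = (ac+ad+bc)*x + (ac+bd)
pmul :: (Integer, Integer) -> (Integer, Integer) -> (Integer, Integer)
pmul (a, b) (c, d) = (a*c + a*d + b*c, a*c + b*d)

-- x^n mod (x^2 - x - 1) by binary powering
ppow :: (Integer, Integer) -> Int -> (Integer, Integer)
ppow p 1 = p
ppow p n
  | even n    = let s = ppow p (n `div` 2) in pmul s s
  | otherwise = pmul p (ppow p (n - 1))

-- the coefficient of x in x^n is F(n)
fib :: Int -> Integer
fib 0 = 0
fib n = fst (ppow (1, 0) n)
```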
I did observe that all matrices during the computation have the form
/a b\
\b c/
which simplifies multiplications.
Yes, the powers of a symmetric matrix are always symmetric. That follows from trans(A·B) = trans(B)·trans(A), so trans(A^n) = (trans(A))^n, which, if A is symmetric (i.e. trans(A) = A), reduces to A^n.
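[Editor's note: the symmetric shape Sebastian observed can indeed be exploited: squaring /a b\ \b c/ needs only four distinct big-Integer products (a², b², c², and b*(a+c)) instead of the eight of a naive 2x2 product. A sketch with illustrative names, not the package's code; mulM below is valid here because powers of the same matrix commute, so all products stay symmetric.]

```haskell
-- symmetric 2x2 matrix /a b\ \b c/
data Sym = Sym !Integer !Integer !Integer

-- squaring with four distinct multiplications
square :: Sym -> Sym
square (Sym a b c) =
  let b2 = b * b
  in Sym (a*a + b2) (b * (a + c)) (b2 + c*c)

-- product of two commuting symmetric matrices (result is symmetric)
mulM :: Sym -> Sym -> Sym
mulM (Sym a b c) (Sym d e f) = Sym (a*d + b*e) (a*e + b*f) (b*e + c*f)

-- (Sym 1 1 0)^n carries F(n) in the off-diagonal slot
fibPow :: Int -> Sym
fibPow 1 = Sym 1 1 0
fibPow n
  | even n    = square (fibPow (n `div` 2))
  | otherwise = mulM (Sym 1 1 0) (fibPow (n - 1))

fib :: Int -> Integer
fib 0 = 0
fib n = let Sym _ b _ = fibPow n in b
```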
Is this related to CH?
No, entirely unrelated. It's the fact that the coefficient of fib (n-2) is 1 in the recursion formula for fib n.
Or can I further improve the multiplications using insights from CH?
It's not worth it for Fibonacci numbers: for small k, an O(k^3) algorithm is better than an O(k^2) algorithm with larger constant factors. If you had a linear recursion involving, say, a_(n-1000), it'd be a huge gain.

Daniel Fischer wrote:
On Aug 16, 2010, at 6:03 PM, Jacques Carette wrote:
Any sequence of numbers given by a linear recurrence equation with constant coefficients can be computed quickly using asymptotically efficient matrix operations. In fact, the code to do this can be derived automatically from the recurrence itself.
This is neat. Is it always M^n for some matrix M? How does it work?
Yes, it's always M^n.
If the relation is
a_n = c_1*a_(n-1) + ... + c_k*a_(n-k)
you have the k×k matrix
c_1  c_2  c_3  ...  c_(k-1)  c_k
 1    0    0   ...     0      0
 0    1    0   ...     0      0
...
 0    0    0   ...     1      0
to raise to the n-th power,
However, for large k, this isn't particularly efficient since standard matrix multiplication is O(k^3).

I said "asymptotically efficient matrix operations", which in practice means between O(k^2.7) and O(k^2.5), depending on the implementation.
These matrices have a special structure that allows doing a multiplication in O(k^2). You might want to look into the Cayley-Hamilton theorem for the latter.
Special multiplication by M is indeed O(k^2), but M^n is going to be dense (if the order of the recurrence is k, then M^n is fully dense for n >= k). And the interesting part is getting high iterates out, not having super large recurrences. In any case, this really doesn't matter in practice since k tends to be *fixed*, so it is really a 'small constant'. The real advantage comes through doing *binary powering*, not the matrix arithmetic.

Jacques
participants (5)
- Daniel Fischer
- Jacques Carette
- Jason Dagit
- John van Groningen
- Sebastian Fischer