Begin forwarded message:

From: Mahdi <mdibaiee@aol.com>
Subject: Re: [Haskell-beginners] CAF taking all the memory
Date: May 8, 2016 at 18:06:29 GMT+4:30
To: Ben Gamari <ben@smart-cactus.org>

Wow, this is a great explanation! Thank you!
I see your point on using lazy lists (although I was not aware of the specific problems you mentioned).
So, I switched over to HMatrix [0], which uses unboxed arrays for
matrices (the other libraries you mentioned added a lot of complexity to the code).

After refactoring my code to use HMatrix, I ran the program and the same thing happens;
the speed hasn’t changed much either. I then tried to take another profile to see if anything had changed,
but I couldn’t, as described here [1].

But my main question is: why would Haskell even store all this data? There are no side effects, so
there should be no memory leak in my code.
From what I read online, Haskell tries to avoid re-calculating functions by caching their
inputs/outputs (they’re pure, so this can be done), but I hoped there would be a way to disable this;
unfortunately there doesn’t seem to be one (or is there?).
If that’s the case, then using libraries won’t solve the problem, and that’s frustrating.

* I also tried compiling with -O0, but no luck.

[0] http://dis.um.es/~alberto/hmatrix/hmatrixtut.html
[1] https://github.com/albertoruiz/hmatrix/issues/190

Thank you Ben!

- Mahdi

On May 3, 2016, at 14:36, Ben Gamari <ben@smart-cactus.org> wrote:

Mahdi <mdibaiee@aol.com> writes:

Hello there,

Hi!

I’m pretty new to Haskell and I’m trying to write a Neural Network in
Haskell for educational purposes.

So, I have my neural network working; it can learn XOR in 500
iterations. But the thing is, if I increase the iterations to
something like 5 million, the process just drains my RAM until
either it’s killed or the OS grinds to a halt. Here is the code: [0]

I searched online and tried to find out why this is happening. I
profiled the memory usage and found that the memory is taken by
CAF; searching online,

Great! The next question you should then ask is "which CAF"? A CAF is
simply a top-level "constant" in your program. Indeed it sounds like you
have not defined any cost-centers, which means that GHC will attribute
the entire cost of your program to the ambiguous cost-center "CAF"
(which in this case really just means "the whole program").
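
For a concrete (made-up) illustration, a CAF is nothing more than a
top-level binding that takes no arguments,

  -- hypothetical example, not from your code:
  lookupTable :: [Double]
  lookupTable = map sqrt [1 .. 1e6]
  -- once forced, this can stay live for as long as code referring to it
  -- might still run, so a large CAF can pin down a lot of heap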

As discussed in the users guide [1], one way to define cost-centers
within your program is to manually annotate expressions with SCC
pragmas. However, in this case we simply want to let GHC do this for us,
for which we can use the `-fprof-auto -fprof-cafs` flags (which
automatically annotate top-level definitions and CAFs with
cost-centers),

  $ ghc -O examples/xor.hs -fprof-auto -fprof-cafs
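
For comparison, a manual annotation would look something like the
following sketch (the type and body of `sigmoid` are assumed here,
not taken from your code),

  sigmoid :: Double -> Double
  sigmoid x = {-# SCC "sigmoid" #-} 1 / (1 + exp (-x))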

Now your program should give a much more useful profile (run having set
the iteration count to 50000),

  $ time examples/xor +RTS -p
  [[0.99548584],[4.5138146e-3],[0.9954874],[4.513808e-3]]

  real 0m4.019s
  user 0m0.004s
  sys 0m4.013s
  $ cat xor.prof
    Tue May  3 11:15 2016 Time and Allocation Profiling Report  (Final)

      xor +RTS -p -RTS

    total time  =        1.11 secs   (1107 ticks @ 1000 us, 1 processor)
    total alloc = 1,984,202,600 bytes  (excludes profiling overheads)

  COST CENTRE  MODULE      %time %alloc

  matrixMap    Utils.Math   21.4   26.8
  matadd       Utils.Math   20.8   22.7
  matmul.\     Utils.Math   10.4   16.0
  dot          Utils.Math    7.1   13.4
  column       Utils.Math    7.0    2.9
  dot.\        Utils.Math    6.9    1.9
  rowPairs     Utils.Math    5.8    6.5
  sigmoid'     NN            4.7    0.8
  train.helper Main          4.0    1.3
  sigmoid      NN            3.3    0.8
  matmul       Utils.Math    2.2    2.0
  hadamard     Utils.Math    1.8    2.1
  columns      Utils.Math    1.2    1.3

                                                                      individual      inherited
  COST CENTRE                MODULE                no.     entries  %time %alloc   %time %alloc

  MAIN                       MAIN                   79          0    0.0    0.0   100.0  100.0
  [snip]
   CAF:main3                 Main                  143          0    0.0    0.0   100.0  100.0
    (...)                    Main                  162          1    0.0    0.0   100.0  100.0
     train                   Main                  163          1    0.0    0.0   100.0  100.0
      train.helper           Main                  164      50001    4.0    1.3   100.0  100.0
       train.helper.hweights Main                  258      50001    0.5    0.0     0.5    0.0
       train.helper.oweights Main                  235      50001    0.4    0.0     0.4    0.0
       train.helper.oback    Main                  207      50000    0.3    0.1    19.0   20.9
        backward'            NN                    208      50000    0.3    0.6    18.7   20.8
  [snip]

So, here we see that costs are actually spread throughout the program.

Without diving any deeper into this particular program it's hard to give
more guidance; however, I will say that your lazy-list Matrix
representation is very rarely the right choice for even toy linear
algebra problems.

First, consider the fact that even just a list cons constructor requires
three words (an info table pointer, a pointer to payload, and a pointer
to tail) plus the size of the payload (which in the case of an evaluated
`Float` is 2 words: one info table pointer and the floating point value
itself). So, a list of n *evaluated* `Float`s (which would require only
4*n bytes if packed densely) will require 40*n bytes if represented as a
lazy list.
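
Putting rough numbers on that (back-of-the-envelope only, assuming
8-byte words on a 64-bit machine),

  wordSize, consCell, boxedFloat :: Int
  wordSize   = 8
  consCell   = 3 * wordSize          -- info pointer, head pointer, tail pointer
  boxedFloat = 2 * wordSize          -- info pointer, payload

  lazyListBytes, packedBytes :: Int -> Int
  lazyListBytes n = n * (consCell + boxedFloat)   -- 40 * n
  packedBytes   n = n * 4                         --  4 * n, unboxed Float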

Then, consider the fact that indexing into a lazy list is an O(n)
operation: this means that your `Math.column` operation on an n x m
matrix may be O(n*m). Even worse, `Math.columns`, as used by
`Math.matmul`, is O(n * m!).
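
A `column`-style helper over a list-of-lists presumably looks roughly
like the following sketch (hypothetical; I'm guessing at your
definitions), which makes the repeated pointer-chasing explicit,

  -- (!!) walks every cons cell before the index, so extracting column j
  -- of an n-row matrix is O(n * j), and doing it for every column
  -- repeats that walk from scratch each time
  column :: Int -> [[a]] -> [a]
  column j rows = map (!! j) rows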

Finally, consider the fact that whenever you "construct" a lazy list you
aren’t actually performing any computation: you are merely
constructing a single thunk which represents the entire result; however,
if you then go to index into the middle of that list you will end up
constructing n cons cells and a thunk for the payload of each. In the
case of primitive linear algebra operations the cost of constructing
this payload thunk can be greater than simply computing the result.
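
This pattern is usually what bites an iterative training loop: each
iteration only wraps the previous state in more thunks, and nothing is
evaluated until the very end. One way to keep the intermediate state
evaluated is to force it on every step; a minimal sketch using the
`deepseq` package (the names here are hypothetical, not your code),

  import Control.DeepSeq (NFData, force)

  -- fully evaluate each intermediate result before recursing, so the
  -- heap holds one evaluated state instead of millions of stacked thunks
  iterateStrict :: NFData a => Int -> (a -> a) -> a -> a
  iterateStrict 0 _    x = x
  iterateStrict k step x =
    let x' = force (step x)
    in x' `seq` iterateStrict (k - 1) step x'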

For these reasons I wouldn't recommend using lazy lists in this
way. If you have a dense matrix, use an array (probably even unboxed;
see, for instance, the `array`, `vector`, and `repa` libraries); if
you have a sparse matrix, then use an appropriate sparse representation
(although sadly good sparse linear algebra tools are hard to come by in
Haskell). Not only will the result be significantly more efficient in
space and time, but the runtime behavior of the program will be
significantly easier to follow, since you can more easily ensure that
evaluation occurs when you expect it to.
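
To make the dense, unboxed direction concrete, here is a rough sketch
using the `vector` package (an illustration of the idea, not a drop-in
replacement for your types),

  import qualified Data.Vector.Unboxed as U

  -- a dense matrix stored row-major in a single unboxed vector:
  -- each element is 4 bytes of payload and indexing is O(1)
  data Matrix = Matrix
    { rows :: !Int
    , cols :: !Int
    , elts :: !(U.Vector Float)
    }

  index :: Matrix -> Int -> Int -> Float
  index (Matrix _ c v) i j = v U.! (i * c + j)

  matmul :: Matrix -> Matrix -> Matrix
  matmul a b = Matrix (rows a) (cols b) $
    U.generate (rows a * cols b) $ \k ->
      let (i, j) = k `divMod` cols b
      in  sum [ index a i t * index b t j | t <- [0 .. cols a - 1] ]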

Hopefully this helps. Good luck and let us know if there are further
issues.

Cheers,

- Ben


[1] http://downloads.haskell.org/~ghc/master/users-guide//profiling.html#cost-centres-and-cost-centre-stacks