Seems to me that culprit is in function random as I have tested rest of code
and didn't found speed related problems.
The problem with your original program was that it was not pure enough. Because you stored your PRNG state in an IORef, you forced the program to allocate and case-inspect boxed Ints in its inner loop.
I refactored it slightly to make genRand and genRandom pure, and combined with using the LLVM back end, this doubled the program's performance, so that the Haskell program ran at the same speed as your C++ version.
The next bottleneck was that your program was performing floating point arithmetic in the inner loop. I changed it to precompute a small lookup table, followed by only using integer arithmetic in the inner loop (the same technique used by the fastest C fasta program). This further improved performance: the new Haskell code is 40% faster than the C++ program, and only ~20% slower than the C program that currently tops the shootout chart. The Haskell source is a little over half the size of the C source.