
Dear everyone,
I'm always grateful for your help.
I have been assigned a complicated and growing task in which I'll perform a lot of discrete Fourier transforms, so I have measured the performance of several DFT libraries in Haskell:
http://en.pk.paraiso-lang.org/Hackage/what-is-the-fastest-dft-in-haskell/mai...
The raw result:
http://paraiso-lang.org/html/bench-dft-in-haskell.html
I'm sharing the result in the hope that some of you will also find it useful. Also, please let me know of any possible flaws or improvements in the benchmark process!
My observations are as follows:
* vector-fftw with wisdom was more than 1/2 times faster than fftw in C with wisdom (and with communication overhead.)
* vector-fftw without wisdom was significantly _faster_ than fftw in C without wisdom. I wonder why.
* vector-fftw over vector was faster than fft over CArray.
* any library that doesn't use fftw is much slower than those that do.
Best,
--
Takayuki MURANUSHI
The Hakubi Center for Advanced Research, Kyoto University
http://www.hakubi.kyoto-u.ac.jp/02_mem/h22/muranushi.html

Takayuki Muranushi wrote:
* vector-fftw with wisdom was more than 1/2 times faster than fftw in C with wisdom (and with communication overhead.) * vector-fftw without wisdom was significantly _faster_ than fftw in C without wisdom. I wonder why. * vector-fftw over vector was faster than fft over CArray. * any library that doesn't use fftw is much slower than those that does.
I have no experience with FFTW, but in general a result like this often means that the values themselves may never have been calculated. One easy way to make sure they are is to print out the whole result. If printing takes too much CPU time for a fair comparison, you need to force the result deeply, for example with deepseq.
Notably, Data.Vector is lazy in its elements: forcing the vector itself does not force the individual values. For an FFT, I would assume that the length of the resulting vector does not depend on any of the values.
Greets,
Ertugrul
--
Not to be or to be and (not to be or to be and (not to be or to be and (not to be or to be and ... that is the list monad.
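The distinction drawn above, between forcing a vector and forcing its elements, can be illustrated with a minimal sketch. It assumes the vector and deepseq packages are installed; lazyVector, whnfSucceeds and deepSucceeds are hypothetical names chosen for this example and are not part of the benchmark.

```haskell
import Control.DeepSeq (force)
import Control.Exception (ErrorCall, evaluate, try)
import qualified Data.Vector as V

-- A boxed vector whose middle element is a thunk that fails when evaluated.
-- (Hypothetical test data, used only to make element evaluation observable.)
lazyVector :: V.Vector Int
lazyVector = V.fromList [1, error "element evaluated", 3]

-- Evaluate to weak head normal form only: the vector is built,
-- but its elements remain unevaluated thunks.
whnfSucceeds :: IO Bool
whnfSucceeds = do
  r <- try (evaluate lazyVector) :: IO (Either ErrorCall (V.Vector Int))
  return (either (const False) (const True) r)

-- Evaluate to normal form: every element is forced, so the failing thunk fires.
deepSucceeds :: IO Bool
deepSucceeds = do
  r <- try (evaluate (force lazyVector)) :: IO (Either ErrorCall (V.Vector Int))
  return (either (const False) (const True) r)

main :: IO ()
main = do
  whnfSucceeds >>= \ok -> putStrLn ("WHNF evaluation succeeded: " ++ show ok)
  deepSucceeds >>= \ok -> putStrLn ("deep evaluation succeeded: " ++ show ok)
```

The first check is expected to succeed and the second to fail, because only force touches the elements. This applies to the boxed Data.Vector type; unboxed and storable vectors are strict in their elements, so the concern depends on which vector type a binding returns.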

Ertugrul:
I might be missing something in translation, but if I understand the intent
of Takayuki's message, everything needs to be calculated because the C-based
FFTW library is called (eventually). Laziness doesn't really have an impact.
The choice of underlying data structure and whether FFTW wisdom is used
clearly have a significant impact.
FFTW and Intel's MKL libraries are the acknowledged "state of the art"
libraries for performing discrete Fourier transforms. I'm not sure there's
anything better or faster for CPU implementations (I know there's an O(1)
implementation for map-reduce systems and NVIDIA's CUDA-FFT. Note that the
map-reduce approach has a preprocessing step that isn't O(1).) Interesting
to note that much of the code for FFTW was initially generated using OCaml
to find optimal versions of code for particular problem sizes.
-scooter
On Sun, Aug 5, 2012 at 6:37 PM, Ertugrul Söylemez wrote: [...]
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

Scott Michel wrote:
I might be missing something in translation, but if I understand Takayuki's message's intent, everything needs to be calculated because the C-based FFTW library is called (eventually). Laziness doesn't really have an impact.
The choice of underlying data structure and whether FFTW wisdom is used clearly has a significant impact.
If the Haskell wrapper library is a sufficiently thick, lazy layer around
FFTW, the size of the result vector may not depend on any FFTW
computation at all.
Again, I have no experience at all with FFTW or any Haskell bindings to
it. This is just a general remark that is worth keeping in mind.
Greets,
Ertugrul

Takayuki Muranushi wrote:
* vector-fftw with wisdom was more than 1/2 times faster than fftw in C with wisdom (and with communication overhead.)
I would be suspicious of that result. Calling a C function from a library should be slower from Haskell than from C.
Erik
--
----------------------------------------------------------------------
Erik de Castro Lopo
http://www.mega-nerd.com/

Dear Ertugrul, Scott and Erik,
thank you for your comments.
With respect to laziness, I make the solvers calculate the amplitude of the final FFT results (i.e. calculate the squared magnitude of the array elements and sum over them), compare the result with the expected value, and cause side effects depending on the outcome of the test. This should cause the FFT chain to be fully evaluated.
* vector-fftw with wisdom was more than 1/2 times faster than fftw in C with wisdom (and with communication overhead.)
I would be suspicious of that result. Calling a C function from a library should be slower from Haskell than from C.
Sorry for the confusion. What I meant is that the vector-fftw version takes
more time than the C version, but less than twice as much. Please compare the two lines
* "fft/cpp 1 1048576 102"
* "fft/vector-fftw 0 1048576 102"
in http://paraiso-lang.org/html/bench-dft-in-haskell.html .
P.S. including GPU contestants would be interesting!
2012/8/6 Erik de Castro Lopo wrote: [...]
Best,
--
Takayuki MURANUSHI
The Hakubi Center for Advanced Research, Kyoto University
http://www.hakubi.kyoto-u.ac.jp/02_mem/h22/muranushi.html
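The check described in this message (sum the squared magnitudes of the output, compare with an expected value, and act on the outcome) could look roughly like the sketch below. It is not the benchmark's code: naiveDft is a hypothetical O(n^2) stand-in for a real library call, and the expected value comes from Parseval's relation for an unnormalized DFT, which is an assumption about what "expected results" means here.

```haskell
import Data.Complex (Complex (..), cis, magnitude)

-- Naive O(n^2) unnormalized DFT; a hypothetical stand-in for a library call.
naiveDft :: [Complex Double] -> [Complex Double]
naiveDft xs =
  [ sum [ x * cis (-2 * pi * fromIntegral (j * k) / fromIntegral n)
        | (j, x) <- zip [0 ..] xs ]
  | k <- [0 .. n - 1] ]
  where
    n = length xs

-- Sum of squared magnitudes. Consuming every element forces the whole result.
power :: [Complex Double] -> Double
power = sum . map (\z -> magnitude z ^ (2 :: Int))

-- Parseval for the unnormalized DFT: sum |X_k|^2 = n * sum |x_j|^2.
-- The comparison uses a relative tolerance.
parsevalHolds :: Double -> [Complex Double] -> Bool
parsevalHolds tol xs =
  abs (power (naiveDft xs) - expected) <= tol * max 1 expected
  where
    expected = fromIntegral (length xs) * power xs

main :: IO ()
main = do
  -- Hypothetical test signal of 64 real-valued samples.
  let signal = [ fromIntegral (i `mod` 5) :+ 0 | i <- [0 .. 63 :: Int] ]
  -- The side effect depends on the check, so the transform cannot be skipped.
  if parsevalHolds 1e-9 signal
    then putStrLn "power check passed"
    else error "power check failed"
```

Because the printed outcome depends on every output element, a lazy wrapper cannot avoid running the transform, which is the property the benchmark relies on.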

Takayuki Muranushi wrote:
* vector-fftw with wisdom was more than 1/2 times faster than fftw in C with wisdom (and with communication overhead.)
I would be suspicious of that result. Calling a C function from a library should be slower from Haskell than from C.
Sorry for the confusion, What I meant is that vector-fftw version takes more time than C version, but less than twice.
That makes much more sense. Whether you're calling fftw from C or from Haskell, it's still the fftw library doing most of the work. As you increase the FFT length, the difference between C and Haskell should decrease.
Cheers,
Erik
--
----------------------------------------------------------------------
Erik de Castro Lopo
http://www.mega-nerd.com/
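The argument above can be made concrete with a toy cost model. The sketch assumes that the Haskell binding adds a fixed per-call overhead on top of FFT work that scales as n log n; modelRatio and the constants passed to it are hypothetical and not measured values from the benchmark.

```haskell
-- Hypothetical cost model, valid for n >= 2:
--   C time       = c * n * log2 n
--   Haskell time = overhead + c * n * log2 n
-- The function returns Haskell time divided by C time.
modelRatio :: Double -> Double -> Int -> Double
modelRatio overhead c n = (overhead + work) / work
  where
    work = c * fromIntegral n * logBase 2 (fromIntegral n)

main :: IO ()
main =
  -- Illustrative constants only: overhead of 1000 units, 1 unit per n*log2 n.
  mapM_ (\n -> putStrLn (show n ++ "\t" ++ show (modelRatio 1000 1 n)))
        [1024, 16384, 1048576]
```

Under this model the ratio approaches 1 as n grows, since the fixed overhead is amortized over more work. Any cost in the binding that itself grows with n (for example copying) would change the picture.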
participants (4)
- Erik de Castro Lopo
- Ertugrul Söylemez
- Scott Michel
- Takayuki Muranushi