performance of map reduce

Hi again.

In http://book.realworldhaskell.org/read/concurrent-and-multicore-programming.h... there is a map-reduce-based log parser. I have written an alternative version: http://paste.pocoo.org/show/85699/

With a file of 315 MB, I get [1]:

1) map reduce implementation, non parallel: real 0m6.643s, user 0m6.252s, sys 0m0.212s
2) map reduce implementation, parallel with 2 cores: real 0m3.840s, user 0m6.384s, sys 0m0.652s
3) my implementation: real 0m8.121s, user 0m7.804s, sys 0m0.216s

Why is the map reduce implementation faster, even when not parallelized?

Is it possible to implement a map reduce version that can handle gzipped log files?

[1] These tests do not consider the "first run". For the first run (no data in the OS cache), I get (not verified):

1) map reduce implementation, parallel with 2 cores: real 0m3.735s, user 0m6.328s, sys 0m0.604s
2) my implementation: real 0m13.659s, user 0m7.712s, sys 0m0.360s

Thanks Manlio Perillo

manlio_perillo:
Hi again.
In http://book.realworldhaskell.org/read/concurrent-and-multicore-programming.h... there is a map reduce based log parser.
I have written an alternative version: http://paste.pocoo.org/show/85699/
but, with a file of 315 MB, I have [1]:
1) map reduce implementation, non parallel: real 0m6.643s, user 0m6.252s, sys 0m0.212s
2) map reduce implementation, parallel with 2 cores: real 0m3.840s, user 0m6.384s, sys 0m0.652s
3) my implementation: real 0m8.121s, user 0m7.804s, sys 0m0.216s
Why is the map reduce implementation faster, even when not parallelized?
Changes in how GC is utilised, or how optimisation works?
Is it possible to implement a map reduce version that can handle gzipped log files?
Using the zlib binding on hackage.haskell.org, you can stream multiple zlib decompression threads with lazy bytestrings, and combine the results. -- Don
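Don's suggestion can be sketched with the pure interface of the zlib package from Hackage. This is only an illustration, not code from the thread: `countGzippedLines` is a name of my own, and the round-trip in `main` stands in for reading a real gzipped log file.

```haskell
import qualified Data.ByteString.Lazy.Char8 as L
import Codec.Compression.GZip (compress, decompress)

-- Count the lines of a gzip-compressed log held as a lazy ByteString.
-- decompress is pure and lazy: chunks are inflated only as the
-- consumer demands them, so the whole file never sits in memory.
countGzippedLines :: L.ByteString -> Int
countGzippedLines = length . L.lines . decompress

main :: IO ()
main = do
  -- Round-trip a tiny sample instead of reading a real log file.
  let sample = compress (L.pack "GET /a\nGET /b\nGET /c\n")
  print (countGzippedLines sample)  -- 3
```

In a real program the `sample` would come from `L.readFile` on the gzipped log, and the per-line parsing would replace the plain line count.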

Hello Don, Friday, September 19, 2008, 9:12:43 PM, you wrote:
Is it possible to implement a map reduce version that can handle gzipped log files?
Using the zlib binding on hackage.haskell.org, you can stream multiple zlib decompression threads with lazy bytestrings, and combine the results.
For gzip decompression you need to have decompressed all the previous data, so decompression of a single file cannot be multithreaded. -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com
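Bulat's point rules out inflating a single gzip stream in parallel, but the per-line work after the (sequential) decompression can still be parallelized. A sketch assuming the parallel package; `parseLine` is a hypothetical stand-in for real per-line parsing:

```haskell
import Control.Parallel.Strategies (parListChunk, rdeepseq, using)

-- Hypothetical per-line work; a real log parser would extract
-- fields here instead of just taking the length.
parseLine :: String -> Int
parseLine = length

-- Decompression must stay sequential, but once the lines exist
-- the map over them can be evaluated in parallel, chunk by chunk:
-- parListChunk sparks groups of 1024 lines at a time.
parseAll :: [String] -> Int
parseAll lns = sum (map parseLine lns `using` parListChunk 1024 rdeepseq)

main :: IO ()
main = print (parseAll ["GET /a", "GET /bb"])  -- 13
```

Compiled with `-threaded` and run with `+RTS -N2`, the sparks are picked up by both cores while the decompressor feeds lines from one thread.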

Don Stewart ha scritto:
manlio_perillo: [...]
Is it possible to implement a map reduce version that can handle gzipped log files?
Using the zlib binding on hackage.haskell.org, you can stream multiple zlib decompression threads with lazy bytestrings, and combine the results.
This is a bit hard. A deflate-encoded stream contains multiple blocks, so you would need to find the offset of each block and decompress it in parallel. But then you also need to make sure each final block terminates with a '\n'. And the zlib Haskell binding does not support this usage (I'm not even sure zlib supports it).

By the way, this phrase: "We allow multiple threads to read different chunks at once by supplying each one with a distinct file handle, all reading the same file", here: http://book.realworldhaskell.org/read/concurrent-and-multicore-programming.h... is IMHO not correct, or at least misleading. Each block is read in the main thread, or at least myThreadId always returns the same value.

This is also the reason why I don't understand why my version is slower than the book version. The only difference is that the book version reads 4 chunks and my version reads only 1 big chunk.
-- Don
Thanks Manlio

On Fri, Sep 19, 2008 at 2:31 PM, Manlio Perillo wrote:
By the way, this phrase: "We allow multiple threads to read different chunks at once by supplying each one with a distinct file handle, all reading the same file" here:
http://book.realworldhaskell.org/read/concurrent-and-multicore-programming.h...
IMHO is not correct, or at least misleading.
It's both correct and, er, leading. The files are opened in a single thread, and then the file handles are read by multiple threads.
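Bryan's description can be sketched as follows. This is a simplification of the book's approach, with names of my own: the handles are all opened and positioned in the calling thread, and each lazy ByteString does its actual reads in whichever thread later consumes it. Unlike the book's code, this sketch does not align chunk boundaries to '\n', and each lazy stream runs to end of file, so a consumer must stop at its own chunk boundary.

```haskell
import System.IO
import qualified Data.ByteString.Lazy.Char8 as L

-- Open one distinct handle per chunk, each seeked to its chunk's
-- start offset.  hGetContents is lazy, so no data is read here;
-- the reads happen in the consuming threads.
chunkHandles :: FilePath -> Int -> IO [L.ByteString]
chunkHandles path n = do
  totalSize <- withFile path ReadMode hFileSize
  let chunkSize = totalSize `div` fromIntegral n
  mapM (openChunk chunkSize) [0 .. fromIntegral n - 1]
  where
    openChunk size i = do
      h <- openFile path ReadMode           -- distinct handle per chunk
      hSeek h AbsoluteSeek (size * i)       -- position, don't read yet
      L.hGetContents h
```

Each worker thread would then be forked over one element of the result, which is what makes the reads themselves concurrent even though the opens were not.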

Bryan O'Sullivan ha scritto:
On Fri, Sep 19, 2008 at 2:31 PM, Manlio Perillo <mailto:manlio_perillo@libero.it> wrote: By the way, this phrase: "We allow multiple threads to read different chunks at once by supplying each one with a distinct file handle, all reading the same file" here: http://book.realworldhaskell.org/read/concurrent-and-multicore-programming.h...
IMHO is not correct, or at least misleading.
It's both correct and, er, leading. The files are opened in a single thread, and then the file handles are read by multiple threads.
Ah, right, thanks. They are read with ByteString.Lazy. Manlio Perillo
participants (4)
- Bryan O'Sullivan
- Bulat Ziganshin
- Don Stewart
- Manlio Perillo