
Don Stewart wrote:
manlio_perillo: [...]
Is it possible to implement a map-reduce version that can handle gzipped log files?
Using the zlib binding on hackage.haskell.org, you can stream multiple zlib decompression threads with lazy bytestrings, and combine the results.
This is a bit hard. A deflate-encoded stream contains multiple blocks, so you need to find the offset of each block and decompress the blocks in parallel. You also need to make sure that each resulting block ends with a '\n'. The zlib Haskell binding does not support this usage (I'm not even sure zlib itself supports it).

By the way, this phrase:

"We allow multiple threads to read different chunks at once by supplying each one with a distinct file handle, all reading the same file"

here: http://book.realworldhaskell.org/read/concurrent-and-multicore-programming.h...

is IMHO not correct, or at least misleading. Each block is read in the main thread, or at least myThreadId always returns the same value.

This is also why I don't understand why my version is slower than the book version. The only difference is that the book version reads 4 chunks and my version reads only 1 big chunk.
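To make the '\n' requirement concrete, here is a minimal sketch of splitting already-decompressed data into chunks that each end at a line boundary. It does not address the harder problem of locating deflate block offsets. The function name lineAlignedChunks and the size parameter are made up for illustration; this is not the book's code.

```haskell
import qualified Data.ByteString.Lazy.Char8 as L
import Data.Int (Int64)

-- | Hypothetical helper: split the input into pieces of at least
-- 'size' bytes (except possibly the last one), extending each piece
-- up to and including the next '\n' so that no line straddles two
-- pieces. Only the last piece can lack a trailing newline, and only
-- when the input itself does.
lineAlignedChunks :: Int64 -> L.ByteString -> [L.ByteString]
lineAlignedChunks size input
  | L.null input = []
  | otherwise =
      let (pre, rest)         = L.splitAt size input
          -- the remainder of the line that 'pre' cut in the middle
          (lineTail, rest')   = L.break (== '\n') rest
          -- the newline itself, if there is one
          (newline, rest'')   = L.splitAt 1 rest'
          chunk               = L.concat [pre, lineTail, newline]
      in chunk : lineAlignedChunks size rest''
```

Each chunk can then be handed to a separate worker (for example, to count or parse its lines), and the per-chunk results combined afterwards. Concatenating the chunks gives back the original input.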
-- Don
Thanks,
Manlio