How can I improve pipes' performance with a huge file?

Dear cafe,

I have two files. I want to zip the two files into pairs and then count how many times each pair repeats. The files have more than 40M rows. I wrote the code below using pipes. When I test with 8,768,000 rows of input it takes 30 seconds, and with 18,768,000 rows it takes 74 seconds. But when I test with the whole file (40M rows) it runs for more than 20 minutes without finishing. It takes more than 9G of memory, and the disk is also busy the whole time. The result will be fewer than 10k rows, so I have no idea why the memory usage is so huge.

I used http://hackage.haskell.org/package/visual-prof to profile and improve the performance with the small file, but I don't know how to deal with the "hang" situation. Can anyone give me some help? Thanks.

===================================

import System.IO
import System.Environment
import Pipes
import qualified Pipes.Prelude as P
import qualified Data.Map as DM
import Data.List

emptyMap = DM.empty :: DM.Map (String,String) Int

keyCount num = do
    readHandle1 <- openFile "dataByColumn/click" ReadMode
    readHandle2 <- openFile "dataByColumn/hour" ReadMode
    writeHadle  <- openFile "output" AppendMode
    rCount num readHandle1 readHandle2 writeHadle
    hClose writeHadle
    hClose readHandle1
    hClose readHandle2

mapToString :: DM.Map (String,String) Int -> String
mapToString m = unlines $ map eachItem itemList
    where
        itemList = DM.toList m
        eachItem ((x,y),i) = show x ++ "," ++ show y ++ "," ++ show i

-- rCount :: Int -> Handle -> Handle -> Handle -> IO ()
rCount num readHandle1 readHandle2 writeHadle = do
    rt <- P.fold (\x y -> DM.unionWith (+) x y) emptyMap id $
            P.zipWith (\x y -> DM.singleton (x,y) 1)
                      (P.fromHandle readHandle1)
                      (P.fromHandle readHandle2)
              >-> P.take num
    hPutStr writeHadle $ mapToString rt

main = do
    s <- getArgs
    let num = (read . head) s
    keyCount num
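The per-row DM.singleton plus DM.unionWith allocation above is one likely source of pressure. Here is a minimal sketch of an alternative (my own illustration, not from the thread; the name rCount' and the switch to Data.Map.Strict and P.zip are assumptions) that folds each pair straight into a strict map with insertWith:

```haskell
import System.IO
import Pipes
import qualified Pipes.Prelude as P
import qualified Data.Map.Strict as DM

-- Hypothetical variant of the poster's rCount: instead of building a
-- singleton Map per row and merging maps with unionWith, fold each
-- (click, hour) pair directly into one strict Map with insertWith.
rCount' :: Int -> Handle -> Handle -> Handle -> IO ()
rCount' num h1 h2 out = do
    rt <- P.fold (\m pair -> DM.insertWith (+) pair (1 :: Int) m) DM.empty id $
            P.zip (P.fromHandle h1) (P.fromHandle h2) >-> P.take num
    -- Write entries one at a time rather than via one big String.
    mapM_ (\((x, y), i) -> hPutStrLn out (x ++ "," ++ y ++ "," ++ show i))
          (DM.toList rt)
```

Because Data.Map.Strict forces each count as it is inserted, the map holds plain Ints rather than chains of (+) thunks.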

On Fri, Nov 14, 2014 at 05:43:15PM +0800, zhangjun.julian wrote:
But when I test with the whole file (40M rows), it runs for more than 20 minutes without finishing. It takes more than 9G of memory, and the disk is also busy the whole time.
import qualified Data.Map as DM
At the very least you should be using import qualified Data.Map.Strict as DM

Dear Tom,

I changed Map to Strict; it is a little faster when testing with 18M rows, but it hangs again with 40M rows. Do you have any other advice?
On 14 Nov 2014, at 17:52, Tom Ellis wrote:

On Fri, Nov 14, 2014 at 05:43:15PM +0800, zhangjun.julian wrote:
But when I test with the whole file (40M rows), it runs for more than 20 minutes without finishing. It takes more than 9G of memory, and the disk is also busy the whole time.
import qualified Data.Map as DM
At the very least you should be using
import qualified Data.Map.Strict as DM

_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe
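To see why the strict variant matters here, a small self-contained comparison (my own illustration, not from the thread): with Data.Map.Lazy the counter value accumulates unevaluated (+) thunks, while Data.Map.Strict forces it at each insert.

```haskell
import qualified Data.Map.Lazy as ML
import qualified Data.Map.Strict as MS
import Data.List (foldl')

main :: IO ()
main = do
    let pairs = replicate 100000 ("a", "b")
        -- Lazy map: until something demands it, the value for ("a","b")
        -- is 100000 nested (+) thunks kept alive on the heap.
        lazyM   = foldl' (\m k -> ML.insertWith (+) k (1 :: Int) m) ML.empty pairs
        -- Strict map: each insertWith forces the sum, so the value is
        -- always a plain evaluated Int.
        strictM = foldl' (\m k -> MS.insertWith (+) k (1 :: Int) m) MS.empty pairs
    print (ML.toList lazyM)    -- [(("a","b"),100000)]
    print (MS.toList strictM)  -- [(("a","b"),100000)]
```

Both versions print the same answer; the difference is purely in how much memory is held before the final print forces the thunks.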

On Fri, Nov 14, 2014 at 09:14:17PM +0800, zhangjun.julian wrote:
Dear Tom
I changed Map to Strict; it is a little faster when testing with 18M rows, but it hangs again with 40M rows.
Do you have any other advice?
Dear Zhangjun Julian,

Perhaps too much of the output string is being kept around when it is printed. I would try

    mapM_ (\((x,y), i) -> hPutStrLn writeHadle (show x ++ "," ++ show y ++ "," ++ show i)) (DM.toList rt)

instead of

    hPutStr writeHadle $ mapToString rt

Apart from that, I don't have any other ideas. Can you determine whether the large memory usage comes from the Pipe or from the printing of the result?

Tom

Are you compiling? Just recently I had someone complain that Haskell wouldn't handle large files, but he was using the interpreter. After compiling, his problem vanished.

--
Sent from an expensive device which will be obsolete in a few months! :D

Casey

On Nov 14, 2014 5:54 AM, "Tom Ellis" <tom-lists-haskell-cafe-2013@jaguarpaw.co.uk> wrote:
On Fri, Nov 14, 2014 at 09:14:17PM +0800, zhangjun.julian wrote:
Dear Tom
I changed Map to Strict; it is a little faster when testing with 18M rows, but it hangs again with 40M rows.
Do you have any other advice?
Dear Zhangjun Julian,
Perhaps too much of the output string is being kept around when it is printed. I would try
mapM_ (\((x,y), i) -> hPutStrLn writeHadle (show x ++ "," ++ show y ++ "," ++ show i)) (DM.toList rt)
instead of
hPutStr writeHadle $ mapToString rt
Apart from that, I don't have any other ideas. Can you determine whether the large memory usage comes from the Pipe or from the printing of the result?
Tom
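Tom's suggestion above, as a minimal runnable sketch (the map contents here are made up for illustration):

```haskell
import System.IO
import qualified Data.Map.Strict as DM

main :: IO ()
main = do
    let rt = DM.fromList [(("a", "1"), 3), (("b", "2"), 5)]
    withFile "output" WriteMode $ \writeHadle ->
        -- Write each entry as it is consumed, instead of first building
        -- the whole output as one String with mapToString and keeping
        -- that String alive while it is written out.
        mapM_ (\((x, y), i) ->
                 hPutStrLn writeHadle (show x ++ "," ++ show y ++ "," ++ show i))
              (DM.toList rt)
```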

On 14.11.2014 10:43, zhangjun.julian wrote:
emptyMap = DM.empty::(DM.Map (String,String) Int)
Laziness makes your data swell.

1) Try using ByteString or Text instead of String.
2) Try the UNPACK pragma; AFAIR it requires -O2.

    data Key = Key {-# UNPACK #-} !ByteString {-# UNPACK #-} !ByteString

https://hackage.haskell.org/package/ghc-datasize - this package will help you determine the actual data size.

--
Wojtek
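A compilable sketch of this suggestion (the Key type is Wojtek's; the countPairs helper is my own illustrative name, and -O2 is assumed so the UNPACK pragmas take effect):

```haskell
import qualified Data.ByteString.Char8 as B
import qualified Data.Map.Strict as DM
import Data.List (foldl')

-- Strict, unpacked key type in place of (String, String).  The bangs
-- keep the fields evaluated; UNPACK (with -O2) unboxes the ByteString
-- constructors directly into Key, saving a pointer indirection each.
data Key = Key {-# UNPACK #-} !B.ByteString
               {-# UNPACK #-} !B.ByteString
    deriving (Eq, Ord, Show)

-- Count occurrences of each pair using the compact key.
countPairs :: [(B.ByteString, B.ByteString)] -> DM.Map Key Int
countPairs = foldl' (\m (x, y) -> DM.insertWith (+) (Key x y) 1 m) DM.empty
```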

On Fri, Nov 14, 2014 at 05:47:16PM +0100, Wojtek Narczyński wrote:
On 14.11.2014 10:43, zhangjun.julian wrote:
emptyMap = DM.empty::(DM.Map (String,String) Int)
Laziness makes your data swell.
1) Try using ByteString or Text instead of String.
2) Try the UNPACK pragma; AFAIR it requires -O2.

    data Key = Key {-# UNPACK #-} !ByteString {-# UNPACK #-} !ByteString

https://hackage.haskell.org/package/ghc-datasize - this package will help you determine the actual data size.
This is certainly true, but there is a distinction to be drawn between "swollen data" that is a few times bigger than it could be, and a space leak. Zhangjun Julian's biggest problem is definitely the latter. There's no reason that compiling a dictionary counting occurrences and printing it out should consume 9GB. Once the space leak is fixed, your suggestions will help reduce memory usage further.

Tom

On 14.11.2014 18:31, Tom Ellis wrote:
Zhangjun Julian's biggest problem is definitely the latter. There's no reason that compiling a dictionary counting occurrences and printing it out should consume 9GB. Once the space leak is fixed, your suggestions will help reduce memory usage further.
Right, I missed that the expected cardinality of the set is 10K.

Dear Tom and others,

I'm sorry, I think I made a mistake: I tested Tom's advice in my master branch, not in the demo code. In the master branch I have a list of files to read, so I use mapM_ to call rCount as below:

    mapM_ (\(x,y) -> rCount num readhandle1 x y) handlePairList

If I change my Map to Strict and call rCount directly (without mapM_), the memory does not swell. I can understand why a lazy Map causes swelling, but I don't know why mapM_ causes it. Is mapM_ lazy too? Is there a strict alternative I can use?
On 15 Nov 2014, at 01:31, Tom Ellis wrote:

On Fri, Nov 14, 2014 at 05:47:16PM +0100, Wojtek Narczyński wrote:
On 14.11.2014 10:43, zhangjun.julian wrote:
emptyMap = DM.empty::(DM.Map (String,String) Int)
Laziness makes your data swell.
1) Try using ByteString or Text instead of String.
2) Try the UNPACK pragma; AFAIR it requires -O2.

    data Key = Key {-# UNPACK #-} !ByteString {-# UNPACK #-} !ByteString

https://hackage.haskell.org/package/ghc-datasize - this package will help you determine the actual data size.
This is certainly true, but there is a distinction to be drawn between "swollen data" that is a few times bigger than it could be, and a space leak.
Zhangjun Julian's biggest problem is definitely the latter. There's no reason that compiling a dictionary counting occurences and printing it out should consume 9GB. Once the space leak is fixed your suggestions will help reduce memory usage further.
Tom

On 14.11.2014 23:50, zhangjun.julian wrote:
If I change my Map to Strict and call rCount directly (without mapM_), the memory does not swell.
I used the word "swell" to describe the phenomenon that in Haskell pointers can consume vast amounts of memory, especially on 64-bit architectures. For example, the type (Bool,Bool,Bool,Bool,Bool,Bool,Bool,Bool), one byte of information, will take 136 bytes, unless you fight the laziness feature with ! and UNPACK. Okay, this is an evil contrived example, but you get the idea. It is not generally accepted nomenclature.

You were hit by what is referred to as a "space leak", that is, a build-up of unevaluated closures.

Anyway, I'm glad to hear that your problem is gone.

--
Kind regards,
Wojtek

On Sat, Nov 15, 2014 at 06:50:26AM +0800, zhangjun.julian wrote:
In the master branch I have a list of files to read, so I use mapM_ to call rCount as below
mapM_ (\(x,y) -> rCount num readhandle1 x y) handlePairList
If I change my Map to Strict and call rCount directly (without mapM_), the memory does not swell.
I can understand why a lazy Map causes swelling, but I don't know why mapM_ causes it. Is mapM_ lazy too? Is there a strict alternative I can use?
I think you need to provide us with more details about exactly what this mapM_ is doing.

Dear Tom and others,

I have fixed the problem by changing the code as follows:

1. Change List to Sequence
2. Change mapM_ to Data.Traversable.sequence $ Sequence.map

Thanks for your help.
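As I read it, a sketch of that change (my own reconstruction: the jobs list is a stand-in for the poster's handlePairList, and fmap plays the role of the Sequence.map mentioned above, since Data.Sequence relies on the Functor instance for mapping):

```haskell
import qualified Data.Sequence as Seq
import qualified Data.Traversable as T

main :: IO ()
main = do
    let jobs = Seq.fromList [1 .. 3 :: Int]   -- stand-in for handlePairList
    -- Build a Seq of IO actions with fmap, then run them in order with
    -- Data.Traversable.sequence, instead of mapM_ over a plain list.
    _ <- T.sequence (fmap (\n -> putStrLn ("job " ++ show n)) jobs)
    return ()
```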
On 16 Nov 2014, at 17:47, Tom Ellis wrote:

On Sat, Nov 15, 2014 at 06:50:26AM +0800, zhangjun.julian wrote:
In the master branch I have a list of files to read, so I use mapM_ to call rCount as below
mapM_ (\(x,y) -> rCount num readhandle1 x y) handlePairList
If I change my Map to Strict and call rCount directly (without mapM_), the memory does not swell.
I can understand why a lazy Map causes swelling, but I don't know why mapM_ causes it. Is mapM_ lazy too? Is there a strict alternative I can use?
I think you need to provide us with more details about exactly what this mapM_ is doing.
participants (4)

- KC
- Tom Ellis
- Wojtek Narczyński
- zhangjun.julian