
Dear cafe,

I have two files. I want to zip them line by line into pairs, and then count how many times each pair occurs. The files have more than 40M rows, and I wrote the code below using pipes.

When I test with 8768000 rows of input, it takes 30 seconds.
When I test with 18768000 rows of input, it takes 74 seconds.
But when I test with the whole file (40M rows), it runs for more than 20 minutes and has still not finished.

It uses more than 9 GB of memory, and the disk is busy the whole time. The result will have fewer than 10k rows, so I have no idea why the memory usage is so large.

I used http://hackage.haskell.org/package/visual-prof to profile and improve the performance with the small file, but I don't know how to deal with the "hang" situation.

Can anyone give me some help? Thanks.

===================================
import System.IO
import System.Environment
import Pipes
import qualified Pipes.Prelude as P
import qualified Data.Map as DM
import Data.List

emptyMap = DM.empty :: DM.Map (String, String) Int

-- Open both input files and the output file, count the first num pairs,
-- then close all handles.
keyCount num = do
    readHandle1 <- openFile "dataByColumn/click" ReadMode
    readHandle2 <- openFile "dataByColumn/hour" ReadMode
    writeHandle <- openFile "output" AppendMode
    rCount num readHandle1 readHandle2 writeHandle
    hClose writeHandle
    hClose readHandle1
    hClose readHandle2

-- Render each ((x, y), count) entry as one comma-separated line.
mapToString :: DM.Map (String, String) Int -> String
mapToString m = unlines $ map eachItem itemList
  where
    itemList = DM.toList m
    eachItem ((x, y), i) = show x ++ "," ++ show y ++ "," ++ show i

--rCount::Int -> [String] -> Handle->Handle -> IO()
-- Zip the two files line by line, turn each pair into a singleton map,
-- and merge the maps by summing the counts.
rCount num readHandle1 readHandle2 writeHandle = do
    rt <- P.fold (\x y -> DM.unionWith (+) x y) emptyMap id $
        P.zipWith (\x y -> DM.singleton (x, y) 1)
                  (P.fromHandle readHandle1)
                  (P.fromHandle readHandle2)
        >-> P.take num
    hPutStr writeHandle $ mapToString rt

main = do
    s <- getArgs
    let num = (read . head) s
    keyCount num
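
A possible explanation, offered as an assumption and not a confirmed diagnosis: Data.Map is lazy in its values, so DM.unionWith (+) may leave each count as a growing chain of unevaluated (+) thunks, even though the number of distinct keys is small. The sketch below shows the same counting done with Data.Map.Strict and a strict left fold. The names bump and countPairs are hypothetical helpers, not part of the program above, and the sample input is made up for illustration.

```haskell
import Data.List (foldl')
import qualified Data.Map.Strict as M

-- Hypothetical helper (not in the original program): record one more
-- occurrence of a pair. Data.Map.Strict evaluates the new count when it
-- is inserted, so no chain of (+) thunks is kept per key.
bump :: M.Map (String, String) Int -> (String, String) -> M.Map (String, String) Int
bump m k = M.insertWith (+) k 1 m

-- Count how often each pair occurs, using a strict left fold.
countPairs :: [(String, String)] -> M.Map (String, String) Int
countPairs = foldl' bump M.empty

main :: IO ()
main = do
    -- Made-up sample data standing in for the two input files.
    let clicks = ["0", "1", "0", "0"]
        hours  = ["10", "11", "10", "12"]
    mapM_ print (M.toList (countPairs (zip clicks hours)))
```

If this assumption holds, the same step function should also fit the pipes version, for example P.fold bump M.empty id over P.zip (P.fromHandle readHandle1) (P.fromHandle readHandle2) >-> P.take num. Whether that removes the memory growth on the 40M-row input has not been measured here.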