How to deal with huge text file?

Hi, I have a file including some operation logs, in the format of the following. And I have some code to analyze it. Well, it ate all my memories. --- log: Log for item A =========== 09:10 read accountA 09:20 read accountB Log for item B .... --- code: file <- U.readFile filename mapM_ (\text -> let ope = parseOPE text in if findMissingOperation ope then U.putStrLn $ show $ fst ope else return () ) $ seperateOutput file --This function separates the input text file into paragraph by "Log for". -- 竹密岂妨流水过 山高哪阻野云飞

On 25 May 2010 11:41, Magicloud Magiclouds
Hi, I have a file including some operation logs, in the format of the following. And I have some code to analyze it. Well, it ate all my memories. --- log: Log for item A =========== 09:10 read accountA 09:20 read accountB
Log for item B .... --- code: file <- U.readFile filename
Which module have you imported there with the `U'? For large text files, you probably want to use either lazy Bytestrings or lazy Text values. -- Ivan Lazar Miljenovic Ivan.Miljenovic@gmail.com IvanMiljenovic.wordpress.com

U is for UTF8 module. And I will try the modules you mentioned.
Although I thought Haskell IO is lazy enough....
On Tue, May 25, 2010 at 9:46 AM, Ivan Miljenovic
On 25 May 2010 11:41, Magicloud Magiclouds
wrote: Hi, I have a file including some operation logs, in the format of the following. And I have some code to analyze it. Well, it ate all my memories. --- log: Log for item A =========== 09:10 read accountA 09:20 read accountB
Log for item B .... --- code: file <- U.readFile filename
Which module have you imported there with the `U'?
For large text files, you probably want to use either lazy Bytestrings or lazy Text values.
-- Ivan Lazar Miljenovic Ivan.Miljenovic@gmail.com IvanMiljenovic.wordpress.com
-- 竹密岂妨流水过 山高哪阻野云飞

On 25 May 2010 12:02, Magicloud Magiclouds
U is for UTF8 module. And I will try the modules you mentioned. Although I thought Haskell IO is lazy enough....
If you're only streaming data, it probably would be. However, you seem to keep some of it in memory, which is what the problem is. You might be able to fix this by doing "main = readFile filename >>= liftM separateOutput >>= mapM_ foo". However, it depends on what separateOutput does. Also, consider using when (from Control.Monad) instead of your if statement. -- Ivan Lazar Miljenovic Ivan.Miljenovic@gmail.com IvanMiljenovic.wordpress.com

This is the function. The problem sure seems like something was
preserved unexpected. But I cannot find out where is the problem.
seperateOutput file =
let content = lines file
indexOfEachOutput_ = fst $ unzip $ filter (\(i, l) ->
" Log for " `isPrefixOf` l
) $ zip [0..] content
indexOfEachOutput = indexOfEachOutput_ ++ [length content] in
map (\(a, b) ->
drop a $ take b content
) $ zip indexOfEachOutput $ tail indexOfEachOutput
On Tue, May 25, 2010 at 10:12 AM, Ivan Miljenovic
On 25 May 2010 12:02, Magicloud Magiclouds
wrote: U is for UTF8 module. And I will try the modules you mentioned. Although I thought Haskell IO is lazy enough....
If you're only streaming data, it probably would be. However, you seem to keep some of it in memory, which is what the problem is. You might be able to fix this by doing "main = readFile filename >>= liftM separateOutput >>= mapM_ foo". However, it depends on what separateOutput does.
Also, consider using when (from Control.Monad) instead of your if statement.
-- Ivan Lazar Miljenovic Ivan.Miljenovic@gmail.com IvanMiljenovic.wordpress.com
-- 竹密岂妨流水过 山高哪阻野云飞

On 25 May 2010 12:20, Magicloud Magiclouds
This is the function. The problem sure seems like something was preserved unexpected. But I cannot find out where is the problem.
seperateOutput file = let content = lines file indexOfEachOutput_ = fst $ unzip $ filter (\(i, l) -> " Log for " `isPrefixOf` l ) $ zip [0..] content indexOfEachOutput = indexOfEachOutput_ ++ [length content] in
^^^^^^^^^^^^^^^^ Expensive bit
map (\(a, b) -> drop a $ take b content ) $ zip indexOfEachOutput $ tail indexOfEachOutput
You're not "streaming" the String; you're also keeping it around to calculate the length (I'm also not sure how GHC optimises that if at all; it might even re-evaluate the length each time you use indexOfEachOutput. The zipping of indexOfEachOutput should be OK without that length at the end, as it will lazy construct the zipped list (only evaluating up to two values at a time). However, you'd be better off using "zipWith f" rather than "map f . zip". -- Ivan Lazar Miljenovic Ivan.Miljenovic@gmail.com IvanMiljenovic.wordpress.com

On Tuesday 25 May 2010 04:26:07, Ivan Miljenovic wrote:
On 25 May 2010 12:20, Magicloud Magiclouds
wrote: This is the function. The problem sure seems like something was preserved unexpected. But I cannot find out where is the problem.
seperateOutput file = let content = lines file indexOfEachOutput_ = fst $ unzip $ filter (\(i, l) -> " Log for " `isPrefixOf` l ) $ zip [0..] content indexOfEachOutput = indexOfEachOutput_ ++ [length content] in
^^^^^^^^^^^^^^^^
Expensive bit
map (\(a, b) -> drop a $ take b content ) $ zip indexOfEachOutput $ tail indexOfEachOutput
You're not "streaming" the String; you're also keeping it around to calculate the length (I'm also not sure how GHC optimises that if at all; it might even re-evaluate the length each time you use indexOfEachOutput.
Not that it helps, but it evaluates the length only once. But it does that at the very end, when dealing with the last log.
The zipping of indexOfEachOutput should be OK without that length at the end, as it will lazy construct the zipped list (only evaluating up to two values at a time). However, you'd be better off using "zipWith f" rather than "map f . zip".
There'd still be the problem of drop a $ take b content , so nothing can be garbage collected before everything's done. separateOutpout file = let contents = lines file split = break ("Log for " `isPrefixOf`) msplit [] = Nothing msplit lns = Just (split lns) in drop 1 $ unfoldr msplit contents should fix it.

Yes, this code works with a little hack. Thank you.
On Tue, May 25, 2010 at 11:06 AM, Daniel Fischer
On Tuesday 25 May 2010 04:26:07, Ivan Miljenovic wrote:
On 25 May 2010 12:20, Magicloud Magiclouds
wrote: This is the function. The problem sure seems like something was preserved unexpected. But I cannot find out where is the problem.
seperateOutput file = let content = lines file indexOfEachOutput_ = fst $ unzip $ filter (\(i, l) -> " Log for " `isPrefixOf` l ) $ zip [0..] content indexOfEachOutput = indexOfEachOutput_ ++ [length content] in
^^^^^^^^^^^^^^^^
Expensive bit
map (\(a, b) -> drop a $ take b content ) $ zip indexOfEachOutput $ tail indexOfEachOutput
You're not "streaming" the String; you're also keeping it around to calculate the length (I'm also not sure how GHC optimises that if at all; it might even re-evaluate the length each time you use indexOfEachOutput.
Not that it helps, but it evaluates the length only once. But it does that at the very end, when dealing with the last log.
The zipping of indexOfEachOutput should be OK without that length at the end, as it will lazy construct the zipped list (only evaluating up to two values at a time). However, you'd be better off using "zipWith f" rather than "map f . zip".
There'd still be the problem of
drop a $ take b content
, so nothing can be garbage collected before everything's done.
separateOutpout file = let contents = lines file split = break ("Log for " `isPrefixOf`) msplit [] = Nothing msplit lns = Just (split lns) in drop 1 $ unfoldr msplit contents
should fix it.
-- 竹密岂妨流水过 山高哪阻野云飞

On Tuesday 25 May 2010 08:14:13, Ivan Miljenovic wrote:
On 25 May 2010 16:12, Magicloud Magiclouds
wrote: Yes, this code works with a little hack. Thank you.
I'm scared to ask: what pray tell is this little hack?
Looking at it again, probably making it work at all, because I never consumed the "Log for " lines, so produced an infinite list of empty lists :-/ Something like separateOutput :: String -> [[String]] separateOutput file = let contents = dropWhile (not . ("Log for " `isPrefixOf`)) $ lines file split [] = ([],[]) split (h:tl) = let (lg,rst) = break ("Log for " `isPrefixOf`) tl in (h:lg,rst) msplit [] = Nothing msplit lns = Just (split lns) in unfoldr msplit contents which really works :)

Daniel Fischer
On Tuesday 25 May 2010 08:14:13, Ivan Miljenovic wrote:
On 25 May 2010 16:12, Magicloud Magiclouds
wrote: Yes, this code works with a little hack. Thank you.
I'm scared to ask: what pray tell is this little hack?
Looking at it again, probably making it work at all, because I never consumed the "Log for " lines, so produced an infinite list of empty lists :-/
Oh, yeah, I've done that before (with custom chunking functions, etc.). Just so you know, the split package might already have a function to do what you want... -- Ivan Lazar Miljenovic Ivan.Miljenovic@gmail.com IvanMiljenovic.wordpress.com

On Tuesday 25 May 2010 10:44:57, Ivan Lazar Miljenovic wrote:
Daniel Fischer
writes: On Tuesday 25 May 2010 08:14:13, Ivan Miljenovic wrote:
On 25 May 2010 16:12, Magicloud Magiclouds
wrote: Yes, this code works with a little hack. Thank you.
I'm scared to ask: what pray tell is this little hack?
Looking at it again, probably making it work at all, because I never consumed the "Log for " lines, so produced an infinite list of empty lists
:-/
Oh, yeah, I've done that before (with custom chunking functions, etc.).
Who hasn't?
Just so you know, the split package might already have a function to do what you want...
Sort of. Not one function, but the building blocks: import Data.List (isPrefixOf) import Data.List.Split msplit = split (keepDelimsL $ whenElt ("Log for " `isPrefixOf`)) I think. But who looks at API docs in the dead of the night when cooking up the function seems so easy :)
participants (4)
-
Daniel Fischer
-
Ivan Lazar Miljenovic
-
Ivan Miljenovic
-
Magicloud Magiclouds