How to deal with huge text file?

older
ANNOUNCE: Packages for computer...

Magicloud Magiclouds

25 May 2010 25 May '10

1:41 a.m.

Hi, I have a file including some operation logs, in the format of the following. And I have some code to analyze it. Well, it ate all my memories. --- log: Log for item A =========== 09:10 read accountA 09:20 read accountB Log for item B .... --- code: file <- U.readFile filename mapM_ (\text -> let ope = parseOPE text in if findMissingOperation ope then U.putStrLn $ show $ fst ope else return () ) $ seperateOutput file --This function separates the input text file into paragraph by "Log for". -- 竹密岂妨流水过山高哪阻野云飞

Show replies by date

Ivan Miljenovic

25 May 25 May

1:46 a.m.

On 25 May 2010 11:41, Magicloud Magiclouds wrote:

...

Hi, I have a file including some operation logs, in the format of the following. And I have some code to analyze it. Well, it ate all my memories. --- log: Log for item A =========== 09:10 read accountA 09:20 read accountB

Log for item B .... --- code: file <- U.readFile filename

Which module have you imported there with the `U'? For large text files, you probably want to use either lazy Bytestrings or lazy Text values. -- Ivan Lazar Miljenovic Ivan.Miljenovic@gmail.com IvanMiljenovic.wordpress.com

Magicloud Magiclouds

2:02 a.m.

U is for UTF8 module. And I will try the modules you mentioned. Although I thought Haskell IO is lazy enough.... On Tue, May 25, 2010 at 9:46 AM, Ivan Miljenovic wrote:

...

On 25 May 2010 11:41, Magicloud Magiclouds wrote:

...
Hi, I have a file including some operation logs, in the format of the following. And I have some code to analyze it. Well, it ate all my memories. --- log: Log for item A =========== 09:10 read accountA 09:20 read accountB

Log for item B .... --- code: file <- U.readFile filename

Which module have you imported there with the `U'?

For large text files, you probably want to use either lazy Bytestrings or lazy Text values.

-- Ivan Lazar Miljenovic Ivan.Miljenovic@gmail.com IvanMiljenovic.wordpress.com

-- 竹密岂妨流水过山高哪阻野云飞

Ivan Miljenovic

2:12 a.m.

On 25 May 2010 12:02, Magicloud Magiclouds wrote:

...

U is for UTF8 module. And I will try the modules you mentioned. Although I thought Haskell IO is lazy enough....

If you're only streaming data, it probably would be. However, you seem to keep some of it in memory, which is what the problem is. You might be able to fix this by doing "main = readFile filename >>= liftM separateOutput >>= mapM_ foo". However, it depends on what separateOutput does. Also, consider using when (from Control.Monad) instead of your if statement. -- Ivan Lazar Miljenovic Ivan.Miljenovic@gmail.com IvanMiljenovic.wordpress.com

Magicloud Magiclouds

2:20 a.m.

This is the function. The problem sure seems like something was preserved unexpected. But I cannot find out where is the problem. seperateOutput file = let content = lines file indexOfEachOutput_ = fst $ unzip $ filter (\(i, l) -> " Log for " `isPrefixOf` l ) $ zip [0..] content indexOfEachOutput = indexOfEachOutput_ ++ [length content] in map (\(a, b) -> drop a $ take b content ) $ zip indexOfEachOutput $ tail indexOfEachOutput On Tue, May 25, 2010 at 10:12 AM, Ivan Miljenovic wrote:

...

On 25 May 2010 12:02, Magicloud Magiclouds wrote:

...
U is for UTF8 module. And I will try the modules you mentioned. Although I thought Haskell IO is lazy enough....

If you're only streaming data, it probably would be. However, you seem to keep some of it in memory, which is what the problem is. You might be able to fix this by doing "main = readFile filename >>= liftM separateOutput >>= mapM_ foo". However, it depends on what separateOutput does.

Also, consider using when (from Control.Monad) instead of your if statement.

-- Ivan Lazar Miljenovic Ivan.Miljenovic@gmail.com IvanMiljenovic.wordpress.com

-- 竹密岂妨流水过山高哪阻野云飞

Ivan Miljenovic

2:26 a.m.

On 25 May 2010 12:20, Magicloud Magiclouds wrote:

...

This is the function. The problem sure seems like something was preserved unexpected. But I cannot find out where is the problem.

seperateOutput file = let content = lines file indexOfEachOutput_ = fst $ unzip $ filter (\(i, l) -> " Log for " `isPrefixOf` l ) $ zip [0..] content indexOfEachOutput = indexOfEachOutput_ ++ [length content] in

^^^^^^^^^^^^^^^^ Expensive bit

...

map (\(a, b) -> drop a $ take b content ) $ zip indexOfEachOutput $ tail indexOfEachOutput

You're not "streaming" the String; you're also keeping it around to calculate the length (I'm also not sure how GHC optimises that if at all; it might even re-evaluate the length each time you use indexOfEachOutput. The zipping of indexOfEachOutput should be OK without that length at the end, as it will lazy construct the zipped list (only evaluating up to two values at a time). However, you'd be better off using "zipWith f" rather than "map f . zip". -- Ivan Lazar Miljenovic Ivan.Miljenovic@gmail.com IvanMiljenovic.wordpress.com

Daniel Fischer

3:06 a.m.

On Tuesday 25 May 2010 04:26:07, Ivan Miljenovic wrote:

...

On 25 May 2010 12:20, Magicloud Magiclouds

wrote:

...
This is the function. The problem sure seems like something was preserved unexpected. But I cannot find out where is the problem.

seperateOutput file = let content = lines file indexOfEachOutput_ = fst $ unzip $ filter (\(i, l) -> " Log for " `isPrefixOf` l ) $ zip [0..] content indexOfEachOutput = indexOfEachOutput_ ++ [length content] in

^^^^^^^^^^^^^^^^

Expensive bit

...
map (\(a, b) -> drop a $ take b content ) $ zip indexOfEachOutput $ tail indexOfEachOutput

You're not "streaming" the String; you're also keeping it around to calculate the length (I'm also not sure how GHC optimises that if at all; it might even re-evaluate the length each time you use indexOfEachOutput.

Not that it helps, but it evaluates the length only once. But it does that at the very end, when dealing with the last log.

...

The zipping of indexOfEachOutput should be OK without that length at the end, as it will lazy construct the zipped list (only evaluating up to two values at a time). However, you'd be better off using "zipWith f" rather than "map f . zip".

There'd still be the problem of drop a $ take b content , so nothing can be garbage collected before everything's done. separateOutpout file = let contents = lines file split = break ("Log for " `isPrefixOf`) msplit [] = Nothing msplit lns = Just (split lns) in drop 1 $ unfoldr msplit contents should fix it.

Magicloud Magiclouds

6:12 a.m.

Yes, this code works with a little hack. Thank you. On Tue, May 25, 2010 at 11:06 AM, Daniel Fischer wrote:

...

On Tuesday 25 May 2010 04:26:07, Ivan Miljenovic wrote:

...
On 25 May 2010 12:20, Magicloud Magiclouds

wrote:

...
This is the function. The problem sure seems like something was preserved unexpected. But I cannot find out where is the problem.

seperateOutput file = let content = lines file indexOfEachOutput_ = fst $ unzip $ filter (\(i, l) -> " Log for " `isPrefixOf` l ) $ zip [0..] content indexOfEachOutput = indexOfEachOutput_ ++ [length content] in

^^^^^^^^^^^^^^^^

Expensive bit

...
map (\(a, b) -> drop a $ take b content ) $ zip indexOfEachOutput $ tail indexOfEachOutput

You're not "streaming" the String; you're also keeping it around to calculate the length (I'm also not sure how GHC optimises that if at all; it might even re-evaluate the length each time you use indexOfEachOutput.

Not that it helps, but it evaluates the length only once. But it does that at the very end, when dealing with the last log.

...
The zipping of indexOfEachOutput should be OK without that length at the end, as it will lazy construct the zipped list (only evaluating up to two values at a time). However, you'd be better off using "zipWith f" rather than "map f . zip".

There'd still be the problem of

drop a $ take b content

, so nothing can be garbage collected before everything's done.

separateOutpout file = let contents = lines file split = break ("Log for " `isPrefixOf`) msplit [] = Nothing msplit lns = Just (split lns) in drop 1 $ unfoldr msplit contents

should fix it.

-- 竹密岂妨流水过山高哪阻野云飞

Ivan Miljenovic

6:14 a.m.

On 25 May 2010 16:12, Magicloud Magiclouds wrote:

...

Yes, this code works with a little hack. Thank you.

I'm scared to ask: what pray tell is this little hack? -- Ivan Lazar Miljenovic Ivan.Miljenovic@gmail.com IvanMiljenovic.wordpress.com

Daniel Fischer

8:19 a.m.

On Tuesday 25 May 2010 08:14:13, Ivan Miljenovic wrote:

...

On 25 May 2010 16:12, Magicloud Magiclouds

wrote:

...
Yes, this code works with a little hack. Thank you.

I'm scared to ask: what pray tell is this little hack?

Looking at it again, probably making it work at all, because I never consumed the "Log for " lines, so produced an infinite list of empty lists :-/ Something like separateOutput :: String -> [[String]] separateOutput file = let contents = dropWhile (not . ("Log for " `isPrefixOf`)) $ lines file split [] = ([],[]) split (h:tl) = let (lg,rst) = break ("Log for " `isPrefixOf`) tl in (h:lg,rst) msplit [] = Nothing msplit lns = Just (split lns) in unfoldr msplit contents which really works :)

Ivan Lazar Miljenovic

8:44 a.m.

Daniel Fischer writes:

...

On Tuesday 25 May 2010 08:14:13, Ivan Miljenovic wrote:

...
On 25 May 2010 16:12, Magicloud Magiclouds

wrote:

...
Yes, this code works with a little hack. Thank you.

I'm scared to ask: what pray tell is this little hack?

Looking at it again, probably making it work at all, because I never consumed the "Log for " lines, so produced an infinite list of empty lists :-/

Oh, yeah, I've done that before (with custom chunking functions, etc.). Just so you know, the split package might already have a function to do what you want... -- Ivan Lazar Miljenovic Ivan.Miljenovic@gmail.com IvanMiljenovic.wordpress.com

Daniel Fischer

9:17 a.m.

On Tuesday 25 May 2010 10:44:57, Ivan Lazar Miljenovic wrote:

...

Daniel Fischer writes:

...
On Tuesday 25 May 2010 08:14:13, Ivan Miljenovic wrote:

...
On 25 May 2010 16:12, Magicloud Magiclouds

wrote:

...
Yes, this code works with a little hack. Thank you.

I'm scared to ask: what pray tell is this little hack?

Looking at it again, probably making it work at all, because I never consumed the "Log for " lines, so produced an infinite list of empty lists

:-/

Oh, yeah, I've done that before (with custom chunking functions, etc.).

Who hasn't?

...

Just so you know, the split package might already have a function to do what you want...

Sort of. Not one function, but the building blocks: import Data.List (isPrefixOf) import Data.List.Split msplit = split (keepDelimsL $ whenElt ("Log for " `isPrefixOf`)) I think. But who looks at API docs in the dead of the night when cooking up the function seems so easy :)

5520

Age (days ago)

5520

Last active (days ago)

List overview

Download

11 comments

4 participants

participants (4)

Daniel Fischer
Ivan Lazar Miljenovic
Ivan Miljenovic
Magicloud Magiclouds

How to deal with huge text file?

tags

participants (4)