Sneaking Haskell in the workplace -- cleaning CSV files

I need to remove newlines from CSV files (within columns, not at the ends of entire lines). This is prior to importing into a database, and was being done at my workplace by a Java class for quite a while, until the files being processed got bigger and it proved too slow. (The files are up to ~250MB at the moment.) It was rewritten in PL/SQL, to run after the import, which was an improvement, but it still has our creaky db server thrashing away. (You may have lots of helpful suggestions in mind, but we can't clean the data at source, and AFAIK we can't do it incrementally because there is no timestamp or anything marking the last change to a row in the legacy db.)

We don't need a general solution - if a line ends with a delimiter, we can be sure it's the end of the entire line, because that's the way the CSV files are generated.

I had a quick go with ByteString (with no attempt at robustness etc.) and although I haven't compared it properly, it seems faster than what we have now. But you can easily make it faster, surely! Hints for improvement please (e.g. can I unbox anything, make anything strict, or is that handled by ByteString? Is there a more efficient library function to replace the fold...?).

module Main where

import System.Environment (getArgs)
import qualified Data.ByteString.Char8 as B

-- remove newlines in the middle of 'columns'
clean :: Char -> [B.ByteString] -> [B.ByteString]
clean d = foldr (\x ys -> if B.null x || B.last x == d
                          then x : ys
                          else (B.append x $ head ys) : tail ys) []

main = do
  args <- getArgs
  if length args < 2
    then putStrLn "Usage: crunchFile INFILE OUTFILE [DELIM]"
    else do
      bs <- B.readFile (args !! 0)
      let d = if length args == 3 then head (args !! 2) else '"'
      B.writeFile (args !! 1) $ (B.unlines . clean d . B.lines) bs

Thanks,

Jim
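For concreteness, here is a small worked example of what clean does (a hypothetical GHCi session; the sample data echoes the examples later in this thread):

> clean '~' (map B.pack ["~sdf ", "dfkj~, ~dfsd~", "~sdlkfj~, ~dsdkjf~"])
["~sdf dfkj~, ~dfsd~","~sdlkfj~, ~dsdkjf~"]

The first input line does not end with the delimiter, so it is glued onto the line that follows it; the other lines pass through untouched.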

Thomas Schilling wrote:
On 15 jun 2007, at 18.13, Jim Burton wrote:
import qualified Data.ByteString.Char8 as B
Have you tried
import qualified Data.ByteString.Lazy.Char8 as B
?
No -- I'll give it a try and compare them. Is laziness preferable here? Thanks,

On 15 jun 2007, at 21.14, Jim Burton wrote:
Thomas Schilling wrote:
On 15 jun 2007, at 18.13, Jim Burton wrote:
import qualified Data.ByteString.Char8 as B
Have you tried import qualified Data.ByteString.Lazy.Char8 as B ?
No -- I'll give it a try and compare them. Is laziness preferable here?
Yes, since you were talking about big files. If you don't have to keep the data around, lazy ByteStrings will keep the memory footprint low.
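For reference, the switch is essentially the import line alone, since Data.ByteString.Lazy.Char8 exports the same functions the program uses (readFile, writeFile, lines, unlines, null, last, append). A sketch of the suggestion, not tested against the real files:

import qualified Data.ByteString.Lazy.Char8 as B  -- was: Data.ByteString.Char8

-- The rest of the program compiles unchanged; the lazy readFile reads
-- the file in chunks, so the whole ~250MB file never needs to be
-- resident in memory at once.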

On 15/06/07, Jim Burton wrote:
I need to remove newlines from CSV files (within columns, not at the ends of entire lines). This is prior to importing into a database, and was being done at my workplace by a Java class for quite a while, until the files being processed got bigger and it proved too slow. (The files are up to ~250MB at the moment.) It was rewritten in PL/SQL, to run after the import, which was an improvement, but it still has our creaky db server thrashing away. (You may have lots of helpful suggestions in mind, but we can't clean the data at source, and AFAIK we can't do it incrementally because there is no timestamp or anything marking the last change to a row in the legacy db.)
We don't need a general solution - if a line ends with a delimiter, we can be sure it's the end of the entire line, because that's the way the CSV files are generated.
I had a quick go with ByteString (with no attempt at robustness etc.) and although I haven't compared it properly, it seems faster than what we have now. But you can easily make it faster, surely! Hints for improvement please (e.g. can I unbox anything, make anything strict, or is that handled by ByteString? Is there a more efficient library function to replace the fold...?).
module Main where

import System.Environment (getArgs)
import qualified Data.ByteString.Char8 as B

-- remove newlines in the middle of 'columns'
clean :: Char -> [B.ByteString] -> [B.ByteString]
clean d = foldr (\x ys -> if B.null x || B.last x == d
                          then x : ys
                          else (B.append x $ head ys) : tail ys) []

main = do
  args <- getArgs
  if length args < 2
    then putStrLn "Usage: crunchFile INFILE OUTFILE [DELIM]"
    else do
      bs <- B.readFile (args !! 0)
      let d = if length args == 3 then head (args !! 2) else '"'
      B.writeFile (args !! 1) $ (B.unlines . clean d . B.lines) bs
Hi,

I haven't compiled this, but you get the general idea:

import qualified Data.ByteString.Lazy.Char8 as B

-- takes a bytestring representing the file, concats the lines,
-- then splits it up into "real" lines using the delimiter
clean :: Char -> B.ByteString -> [B.ByteString]
clean d = B.split d . B.concat . B.lines

-- Sebastian Sylvan

Sebastian Sylvan wrote:
On 15/06/07, Jim Burton wrote: [snip]

Hi,

I haven't compiled this, but you get the general idea:

import qualified Data.ByteString.Lazy.Char8 as B

-- takes a bytestring representing the file, concats the lines,
-- then splits it up into "real" lines using the delimiter
clean :: Char -> B.ByteString -> [B.ByteString]
clean d = B.split d . B.concat . B.lines

Hi Sebastian,
I think that would only work if there was one column per line... I didn't make it clear that as well as being comma separated, the delimiter is around each column, of which there are several on a line, so if the delimiter is ~ a file might look like:

~sdlkfj~, ~dsdkjf~ #eo row 1
~sdf dfkj~, ~dfsd~ #eo row 2
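A quick illustration of the problem (a hypothetical GHCi session, with the lazy Char8 module imported as B): splitting on the delimiter slices at every column boundary, not just at row ends.

> B.split '~' (B.pack "~sdlkfj~, ~dsdkjf~")
["","sdlkfj",", ","dsdkjf",""]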

On 6/15/07, Jim Burton wrote:
Sebastian Sylvan wrote:
On 15/06/07, Jim Burton wrote: [snip]

Hi,

I haven't compiled this, but you get the general idea:

import qualified Data.ByteString.Lazy.Char8 as B

-- takes a bytestring representing the file, concats the lines,
-- then splits it up into "real" lines using the delimiter
clean :: Char -> B.ByteString -> [B.ByteString]
clean d = B.split d . B.concat . B.lines

Hi Sebastian,
I think that would only work if there was one column per line... I didn't make it clear that as well as being comma separated, the delimiter is around each column, of which there are several on a line, so if the delimiter is ~ a file might look like:

~sdlkfj~, ~dsdkjf~ #eo row 1
~sdf dfkj~, ~dfsd~ #eo row 2
I love to see people using Haskell, especially professionally, but I have to wonder if the real tool for this job is sed? :-)

Jason

Jason Dagit wrote: [snip]
I love to see people using Haskell, especially professionally, but I have to wonder if the real tool for this job is sed? :-)
Jason
Maybe it is -- I've never used sed. (Cue oohs and aahs from the gallery?) But from the (unquantified) gains so far, Haskell may well be enough of an improvement to fit the bill, though I'd be interested in anything that improves on it further still.

On Jun 15, 2007, at 18:37, Jason Dagit wrote:
I love to see people using Haskell, especially professionally, but I have to wonder if the real tool for this job is sed? :-)
Actually, while sed could do that, it'd be a nightmare. You really want a parser to deal with general CSV like this, and while you can write parsers in sed, you *really* don't want to. :)

-- Brandon S. Allbery (KF8NH)
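For the record, a minimal sketch of the kind of parser meant here, hand-rolled on plain Strings for readability rather than speed. The quoting convention ('~' around fields, ',' between them) is assumed from the examples in this thread; escaped quotes are not handled.

-- Parse a whole file into rows of fields; q is the quote character.
parseCsv :: Char -> String -> [[String]]
parseCsv _ "" = []
parseCsv q s  = let (row, rest) = parseRecord q s in row : parseCsv q rest

-- One record: fields separated by commas, terminated by a newline
-- (or by the end of the input).
parseRecord :: Char -> String -> ([String], String)
parseRecord q s = case parseField q s of
  (f, ',':rest)  -> let (fs, rest') = parseRecord q (dropWhile (== ' ') rest)
                    in (f : fs, rest')
  (f, '\n':rest) -> ([f], rest)
  (f, rest)      -> ([f], rest)

-- One field: either quoted (and then it may contain ',' and '\n')
-- or bare (read up to the next separator).
parseField :: Char -> String -> (String, String)
parseField q (c:cs)
  | c == q = go cs
  where
    go (x:xs) | x == q    = ("", xs)
              | otherwise = let (f, r) = go xs in (x : f, r)
    go []                 = ("", [])
parseField _ s = span (`notElem` ",\n") s

A quoted field may contain newlines and commas, which is exactly what makes the sed approach painful.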

"Brandon S. Allbery KF8NH"
On Jun 15, 2007, at 18:37, Jason Dagit wrote:
I love to see people using Haskell, especially professionally, but I have to wonder if the real tool for this job is sed? :-)
Actually, while sed could do that, it'd be a nightmare. You really want a parser to deal with general CSV like this, and while you can write parsers in sed, you *really* don't want to. :)
sed ':a /,$/!{N;s/\n//;ba}' somefile.csv

On Jun 17, 2007, at 14:04, Pete Kazmier wrote:
"Brandon S. Allbery KF8NH"
writes: On Jun 15, 2007, at 18:37 , Jason Dagit wrote:
I love to see people using Haskell, especially professionally, but I have to wonder if the real tool for this job is sed? :-)
Actually, while sed could do that, it'd be a nightmare. You really want a parser to deal with general CSV like this, and while you can write parsers in sed, you *really* don't want to. :)
sed ':a /,$/!{N;s/\n//;ba}' somefile.csv
If it's simple enough, sure. If you get into dealing with quoting, it gets a lot more complex in a hurry. Of course, given that this is from Oracle, it *probably* doesn't. (But as a sysadmin and ex-DBA, "probably" is worrisome....)

-- Brandon S. Allbery (KF8NH)

On 15/06/07, Jim Burton wrote:
Sebastian Sylvan wrote:
On 15/06/07, Jim Burton wrote: [snip]

Hi,

I haven't compiled this, but you get the general idea:

import qualified Data.ByteString.Lazy.Char8 as B

-- takes a bytestring representing the file, concats the lines,
-- then splits it up into "real" lines using the delimiter
clean :: Char -> B.ByteString -> [B.ByteString]
clean d = B.split d . B.concat . B.lines

Hi Sebastian,
I think that would only work if there was one column per line... I didn't make it clear that as well as being comma separated, the delimiter is around each column, of which there are several on a line, so if the delimiter is ~ a file might look like:

~sdlkfj~, ~dsdkjf~ #eo row 1
~sdf dfkj~, ~dfsd~ #eo row 2
Ah, sorry - I thought the delimiter was a line delimiter. I'm trying to get to that fusion goodness by using built-in functions as much as possible... How about this one:

clean del = B.map (B.filter (/= '\n')) . B.groupBy (\x y -> (x, y) /= (del, '\n'))

That groupBy will group it into groups which don't have the delimiter followed by a newline in them (which is the sequence your rows end with), then it filters out newlines in each row. You might want to filter out spaces first (if there are any) so that you don't get a space between the delimiter and newline at the end...

-- Sebastian Sylvan

Sebastian Sylvan wrote:
Ah, sorry - I thought the delimiter was a line delimiter. I'm trying to get to that fusion goodness by using built-in functions as much as possible...
How about this one:
clean del = B.map (B.filter (/= '\n')) . B.groupBy (\x y -> (x, y) /= (del, '\n'))
That groupBy will group it into groups which don't have the delimiter followed by a newline in them (which is the sequence your rows end with), then it filters out newlines in each row. You might want to filter out spaces first (if there are any) so that you don't get a space between the delimiter and newline at the end...
I think you still need unlines after that, so is the time complexity different from the

unlines . foldr (function including `last`) . lines

in my first post? Or is it better for another reason, such as "fusion goodness"?

On 16/06/07, Jim Burton wrote:
Sebastian Sylvan wrote:
Ah, sorry - I thought the delimiter was a line delimiter. I'm trying to get to that fusion goodness by using built-in functions as much as possible...

How about this one:

clean del = B.map (B.filter (/= '\n')) . B.groupBy (\x y -> (x, y) /= (del, '\n'))

That groupBy will group it into groups which don't have the delimiter followed by a newline in them (which is the sequence your rows end with), then it filters out newlines in each row. You might want to filter out spaces first (if there are any) so that you don't get a space between the delimiter and newline at the end...

I think you still need unlines after that, so is the time complexity different from the

unlines . foldr (function including `last`) . lines

in my first post? Or is it better for another reason, such as "fusion goodness"?
Benchmark it, I guess :-) Both versions use non-ByteString recursive functions (the outer B.map should just be a straight map, and yours uses a foldr), which may mess fusion up... Not sure what would happen here... I don't have a Haskell compiler at this computer so I can't try anything out...

-- Sebastian Sylvan
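In that spirit, a rough timing harness one might use to compare the versions. This is only a sketch: the file names are placeholders, and the fold below replaces the partial head/tail with an explicit case, which doesn't change its behaviour on well-formed input.

module Main where

import System.CPUTime (getCPUTime)
import qualified Data.ByteString.Lazy.Char8 as B

-- Jim's fold, written with a case instead of head/tail.
cleanFold :: Char -> [B.ByteString] -> [B.ByteString]
cleanFold d = foldr step []
  where
    step x ys
      | B.null x || B.last x == d = x : ys
      | otherwise = case ys of
          (y:ys') -> B.append x y : ys'
          []      -> [x]

-- Run an action and report the CPU time it took (in seconds).
time :: String -> IO a -> IO a
time label act = do
  t0 <- getCPUTime
  r  <- act
  t1 <- getCPUTime
  putStrLn (label ++ ": "
            ++ show (fromIntegral (t1 - t0) / (10 ^ 12) :: Double)
            ++ "s CPU")
  return r

main :: IO ()
main = do
  bs <- B.readFile "big.csv"  -- placeholder input file
  time "fold" (B.writeFile "out.csv"
                 (B.unlines (cleanFold '~' (B.lines bs))))

Because the pipeline is lazy, the writeFile inside the timed action is what forces the work, so the measurement covers the whole clean-and-write. The groupBy version could be timed the same way once its outer B.map is changed to a plain map.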

On 6/15/07, Sebastian Sylvan wrote:
Benchmark it, I guess :-) Both versions use non-ByteString recursive functions (the outer B.map should just be a straight map, and yours uses a foldr), which may mess fusion up... Not sure what would happen here... I don't have a Haskell compiler at this computer so I can't try anything out...
I just remembered this recent thread about fast bytestring parsing: http://www.nabble.com/Fast-number-parsing-with-strict-bytestrings--Was%3A-Re...

Perhaps there is an idea or two that can be applied here?

Jason

On Fri, Jun 15, 2007 at 11:31:36PM +0100, Jim Burton wrote:
I think that would only work if there was one column per line... I didn't make it clear that as well as being comma separated, the delimiter is around each column, of which there are several on a line, so if the delimiter is ~ a file might look like:

~sdlkfj~, ~dsdkjf~ #eo row 1
~sdf dfkj~, ~dfsd~ #eo row 2
It would be easier to experiment if you could provide us with an example input file. If you are worried about revealing sensitive information, you can change all characters other than newline, ~ and , to "A"s, for example. An accompanying output file, for checking correctness, would be even nicer.

Best regards
Tomek

Tomasz Zielonka wrote:
On Fri, Jun 15, 2007 at 11:31:36PM +0100, Jim Burton wrote:
I think that would only work if there was one column per line... I didn't make it clear that as well as being comma separated, the delimiter is around each column, of which there are several on a line, so if the delimiter is ~ a file might look like:

~sdlkfj~, ~dsdkjf~ #eo row 1
~sdf dfkj~, ~dfsd~ #eo row 2
It would be easier to experiment if you could provide us with an example input file. If you are worried about revealing sensitive information, you can change all characters other than newline, ~ and , to "A"s, for example. An accompanying output file, for checking correctness, would be even nicer.
Hi Tomasz,

I can do that, but they do essentially look like the example above, except with 10-30 columns, more data in each column, and more rows, maybe this side of a million. They are produced by an Oracle export which escapes the delimiter (often a tilde) within the columns. The output file should have exactly one row per line, with extra newlines replaced by a string given as a parameter (it might be a space or an HTML tag -- I only just remembered this, and my initial effort doesn't do it).

Thanks,
Jim
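That extra requirement is a small change to the original fold: instead of appending the continuation line directly, splice in the replacement string. A sketch (the name cleanWith is made up, and it assumes the lazy Char8 module as B):

-- Like clean, but joins broken lines with rep instead of nothing.
cleanWith :: Char -> B.ByteString -> [B.ByteString] -> [B.ByteString]
cleanWith d rep = foldr step []
  where
    step x ys
      | B.null x || B.last x == d = x : ys
      | otherwise = case ys of
          (y:ys') -> B.concat [x, rep, y] : ys'
          []      -> [x]

-- e.g. B.unlines (cleanWith '~' (B.pack "<br/>") (B.lines bs))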

On Sat, Jun 16, 2007 at 12:08:22PM +0100, Jim Burton wrote:
Tomasz Zielonka wrote:
It would be easier to experiment if you could provide us with an example input file. If you are worried about revealing sensitive information, you can change all characters other than newline, ~ and , to "A"s, for example. An accompanying output file, for checking correctness, would be even nicer.
Hi Tomasz, I can do that, but they do essentially look like the example above, except with 10-30 columns, more data in each column, and more rows, maybe this side of a million. They are produced by an Oracle export which escapes the delimiter (often a tilde) within the columns. The output file should have exactly one row per line, with extra newlines replaced by a string given as a parameter (it might be a space or an HTML tag -- I only just remembered this, and my initial effort doesn't do it).
I guess you've tried to convince Oracle to produce the right format in the first place, so there would be no need for post-processing...?

I wonder what you would get if you set the delimiter to be a newline ;-)

Best regards
Tomek

Tomasz Zielonka wrote:
I guess you've tried to convince Oracle to produce the right format in the first place, so there would be no need for post-processing...?
We don't control that job or the first db.
I wonder what you would get if you set the delimiter to be a newline ;-)
eek! ;-)
participants (7)

- Brandon S. Allbery KF8NH
- Jason Dagit
- Jim Burton
- Pete Kazmier
- Sebastian Sylvan
- Thomas Schilling
- Tomasz Zielonka