Sneaking Haskell in the workplace -- cleaning CSV files

I need to remove newlines from CSV files (within columns, not at the ends of entire lines). This is prior to importing into a database, and was being done at my workplace by a Java class for quite a while, until the files being processed got bigger and it proved too slow. (The files are up to ~250MB at the moment.) It was rewritten in PL/SQL, to run after the import, which was an improvement, but it still has our creaky db server thrashing away. (You may have lots of helpful suggestions in mind, but we can't clean the data at source, and AFAIK we can't do it incrementally because there is no timestamp or anything marking the last change to a row in the legacy db.)

We don't need a general solution - if a line ends with a delimiter, we can be sure it's the end of the entire line, because that's the way the CSV files are generated.

I had a quick go with ByteString (with no attempt at robustness etc.) and although I haven't compared it properly, it seems faster than what we have now. But you can easily make it faster, surely! Hints for improvement please (e.g. can I unbox anything, make anything strict, or is that handled by ByteString? Is there a more efficient library function to replace the fold...?).

module Main where

import System.Environment (getArgs)
import qualified Data.ByteString.Char8 as B

-- remove newlines in the middle of 'columns'
clean :: Char -> [B.ByteString] -> [B.ByteString]
clean d = foldr (\x ys -> if B.null x || B.last x == d
                          then x : ys
                          else (B.append x $ head ys) : tail ys) []

main = do
  args <- getArgs
  if length args < 2
    then putStrLn "Usage: crunchFile INFILE OUTFILE [DELIM]"
    else do
      bs <- B.readFile (args !! 0)
      let d = if length args == 3 then head (args !! 2) else '"'
      B.writeFile (args !! 1) $ (B.unlines . clean d . B.lines) bs

Thanks,

Jim
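For concreteness, here is a small worked example of what clean does (a hypothetical GHCi session; the sample data echoes the examples later in this thread):

> clean '~' (map B.pack ["~sdf ", "dfkj~, ~dfsd~", "~sdlkfj~, ~dsdkjf~"])
["~sdf dfkj~, ~dfsd~","~sdlkfj~, ~dsdkjf~"]

The first input line does not end with the delimiter, so it is glued onto the line that follows it; the other lines pass through untouched.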

Thomas Schilling wrote:
On 15 jun 2007, at 18.13, Jim Burton wrote:
import qualified Data.ByteString.Char8 as B
Have you tried
import qualified Data.ByteString.Lazy.Char8 as B
?
No -- I'll give it a try and compare them. Is laziness preferable here? Thanks,

On 15 jun 2007, at 21.14, Jim Burton wrote:
Thomas Schilling wrote:
On 15 jun 2007, at 18.13, Jim Burton wrote:
import qualified Data.ByteString.Char8 as B
Have you tried import qualified Data.ByteString.Lazy.Char8 as B ?
No -- I'll give it a try and compare them. Is laziness preferable here?
Yes, since you were talking about big files. If you don't have to keep the data around, lazy ByteStrings will keep the memory footprint low.
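For reference, the switch is essentially the import line alone, since Data.ByteString.Lazy.Char8 exports the same functions the program uses (readFile, writeFile, lines, unlines, null, last, append). A sketch of the suggestion, not tested against the real files:

import qualified Data.ByteString.Lazy.Char8 as B  -- was: Data.ByteString.Char8

-- The rest of the program compiles unchanged; the lazy readFile reads
-- the file in chunks, so the whole ~250MB file never needs to be
-- resident in memory at once.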

On 15/06/07, Jim Burton wrote:
I need to remove newlines from CSV files (within columns, not at the ends of entire lines). This is prior to importing into a database, and was being done at my workplace by a Java class for quite a while, until the files being processed got bigger and it proved too slow. (The files are up to ~250MB at the moment.) It was rewritten in PL/SQL, to run after the import, which was an improvement, but it still has our creaky db server thrashing away. (You may have lots of helpful suggestions in mind, but we can't clean the data at source, and AFAIK we can't do it incrementally because there is no timestamp or anything marking the last change to a row in the legacy db.)
We don't need a general solution - if a line ends with a delimiter, we can be sure it's the end of the entire line, because that's the way the CSV files are generated.
I had a quick go with ByteString (with no attempt at robustness etc.) and although I haven't compared it properly, it seems faster than what we have now. But you can easily make it faster, surely! Hints for improvement please (e.g. can I unbox anything, make anything strict, or is that handled by ByteString? Is there a more efficient library function to replace the fold...?).
module Main where

import System.Environment (getArgs)
import qualified Data.ByteString.Char8 as B

-- remove newlines in the middle of 'columns'
clean :: Char -> [B.ByteString] -> [B.ByteString]
clean d = foldr (\x ys -> if B.null x || B.last x == d
                          then x : ys
                          else (B.append x $ head ys) : tail ys) []

main = do
  args <- getArgs
  if length args < 2
    then putStrLn "Usage: crunchFile INFILE OUTFILE [DELIM]"
    else do
      bs <- B.readFile (args !! 0)
      let d = if length args == 3 then head (args !! 2) else '"'
      B.writeFile (args !! 1) $ (B.unlines . clean d . B.lines) bs
Hi,

I haven't compiled this, but you get the general idea:

import qualified Data.ByteString.Lazy.Char8 as B

-- takes a bytestring representing the file, concats the lines,
-- then splits it up into "real" lines using the delimiter
clean :: Char -> B.ByteString -> [B.ByteString]
clean d = B.split d . B.concat . B.lines

-- Sebastian Sylvan

Sebastian Sylvan wrote:
On 15/06/07, Jim Burton wrote: [snip]

Hi,

I haven't compiled this, but you get the general idea:

import qualified Data.ByteString.Lazy.Char8 as B

-- takes a bytestring representing the file, concats the lines,
-- then splits it up into "real" lines using the delimiter
clean :: Char -> B.ByteString -> [B.ByteString]
clean d = B.split d . B.concat . B.lines

Hi Sebastian,
I think that would only work if there was one column per line... I didn't make it clear that as well as being comma separated, the delimiter is around each column, of which there are several on a line, so if the delimiter is ~ a file might look like:

~sdlkfj~, ~dsdkjf~ #eo row 1
~sdf dfkj~, ~dfsd~ #eo row 2
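A quick illustration of the problem (a hypothetical GHCi session, with the lazy Char8 module imported as B): splitting on the delimiter slices at every column boundary, not just at row ends.

> B.split '~' (B.pack "~sdlkfj~, ~dsdkjf~")
["","sdlkfj",", ","dsdkjf",""]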

On 6/15/07, Jim Burton wrote:
Sebastian Sylvan wrote:
On 15/06/07, Jim Burton wrote: [snip]

Hi,

I haven't compiled this, but you get the general idea:

import qualified Data.ByteString.Lazy.Char8 as B

-- takes a bytestring representing the file, concats the lines,
-- then splits it up into "real" lines using the delimiter
clean :: Char -> B.ByteString -> [B.ByteString]
clean d = B.split d . B.concat . B.lines

Hi Sebastian,
I think that would only work if there was one column per line... I didn't make it clear that as well as being comma separated, the delimiter is around each column, of which there are several on a line, so if the delimiter is ~ a file might look like:

~sdlkfj~, ~dsdkjf~ #eo row 1
~sdf dfkj~, ~dfsd~ #eo row 2
I love to see people using Haskell, especially professionally, but I have to wonder if the real tool for this job is sed? :-)

Jason

Jason Dagit wrote: [snip]
I love to see people using Haskell, especially professionally, but I have to wonder if the real tool for this job is sed? :-)
Jason
Maybe it is -- I've never used sed. (Cue oohs and aahs from the gallery?) But from the (unquantified) gains so far, Haskell may well be enough of an improvement to fit the bill, though I'd be interested in anything that improves on it further still.

On Jun 15, 2007, at 18:37, Jason Dagit wrote:
I love to see people using Haskell, especially professionally, but I have to wonder if the real tool for this job is sed? :-)
Actually, while sed could do that, it'd be a nightmare. You really want a parser to deal with general CSV like this, and while you can write parsers in sed, you *really* don't want to. :)

-- Brandon S. Allbery (KF8NH)
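For the record, a minimal sketch of the kind of parser meant here, hand-rolled on plain Strings for readability rather than speed. The quoting convention ('~' around fields, ',' between them) is assumed from the examples in this thread; escaped quotes are not handled.

-- Parse a whole file into rows of fields; q is the quote character.
parseCsv :: Char -> String -> [[String]]
parseCsv _ "" = []
parseCsv q s  = let (row, rest) = parseRecord q s in row : parseCsv q rest

-- One record: fields separated by commas, terminated by a newline
-- (or by the end of the input).
parseRecord :: Char -> String -> ([String], String)
parseRecord q s = case parseField q s of
  (f, ',':rest)  -> let (fs, rest') = parseRecord q (dropWhile (== ' ') rest)
                    in (f : fs, rest')
  (f, '\n':rest) -> ([f], rest)
  (f, rest)      -> ([f], rest)

-- One field: either quoted (and then it may contain ',' and '\n')
-- or bare (read up to the next separator).
parseField :: Char -> String -> (String, String)
parseField q (c:cs)
  | c == q = go cs
  where
    go (x:xs) | x == q    = ("", xs)
              | otherwise = let (f, r) = go xs in (x : f, r)
    go []                 = ("", [])
parseField _ s = span (`notElem` ",\n") s

A quoted field may contain newlines and commas, which is exactly what makes the sed approach painful.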

"Brandon S. Allbery KF8NH"
On Jun 15, 2007, at 18:37, Jason Dagit wrote:
I love to see people using Haskell, especially professionally, but I have to wonder if the real tool for this job is sed? :-)
Actually, while sed could do that, it'd be a nightmare. You really want a parser to deal with general CSV like this, and while you can write parsers in sed, you *really* don't want to. :)
sed ':a /,$/!{N;s/\n//;ba}' somefile.csv

On Jun 17, 2007, at 14:04, Pete Kazmier wrote:
"Brandon S. Allbery KF8NH"
writes: On Jun 15, 2007, at 18:37 , Jason Dagit wrote:
I love to see people using Haskell, especially professionally, but I have to wonder if the real tool for this job is sed? :-)
Actually, while sed could do that, it'd be a nightmare. You really want a parser to deal with general CSV like this, and while you can write parsers in sed, you *really* don't want to. :)
sed ':a /,$/!{N;s/\n//;ba}' somefile.csv
If it's simple enough, sure. If you get into dealing with quoting, it gets a lot more complex in a hurry. Of course, given that this is from Oracle, it *probably* doesn't. (But as a sysadmin and ex-DBA, "probably" is worrisome....)

-- Brandon S. Allbery (KF8NH)

On 15/06/07, Jim Burton wrote:
Sebastian Sylvan wrote:
On 15/06/07, Jim Burton wrote: [snip]

Hi,

I haven't compiled this, but you get the general idea:

import qualified Data.ByteString.Lazy.Char8 as B

-- takes a bytestring representing the file, concats the lines,
-- then splits it up into "real" lines using the delimiter
clean :: Char -> B.ByteString -> [B.ByteString]
clean d = B.split d . B.concat . B.lines

Hi Sebastian,
I think that would only work if there was one column per line... I didn't make it clear that as well as being comma separated, the delimiter is around each column, of which there are several on a line, so if the delimiter is ~ a file might look like:

~sdlkfj~, ~dsdkjf~ #eo row 1
~sdf dfkj~, ~dfsd~ #eo row 2
Ah, sorry - I thought the delimiter was a line delimiter. I'm trying to get to that fusion goodness by using built-in functions as much as possible... How about this one:

clean del = B.map (B.filter (/= '\n')) . B.groupBy (\x y -> (x, y) /= (del, '\n'))

That groupBy will group it into groups which don't have the delimiter followed by a newline in them (which is the sequence your rows end with), then it filters out newlines in each row. You might want to filter out spaces first (if there are any) so that you don't get a space between the delimiter and newline at the end...

-- Sebastian Sylvan

Sebastian Sylvan wrote:
Ah, sorry - I thought the delimiter was a line delimiter. I'm trying to get to that fusion goodness by using built-in functions as much as possible...
How about this one:
clean del = B.map (B.filter (/= '\n')) . B.groupBy (\x y -> (x, y) /= (del, '\n'))
That groupBy will group it into groups which don't have the delimiter followed by a newline in them (which is the sequence your rows end with), then it filters out newlines in each row. You might want to filter out spaces first (if there are any) so that you don't get a space between the delimiter and newline at the end...
I think you still need unlines after that, so is the time complexity different from the

unlines . foldr (function including `last`) . lines

in my first post? Or is it better for another reason, such as "fusion goodness"?

On 16/06/07, Jim Burton wrote:
Sebastian Sylvan wrote:
Ah, sorry - I thought the delimiter was a line delimiter. I'm trying to get to that fusion goodness by using built-in functions as much as possible...

How about this one:

clean del = B.map (B.filter (/= '\n')) . B.groupBy (\x y -> (x, y) /= (del, '\n'))

That groupBy will group it into groups which don't have the delimiter followed by a newline in them (which is the sequence your rows end with), then it filters out newlines in each row. You might want to filter out spaces first (if there are any) so that you don't get a space between the delimiter and newline at the end...

I think you still need unlines after that, so is the time complexity different from the

unlines . foldr (function including `last`) . lines

in my first post? Or is it better for another reason, such as "fusion goodness"?
Benchmark it, I guess :-) Both versions use non-ByteString recursive functions (the outer B.map should just be a straight map, and yours uses a foldr), which may mess fusion up... Not sure what would happen here... I don't have a Haskell compiler at this computer so I can't try anything out...

-- Sebastian Sylvan
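In that spirit, a rough timing harness one might use to compare the versions. This is only a sketch: the file names are placeholders, and the fold below replaces the partial head/tail with an explicit case, which doesn't change its behaviour on well-formed input.

module Main where

import System.CPUTime (getCPUTime)
import qualified Data.ByteString.Lazy.Char8 as B

-- Jim's fold, written with a case instead of head/tail.
cleanFold :: Char -> [B.ByteString] -> [B.ByteString]
cleanFold d = foldr step []
  where
    step x ys
      | B.null x || B.last x == d = x : ys
      | otherwise = case ys of
          (y:ys') -> B.append x y : ys'
          []      -> [x]

-- Run an action and report the CPU time it took (in seconds).
time :: String -> IO a -> IO a
time label act = do
  t0 <- getCPUTime
  r  <- act
  t1 <- getCPUTime
  putStrLn (label ++ ": "
            ++ show (fromIntegral (t1 - t0) / (10 ^ 12) :: Double)
            ++ "s CPU")
  return r

main :: IO ()
main = do
  bs <- B.readFile "big.csv"  -- placeholder input file
  time "fold" (B.writeFile "out.csv"
                 (B.unlines (cleanFold '~' (B.lines bs))))

Because the pipeline is lazy, the writeFile inside the timed action is what forces the work, so the measurement covers the whole clean-and-write. The groupBy version could be timed the same way once its outer B.map is changed to a plain map.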

On 6/15/07, Sebastian Sylvan wrote:
Benchmark it, I guess :-) Both versions use non-ByteString recursive functions (the outer B.map should just be a straight map, and yours uses a foldr), which may mess fusion up... Not sure what would happen here... I don't have a Haskell compiler at this computer so I can't try anything out...
I just remembered this recent thread about fast bytestring parsing: http://www.nabble.com/Fast-number-parsing-with-strict-bytestrings--Was%3A-Re...

Perhaps there is an idea or two that can be applied here?

Jason

On Fri, Jun 15, 2007 at 11:31:36PM +0100, Jim Burton wrote:
I think that would only work if there was one column per line... I didn't make it clear that as well as being comma separated, the delimiter is around each column, of which there are several on a line, so if the delimiter is ~ a file might look like:

~sdlkfj~, ~dsdkjf~ #eo row 1
~sdf dfkj~, ~dfsd~ #eo row 2
It would be easier to experiment if you could provide us with an example input file. If you are worried about revealing sensitive information, you can change all characters other than newline, ~ and , to "A"s, for example. An accompanying output file, for checking correctness, would be even nicer.

Best regards
Tomek

Tomasz Zielonka wrote:
On Fri, Jun 15, 2007 at 11:31:36PM +0100, Jim Burton wrote:
I think that would only work if there was one column per line... I didn't make it clear that as well as being comma separated, the delimiter is around each column, of which there are several on a line, so if the delimiter is ~ a file might look like:

~sdlkfj~, ~dsdkjf~ #eo row 1
~sdf dfkj~, ~dfsd~ #eo row 2
It would be easier to experiment if you could provide us with an example input file. If you are worried about revealing sensitive information, you can change all characters other than newline, ~ and , to "A"s, for example. An accompanying output file, for checking correctness, would be even nicer.
Hi Tomasz,

I can do that, but they do essentially look like the example above, except with 10-30 columns, more data in each column, and more rows, maybe this side of a million. They are produced by an Oracle export which escapes the delimiter (often a tilde) within the columns. The output file should have exactly one row per line, with extra newlines replaced by a string given as a parameter (it might be a space or an HTML tag -- I only just remembered this, and my initial effort doesn't do it).

Thanks,
Jim
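That extra requirement is a small change to the original fold: instead of appending the continuation line directly, splice in the replacement string. A sketch (the name cleanWith is made up, and it assumes the lazy Char8 module as B):

-- Like clean, but joins broken lines with rep instead of nothing.
cleanWith :: Char -> B.ByteString -> [B.ByteString] -> [B.ByteString]
cleanWith d rep = foldr step []
  where
    step x ys
      | B.null x || B.last x == d = x : ys
      | otherwise = case ys of
          (y:ys') -> B.concat [x, rep, y] : ys'
          []      -> [x]

-- e.g. B.unlines (cleanWith '~' (B.pack "<br/>") (B.lines bs))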

On Sat, Jun 16, 2007 at 12:08:22PM +0100, Jim Burton wrote:
Tomasz Zielonka wrote:
It would be easier to experiment if you could provide us with an example input file. If you are worried about revealing sensitive information, you can change all characters other than newline, ~ and , to "A"s, for example. An accompanying output file, for checking correctness, would be even nicer.
Hi Tomasz, I can do that, but they do essentially look like the example above, except with 10-30 columns, more data in each column, and more rows, maybe this side of a million. They are produced by an Oracle export which escapes the delimiter (often a tilde) within the columns. The output file should have exactly one row per line, with extra newlines replaced by a string given as a parameter (it might be a space or an HTML tag -- I only just remembered this, and my initial effort doesn't do it).
I guess you've tried to convince Oracle to produce the right format in the first place, so there would be no need for post-processing...?

I wonder what you would get if you set the delimiter to be a newline ;-)

Best regards
Tomek

Tomasz Zielonka wrote:
I guess you've tried to convince Oracle to produce the right format in the first place, so there would be no need for post-processing...?
We don't control that job or the first db.
I wonder what you would get if you set the delimiter to be a newline ;-)
eek! ;-)
participants (7)

- Brandon S. Allbery KF8NH
- Jason Dagit
- Jim Burton
- Pete Kazmier
- Sebastian Sylvan
- Thomas Schilling
- Tomasz Zielonka