
Time to write actual code and do some measurements :) The code attached at the end of the message gets compiled with -O2. Writing a sample test file happens with (writeTest #columns #rows) like in writeTest 4 500000 (~7 seconds, ~770MB heap (=:-o), 3MB test file). I assume the heap spaces from writeTest are everything added together as 'top' does not report any memory bursts. The following matrix represents the performance comparisons between four possibilities. The times get produced by calling (test #matrixrow (filtertest #matrixcol)) map transpose ++[nl] 2.76,2.87,2.83 2.72,2.88,2.96
2.72,2.92,2.87 2.82,2.79,2.88
No significant differences. In a test with more rows, >> seems to perform slightly better (as expected). Transpose is a bit better, too: writeTest 10 750000 (~24 seconds, ~2.8GB heap (=:-o), 15MB test file) map transpose ++[nl] 3.50,3.59,3.42 3.23,3.26,3.29,3.19
3.38,3.41,3.41 3.19,3.14,3.23
Looks like my measurements somewhat disagree with yours. Odd. But note that by discriminating the to be tested functionality on run-time, the compiler gets no chance to optimize things for the particular case. So in reality, (++[nl]) could trigger a good code transformation whereas (>>) does not. Also note that transpose is very lazy and is far cheaper than it looks. Somehow, the 2.8 and 3.5 seconds are not in proportion with respect to the inputs of 3MB and 15MB or the outputs of 590KB and 400KB (yes, the smaller input produces a larger output). Your 13 seconds versus 90 seconds makes this even more puzzling. But it looks like writing a CSV file is far more expensive than reading one. Maybe it's not a good idea to call hPut very often.
the mask (map (`elem` tags) cols) is only computed once (shouldn't the compiler do that automatically since the expression is constant?) [...]
col x cols row = row !! i where Just i = lookup x $ zip cols [0..]
One has to be careful, col x cols = \row -> row !! i where Just i = lookup x $ zip cols [0..] is different as it shares i across all rows. The compiler is likely not to do this easy transformation ("full laziness" transformation), for col because this can introduce space leaks. These are things the programmer should have control over, so no optimization here. See also http://haskell.org/haskellwiki/GHC/FAQ#When_can_I_rely_on_full_laziness.3F I think the reason given there is wrong, it's not about efficiency but about space leaks. The map showcase suggests that (map (`elem` tags) cols) is computed only once, though personally, I don't rely on that (yet). Regards, apfelmus
---------------------------------------- module CSV where
import qualified Data.ByteString.Lazy.Char8 as B import Data.List import System.IO
{------------------------------------------------------------------------------- Reading and writing CSV (comma separated value) files --------------------------------------------------------------------------------}
readCSV :: FilePath -> IO [[B.ByteString]] readCSV file = do v <- B.readFile file return $ map (B.split ',') $ B.lines v
writeCSV :: Int -> FilePath -> [[B.ByteString]] -> IO () writeCSV i file tbl = do h <- openFile file WriteMode mapM_ (writeRow i h) tbl hClose h
writeRow j = case j of 1 -> \h -> mapM_ (B.hPut h) . (++ [nl]) . intersperse comma 2 -> \h row -> (mapM_ (B.hPut h) $ intersperse comma row) >> B.hPut h nl where comma = B.singleton ',' nl = B.singleton '\n'
{------------------------------------------------------------------------------- Processing [[ByteString]] --------------------------------------------------------------------------------} select j targs test (cols : rows) = narrow $ cols : filter (test cols) rows where narrow = colmap j (map snd . filter fst . zip mask) mask = map (`elem` targs) cols
colmap :: Int -> (forall a . [a] -> [a]) -> [[a]] -> [[a]] colmap 1 f = map f colmap 2 f = transpose . f . transpose
col x cols = \row -> row !! i where Just i = lookup x $ zip cols [0..]
if12 = ((== B.pack "2") .) . col (B.pack "1") filtertest j = select j (map B.pack ["1","2","4"]) if12
test i f = readCSV "test.csv" >>= writeCSV i "result.csv" . f
{------------------------------------------------------------------------------- Test cases --------------------------------------------------------------------------------} rotated :: Int -> [[B.ByteString]] rotated n = map (take n) . iterate (drop n) . concat . repeat . map (B.pack . show) $ [1..(n+1)]
writeTest c r = writeCSV 1 "test.csv" . take r . rotated $ c