
I am trying to use Haskell as a replacement for R but running into two problems which I describe below. Are there any plans to address the performance issues I have encountered? 1. I seem to have to jump through a lot of hoops just to be able to select the data I am interested in. {-# LANGUAGE ScopedTypeVariables #-} {-# OPTIONS_GHC -Wall #-} import Data.Csv hiding ( decodeByName ) import qualified Data.Vector as V import Data.ByteString ( ByteString ) import qualified Data.ByteString.Char8 as B import qualified Pipes.Prelude as P import qualified Pipes.ByteString as Bytes import Pipes import qualified Pipes.Csv as Csv import System.IO import qualified Control.Foldl as L main :: IO () main = withFile "examples/787338586_T_ONTIME.csv" ReadMode $ \h -> do let csvs :: Producer (V.Vector ByteString) IO () csvs = Csv.decode HasHeader (Bytes.fromHandle h) >-> P.concat uvectors :: Producer (V.Vector ByteString) IO () uvectors = csvs >-> P.map (V.foldr V.cons V.empty) vec_vec <- L.impurely P.foldM L.vector uvectors print $ (vec_vec :: V.Vector (V.Vector ByteString)) V.! 17 print $ V.length vec_vec let rockspring = V.filter (\x -> x V.! 8 == B.pack "RKS") vec_vec print $ V.length rockspring Here's the equivalent R: df <- read.csv("787338586_T_ONTIME.csv") rockspring <- df[df$ORIGIN == "RKS",] 2. Now I think I could improve the above to make an environment that is more similar to the one my colleagues are used to in R but more problematical is the memory usage. * 112.5M file * Just loading the source into ghci takes 142.7M * > foo <- readFile "examples/787338586_T_ONTIME.csv" > length foo takes me up to 4.75G. But we probably don't want to do this! * Let's try again. * > :set -XScopedTypeVariables * > h <- openFile "examples/787338586_T_ONTIME.csv" ReadMode * > let csvs :: Producer (V.Vector ByteString) IO () = Csv.decode HasHeader (Bytes.fromHandle h) >-> P.concat * > let uvectors :: Producer (V.Vector ByteString) IO () = csvs >-> P.map (V.map id) >-> P.map (V.foldr V.cons V.empty) * > vec_vec :: V.Vector (V.Vector ByteString) <- L.impurely P.foldM L.vector uvectors * Now I am up at 3.17G. In R I am under 221.3M. * > V.length rockspring takes a long time to return 155 and now I am at 3.5G!!! In R > rockspring <- df[df$ORIGIN == "RKS",] seems instantaneous and now uses only 379.5M. * > length(rockspring) 37 > length(df$ORIGIN) 471949 i.e. there are 37 columns and 471,949 rows. Running this as an executable gives ~/Dropbox/Private/labels $ ./examples/BugReport +RTS -s ["2014-01-01","EV","20366","N904EV","2512","10747","1074702","30747", "BRO","Brownsville, TX","Texas","11298","1129803","30194", "DFW","Dallas/Fort Worth, TX","Texas","0720","0718", "-2.00","8.00","0726","0837","7.00","0855","0844","-11.00","0.00", "","0.00","482.00","","","","","",""] 471949 155 14,179,764,240 bytes allocated in the heap 3,378,342,072 bytes copied during GC 786,333,512 bytes maximum residency (13 sample(s)) 36,933,976 bytes maximum slop 1434 MB total memory in use (0 MB lost due to fragmentation) Tot time (elapsed) Avg pause Max pause Gen 0 26989 colls, 0 par 1.423s 1.483s 0.0001s 0.0039s Gen 1 13 colls, 0 par 1.005s 1.499s 0.1153s 0.6730s INIT time 0.000s ( 0.003s elapsed) MUT time 3.195s ( 3.193s elapsed) GC time 2.428s ( 2.982s elapsed) EXIT time 0.016s ( 0.138s elapsed) Total time 5.642s ( 6.315s elapsed) %GC time 43.0% (47.2% elapsed) Alloc rate 4,437,740,019 bytes per MUT second Productivity 57.0% of total user, 50.9% of total elapsed