Hello, I tried to load a ~50 MB CSV file into memory with cassava, but found that my program was incredibly slow. After some profiling, I realized that it was using an enormous amount of heap memory:
24,626,540,552 bytes allocated in the heap
6,946,460,688 bytes copied during GC
2,000,644,712 bytes maximum residency (14 sample(s))
319,728,944 bytes maximum slop
3718 MB total memory in use (0 MB lost due to fragmentation)
...
%GC time 84.0% (94.3% elapsed)
Seeing that, I suspect that my program lacks strictness and is accumulating thunks in memory. I tried two versions, one using Data.Csv and one using Data.Csv.Streaming; both give the same result. What am I doing wrong?
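In case it matters: my understanding is that a bang pattern only evaluates to weak head normal form (WHNF), so the elements of a vector can remain thunks even after the `!`. Here is a small self-contained illustration of the difference, using `force` from the deepseq package (the values here are made up just for illustration):

```haskell
{-# LANGUAGE BangPatterns #-}
import Control.DeepSeq (force)
import qualified Data.Vector as V

main :: IO ()
main = do
  -- A bang pattern evaluates only to WHNF: the vector's spine
  -- exists, but each element may still be an unevaluated thunk.
  let !shallow = V.fromList [1 + 1, 2 + 2 :: Int]
  -- `force` from Control.DeepSeq evaluates the whole structure,
  -- elements included, before the bang pattern binds it.
  let !deep = force shallow
  print (V.sum deep)  -- prints 6
```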
Here are the two sources:
1/

{-# LANGUAGE BangPatterns #-}

import Data.Csv
import Data.ByteString.Lazy (ByteString)
import qualified Data.ByteString.Lazy as BL
import qualified Data.Vector as V

main :: IO ()
main = do
  csv <- BL.readFile "tt.csv"
  let !res = case decode NoHeader csv of
               Left err -> error err
               Right q  -> q :: V.Vector (V.Vector ByteString)
  print $ res V.! 0
--------------------------------
2/

{-# LANGUAGE BangPatterns #-}

import Data.Csv.Streaming
import Data.ByteString.Lazy (ByteString)
import qualified Data.ByteString.Lazy as BL
import qualified Data.Vector as V
import Data.Foldable (foldr')

main :: IO ()
main = do
  csv <- BL.readFile "tt.csv"
  let !a   = decode NoHeader csv :: Records (V.Vector ByteString)
  let !xx  = V.fromList [V.fromList []] :: V.Vector (V.Vector ByteString)
  let !res = foldr' V.cons xx a
  print $ res V.! 0
The goal of the program is ultimately to have the CSV loaded in memory as a Vector of Vectors of ByteString, for further processing later on.
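To make the target shape concrete, here is a tiny self-contained sketch of the value I am after, using an in-memory CSV in place of tt.csv and building the outer vector in a single pass with V.fromList (the sample data is made up):

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Csv.Streaming (decode, HasHeader(..), Records)
import qualified Data.ByteString.Lazy.Char8 as BL
import Data.ByteString (ByteString)
import Data.Foldable (toList)
import qualified Data.Vector as V

main :: IO ()
main = do
  -- a two-row in-memory CSV standing in for tt.csv
  let csv  = BL.pack "a,b\nc,d\n"
      recs = decode NoHeader csv :: Records (V.Vector ByteString)
      -- the target shape: a Vector of rows, each row a Vector of fields
      res  = V.fromList (toList recs) :: V.Vector (V.Vector ByteString)
  print (res V.! 0)
```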
Thank you for your help,
Antoine