
jeff:
I switched to Data.Binary, which dropped me from 2.6GB to 1.5GB, and then I switched this afternoon to unboxed arrays from lists of floats, and that dropped me again from 1.5GB to 475MB. I think, all told, that I'm in an acceptable range now, and thank you for pointing out the library mistake. I'm also down from 1.5 minutes load time to under 10 seconds of load time, which is also very very nice. Incidentally, the code I'm now using is:
Good!
binaryLoadDocumentCoordinates :: String -> IO (Ptr Float, Array.UArray Int Int) binaryLoadDocumentCoordinates path = do putStrLn "File opened" coordinates <- decodeFile (path ++ "/Clusters.bin") :: IO (Array.UArray Int Float) print . Array.bounds $ coordinates putStrLn "Got coordinates" galaxies <- decodeFile (path ++ "/Galaxies.bin") :: IO (Array.UArray Int Int) putStrLn "Got galaxies" coordinatesArr <- mallocArray . snd . Array.bounds $ coordinates putStrLn "Allocated array" pokeArray coordinatesArr . Array.elems $ coordinates return (coordinatesArr, galaxies)
binarySaveDocumentCoordinates :: String -> [Point] -> IO () binarySaveDocumentCoordinates path points = do let len = length points encodeFile (path ++ "Clusters.bin") . (Array.listArray (0,len*3) :: [Float] -> Array.UArray Int Float) . coordinateList . solve $ points encodeFile (path ++ "Galaxies.bin") . (Array.listArray (0,len) :: [Int] -> Array.UArray Int Int) . galaxyList $ points
You could improve this further by removing the intermediate list serialisation and construction required by the UArray instance. My previous example did that, using Ptr Int arrays instead of UArray, which can then be serialised by casting to a bytestring, and writing those bytes directly, rather than serialising via lists, as UArrays do. Removing the UArray serialisation means no allocation overhead at all to serialise the Int array. That final step would reduce the memory overhead to exactly the size of the input file, and probably shave at least 50% off the time, if you need to further improve it. That is, you'd have: binaryLoadDocumentCoordinates :: String -> IO (Ptr Float, IntTable) where IntTable is a shallow wrapper over Ptr Int, serialised with the techniques used here, http://haskell.org/haskellwiki/Serialisation_and_compression_with_Data_Binar... newtype IntTable = IntTable (Ptr Int) with a suitable Binary instance (possibly compressing it on the fly too). The general lesson is to avoid lists of any kind (including those constructed implicitly when serialising Arrays), as soon as you have more than 1M or so of data stored in those lists. -- Don