
This Chrome plugin helps avoid the Hackage version problem btw:
https://chrome.google.com/webstore/detail/hackage-fu/dnpldbohleinhdgfnhlkofp...
On Mon, May 12, 2014 at 1:32 PM, Bob Ippolito
I don't know why I ended up looking at an old version of Data.Text but the copy function has been around for a year, since 0.11.3.0. http://hackage.haskell.org/package/text-1.1.1.2/docs/Data-Text.html#copy– you'll want to use it for this use case.
On Mon, May 12, 2014 at 9:54 AM, Bob Ippolito
wrote: I haven't looked closely but I suspect if you use foldM to build your Map rather than untilM it might consume less memory since you won't be allocating this big list.
BUT… the real reason why the memory usage is so much higher than you expect is because slicing a Data.Text is an O(1) operation that shares the underlying buffer between the original and the slice. Calling T.words will ensure that the full original line stays around. You can see this in the implementation:
http://hackage.haskell.org/package/text-0.11.2.0/docs/src/Data-Text.html#wor...
This behavior is documented somewhat near there in the docs, but I think it should really be a top-level thing: http://hackage.haskell.org/package/text-0.11.2.0/docs/Data-Text.html#g:18
I don't know what function to use to force the array to be copied, hopefully there is one! Erlang's binaries work similarly and there is a copy function for them for exactly this purpose: http://www.erlang.org/doc/man/binary.html
-bob
On Sun, May 11, 2014 at 4:24 AM, Jan Snajder
wrote: Dear all,
I'm trying to implement a simple file-based database. I apparently have a space leak, but I have no clue where it comes from.
Here's the file-based database implementation: http://pastebin.com/QqiqcXFw
The idea to have a database table in a single textual file. One line equals one table row. The fields within a row are whitespace separated. The first field is the key. Because I'd like to work with large files, I don't want to load the whole file into memory. Instead, I'd like to be able to fetch the rows on demand, by keys. Thus I first create an index that links keys to file seeks. I use the readerT to add the index to the IO monad.
For testing, I use a dummy table produced as follows:
import System.IO import Text.Printf import Control.Monad
row = unwords [printf "field%03d" (i::Int) | i <- [1..999]]
main = do forM_ [1..250000] $ \i -> putStrLn $ printf "row%06d %s" (i::Int) row
This generates a 2.1G textual file, which I store on my disk.
The testing code:
import FileDB import qualified Data.Text as T import Text.Printf import Control.Applicative import Control.Monad import Control.Monad.Trans import System.IO import System.Environment
main = do (f:_) <- getArgs t <- openTable f runDB t $ do ks <- getKeys liftIO $ do putStrLn . printf "%d keys read" $ length ks putStrLn "Press any key to continue..." getChar forM_ ks $ \k -> do Just r <- getRow k liftIO . putStrLn $ printf "Row \"%s\" has %d fields" (T.unpack k) (length r)
When I run the test on the 2.1GB file, the whole program consumes 10GB.
6GB seem to be allocated after the index is built (just before entering the forM_ function). The remaining 4GB are allocated while fetching all the rows.
I find both things difficult to explain.
6GB seems too much for the index. Each key is 9 characters (stored as Data.Text), and I have 250K such keys in a Data.Map. Should this really add up to 6GB?
Also, I have no idea why fetching all the rows, one by one, should consume any additional memory. Each row is fetched and its length is computed and printed out. I see no reason for the rows to be retained in the memory.
Here's the memory allocation summary:
1,093,931,338,632 bytes allocated in the heap 2,225,144,704 bytes copied during GC 4,533,898,000 bytes maximum residency (26 sample(s)) 3,080,926,336 bytes maximum slop 10004 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause Gen 0 2171739 colls, 0 par 45.29s 45.26s 0.0000s 0.0030s Gen 1 26 colls, 0 par 1.50s 1.53s 0.0589s 0.7087s
INIT time 0.00s ( 0.00s elapsed) MUT time 279.92s (284.85s elapsed) GC time 46.80s ( 46.79s elapsed) EXIT time 0.68s ( 0.71s elapsed) Total time 327.40s (332.35s elapsed)
%GC time 14.3% (14.1% elapsed)
Alloc rate 3,908,073,170 bytes per MUT second
Productivity 85.7% of total user, 84.4% of total elapsed
Btw., I don't get the "bytes allocated in the heap" figure, which is approx. 1000 GB (?).
I'm obviously doing something wrong here. I'd be thankful for any help.
Best, Jan _______________________________________________ Beginners mailing list Beginners@haskell.org http://www.haskell.org/mailman/listinfo/beginners
_______________________________________________ Beginners mailing list Beginners@haskell.org http://www.haskell.org/mailman/listinfo/beginners