
Greg Buchholz wrote:
True. But there are some tests like "fasta" that appear to have a laziness induced space leak that presumably could be fixed.
http://shootout.alioth.debian.org/benchmark.php?test=fasta&lang=all
Playing with the space-leaking code did not show any obvious reason to expect the space leak. By commenting out the third output in main it did not leak space. Could some look at the original code at the link above and let me know why there is a 300 MB space leak? But if you never express huge lists, then it won't leak space. Here is a tweaked version, called tweak6.hs, that interleaves making and writing a line of output. It runs in 2.16 MB of RSIZE instead of over 300 MB. Still slow, however. If you want more speed without extra libraries, then use an unboxed array of Word8 (length of one line: 60 bytes), since Unicode Char and linked lists are overkill. The other killer is probably the array of tuple linear lookup to find the chosen symbol. Hand coded if/then branches could be used to speed that up. For even faster speed, do not use 0.0 to 1.0 probabilities at all. Instead of normalize :: Int -> Double and then lookup (Double -> Base) remove the use of Double and do (Int -> Base) lookup. But the space leak was the embarrassing part. So I just fixed that. -- Chris Kuklewicz --------------- -- The Great Computer Language Shootout -- http://shootout.alioth.debian.org/ -- contributed by Jeff Newbern -- Modified to tweak6.hs by Chris Kuklewicz import Control.Monad.Trans import Control.Monad.State import System.IO (hFlush,stdout) {- original $ time ./fasta.ghc_run 2500000 > /dev/null real 1m17.596s user 0m44.838s sys 0m3.322s 330+ Megs! Space leaking all over the place. No idea why. $ time ./fasta.tweak6 2500000 > /dev/null real 0m30.477s user 0m23.356s sys 0m0.744s 2.2 Megs RSIZE -} -- Uses random generation code derived from Simon Marlow and Einar Karttunen's -- "random" test entry. -- Orignal version note: This code has not been optimized *at all*. -- It is written to be clear and to follow standard Haskell idioms as -- much as possible (but we have to match the stateful PRNG idiom in -- the test definition, which is oriented toward an imperative -- language). Performance is decent with ghc -O2, but if it can be -- improved without sacrificing the clarity of the code, by all means -- go for it! -- Mondified tweak6.hs version note: Use a StateT around IO to manage -- the seed. This makes interleaving the generation and output of the -- sequence easier. It is only *minimally* abstracted. import System(getArgs) type Base = Char type Sequence = [Base] alu :: Sequence -- predefined sequence alu = "GGCCGGGCGCGGTGGCTCACGCCTGTAATCCCAGCACTTTGG" ++ "GAGGCCGAGGCGGGCGGATCACCTGAGGTCAGGAGTTCGAGA" ++ "CCAGCCTGGCCAACATGGTGAAACCCCGTCTCTACTAAAAAT" ++ "ACAAAAATTAGCCGGGCGTGGTGGCGCGCGCCTGTAATCCCA" ++ "GCTACTCGGGAGGCTGAGGCAGGAGAATCGCTTGAACCCGGG" ++ "AGGCGGAGGTTGCAGTGAGCCGAGATCGCGCCACTGCACTCC" ++ "AGCCTGGGCGACAGAGCGAGACTCCGTCTCAAAAA" type BaseFrequency = (Base,Double) iub :: [BaseFrequency] iub = [ ('a', 0.27), ('c', 0.12), ('g', 0.12), ('t', 0.27), ('B', 0.02), ('D', 0.02), ('H', 0.02), ('K', 0.02), ('M', 0.02), ('N', 0.02), ('R', 0.02), ('S', 0.02), ('V', 0.02), ('W', 0.02), ('Y', 0.02) ] homosapiens :: [BaseFrequency] homosapiens = [ ('a', 0.3029549426680), ('c', 0.1979883004921), ('g', 0.1975473066391), ('t', 0.3015094502008) ] -- select a base whose interval covers the given double chooseBase :: [BaseFrequency] -> Double -> Base chooseBase [(b,_)] _ = b chooseBase ((b,f):xs) p | p < f = b | otherwise = chooseBase xs (p-f) type Seed = Int type Pseudo a = StateT Seed IO a prng :: Pseudo Double prng = let nextSeed s = (s * ia + ic) `mod` im normalize n = (fromIntegral n) * (1.0 / fromIntegral im) im = 139968; ia = 3877; ic = 29573 in do seed <- get let seed' = nextSeed seed put seed' return (normalize seed') prngN count = sequence $ replicate count prng -- write a sequence in Fasta format writeFasta :: String -> String -> Sequence -> IO () writeFasta label title sequence = do putStrLn $ ">" ++ (label ++ (" " ++ title)) writeWrapped 60 sequence where writeWrapped _ [] = do return () writeWrapped len str = do let (s1,s2) = splitAt len str putStrLn s1 writeWrapped len s2 writeFastaHeader :: (MonadIO m) => String -> String -> m () writeFastaHeader label title = liftIO $ putStrLn $ ">" ++ (label ++ (" " ++ title)) writeWrapped' :: Int -> Int -> (Double->Base) -> Pseudo () writeWrapped' wrap total trans = let work c = case c of 0 -> return () n -> do let c' = min wrap n nextC = c - c' s <- liftM (map trans) (prngN c') liftIO $ putStrLn s work nextC in work total writeWrapped = writeWrapped' 60 main = do args <- getArgs let n = if (null args) then 1000 else read (head args) writeFasta "ONE" "Homo sapiens alu" (take (2*n) (cycle alu)) writeFastaHeader "TWO" "IUB ambiguity codes" seed' <- execStateT (writeWrapped (3*n) (chooseBase iub)) 42 writeFastaHeader "THREE" "Homo sapiens frequency" execStateT (writeWrapped (5*n) (chooseBase homosapiens)) seed'