External Sort: Sort a 10-million integer file with just 256M of ram.

Now on hackage: cabal install external-sort demo (included in distribution): thartman@thartman-laptop:~/external-sort>cat demo.hs {-# LANGUAGE PatternSignatures #-} import Algorithms.ExternalSort import Data.List import System.IO import System.Environment (getArgs) import System.Time import HSH -- to do: compare speed against unix sort util on a 10 million line file. -- pure in-memory prelude sort will crash your computer when the list gets over a million elements or so -- externalsort caches the sublists used in the sort algorithm on your hard drive, so you can sort a much larger list. {- The behavior below was on a demo executable, compiled. (In ghci, even last on a 10 million element list caused an out of memory error.) The test computer had 256M physical ram and was ulimited to 256M cache. *Main>:! ulimit -v 262144 For 10 million element list: *Main> :! time ./demo preludesort 7 demo: out of memory (requested 1048576 bytes) Command exited with non-zero status 1 4.88user 0.68system 0:21.11elapsed 26%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (0major+64817minor)pagefaults 0swaps *Main> :! time ./demo externalsort 7 10000000 73.87user 1.96system 1:24.25elapsed 90%CPU (0avgtext+0avgdata 0maxresident)k 792inputs+156280outputs (6major+16739minor)pagefaults 0swaps ./demo unixsort 7 wrote bigfile, time: Mon Oct 20 15:25:26 CEST 2008 demo: out of memory (requested 1048576 bytes) For 100 million element list, external sort failed. Can it be made to work? maybe in some future version. time ./demo externalsort 8 demo: out of memory (requested 1048576 bytes) real 10m14.061s user 8m26.712s sys 0m11.793s thartman@thartman-laptop:~/external-sort>ls -lh ExternalSort.bin -rw-r--r-- 1 thartman thartman 764M Oct 20 15:50 ExternalSort.bin The problem is not fitting a 10^8 element list in memory, the following works fine (when compiled, though not in ghci): t = putStrLn . show . last $ [1..10^8::Int] Maybe think about this more later. -} main = do [s,e] <- getArgs let exp = read e case s of "preludesort" -> sortwith exp $ return . sort "externalsort" -> sortwith exp externalSort "unixsort" -> unixsort exp _ -> let msg = "usage: ./demo preludesort 7 or ./demo externalsort 7 or ./demo unixsort 7 \ \(sort 10 million element list)" in fail msg sortwith exp s = putStrLn =<< return . show . last =<< s ([1..10^exp ::Int]) unixsort exp = do let fn = "bigfile" withFile fn AppendMode (\h -> (mapM_ (hPutStrLn h . show) ([1..10^exp::Int]) ) ) putStrLn . ( ("wrote " ++ fn ++ ", time: ") ++ ) . show =<< getClockTime run $ "time tail -n1 | sort " ++fn :: IO String return () t = putStrLn . show . last $ [1..10^8::Int]

Hello Thomas, Thursday, October 23, 2008, 8:41:04 PM, you wrote:
The problem is not fitting a 10^8 element list in memory, the following works fine (when compiled, though not in ghci): t = putStrLn . show . last $ [1..10^8::Int]
this runs in 1k space, thanks to lazy evaluation. 10^8-length list needs ~3gb of memory, it was discussed just a few days ago -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

Bulat Ziganshin wrote:
Hello Thomas,
Thursday, October 23, 2008, 8:41:04 PM, you wrote:
The problem is not fitting a 10^8 element list in memory, the following works fine (when compiled, though not in ghci): t = putStrLn . show . last $ [1..10^8::Int]
this runs in 1k space, thanks to lazy evaluation. 10^8-length list needs ~3gb of memory, it was discussed just a few days ago
To elaborate, t does this: compute the next item in the list, throw the previous item away, until there is no next item, now we have something to print. We never keep the whole list. But this may keep the whole list: u = (putStrLn . show . last $ list) >> (putStrLn . show . head $ list) where list = [1..10^8::Int] Have fun!

Hello Albert, Saturday, October 25, 2008, 9:02:14 PM, you wrote:
u = (putStrLn . show . last $ list) >> (putStrLn . show . head $ list) where list = [1..10^8::Int]
i prefer to write it as main = do let list = [1..10^8] print (last list) print (head list) -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

Bulat Ziganshin
Hello Albert,
Saturday, October 25, 2008, 9:02:14 PM, you wrote:
u = (putStrLn . show . last $ list) >> (putStrLn . show . head $ list) where list = [1..10^8::Int]
i prefer to write it as
main = do let list = [1..10^8] print (last list) print (head list)
Don't you slam pointless style! -- (c) this sig last receiving data processing entity. Inspect headers for copyright history. All rights reserved. Copying, hiring, renting, performance and/or quoting of this signature prohibited.

"Felipe Lessa"
On Mon, Oct 27, 2008 at 6:47 AM, Achim Schneider
wrote: Don't you slam pointless style!
main = mapM_ (($ [1..10^8::Int]) . (.) (putStrLn . show)) [last, head]
=)
Hmmm... Template Haskell... repetitive code... deficiency... -- (c) this sig last receiving data processing entity. Inspect headers for copyright history. All rights reserved. Copying, hiring, renting, performance and/or quoting of this signature prohibited.
participants (5)
-
Achim Schneider
-
Albert Y. C. Lai
-
Bulat Ziganshin
-
Felipe Lessa
-
Thomas Hartman