External Sort and unsafeInterleaveIO

hi folks -- a haskell newbie here, searching for comments and wisdom on my code. i had a project to try to implement "external sort" in haskell as a learning exercise. (external sort is sorting a list that is too large to fit in main memory, by sorting in chunks, spooling to disk, and then merging. more properly there probably should be multiple stages, but for simplicity i'm doing a one-stage external sort.) the trick is the number of files can quickly grow very large, so it is best to use one large file and seek inside it around. however as one can imagine the order-of-IO-operations becomes a bit tricky, if you're seeking file handles around underneath Data.ByteString.Lazy's nose. but late this night after not thinking about it for a while i had a brainstorm: rewrite hGetContents to keep the handle position in the right place! it's all about judicious use of unsafeInterleaveIO..... it seems to be rather fast, strangely faster than the usual "sort" at times. it also seems to have nice memory characteristics, though not perfect. it's hard to test because the normal "sort" function takes too much RAM on large lists, making my computer swap like mad. i'd appreciate any testing, comments and suggestions from the haskell gods out there. my thanks to the Data.ByteString.Lazy, Data.Binary, and Data.Edison people, who made this rather easy, once I grokked unsafeInterleaveIO. thanks and take care, B
module ExternalSort where
Sort a list of Ords "offline." We're doing this to be able to sort things without taking up too much memory (for example sorting lists too large to fit in RAM.) Laziness is imperative, as is the order-of-operations.
import Control.Monad import Data.List import qualified Data.Binary as Bin import qualified Data.ByteString.Lazy as B import qualified Data.ByteString as P (hGetNonBlocking, null) import Data.ByteString.Base (LazyByteString(LPS)) import Foreign.Storable (sizeOf) import System.IO (openFile, hClose, hSeek, hTell, hIsEOF, hWaitForInput, Handle, IOMode(ReadMode, WriteMode), SeekMode(AbsoluteSeek)) import System.IO.Unsafe (unsafeInterleaveIO)
import qualified Data.Edison.Seq.ListSeq as LS import qualified Data.Edison.Coll.SplayHeap as Splay
Conceptually, we sort a list in blocks, spool blocks to disk, then merge back. However for IO performance it is better to read off chunks of elements off the sorted blocks from disk instead of elements-at-a-time. It would be better if these were in KBytes instead of # of elements.
blocksize :: Int blocksize = 10000
Turn a list into a list of chunks.
slice :: Int -> [a] -> [[a]] slice _ [] = [] slice size l = (take size l) : (slice size $ drop size l)
Turn a list into a list of blocks, each of which is sorted.
blockify :: (Ord a) => Int -> [a] -> [[a]] blockify bsize l = map sort $ slice bsize l
Serialize a block, returning the (absolute) position of the start.
dumpBlock :: (Ord a, Bin.Binary a) => Handle -> [a] -> IO Integer dumpBlock h b = do start <- hTell h B.hPut h $ Bin.encode b return start
The actual sorting function. We blockify the list, turning it into a list of sorted blocks, and spool to disk, keeping track of offsets. We then read back the blocks (lazily!), and merge them.
externalSort [] = do return [] externalSort l = do h <- openFile "ExternalSort.bin" WriteMode idx <- mapM (\x -> dumpBlock h x) (blockify blocksize l) hClose h h <- openFile "ExternalSort.bin" ReadMode blocks <- mapM (\x -> do {bs <- hGetContentsWithCursor h x; return $ Bin.decode bs}) idx return (kMerge $ blocks)
Merging chunks. K-way merge (and in fact external sort in general) is detailed in Knuth, where he recommends tournament trees. The easiest thing is to probably use one of Okasaki's heaps. I'll use splay heaps, because I don't know any better. It would be better if I changed Ord for blocks to only check the first element.
kMerge :: (Ord a) => [[a]] -> [a] kMerge [] = [] kMerge l = let h = Splay.fromSeq l in kM (Splay.minElem h) (Splay.deleteMin h) where kM :: (Ord a) => [a] -> Splay.Heap [a] -> [a] kM l h | h == Splay.empty = l | otherwise = let next = Splay.minElem h (f, b) = span (\x -> x <= head next) l in f ++ (kM next (if null b then Splay.deleteMin h else (Splay.insert b $ Splay.deleteMin h)))
kMergeSort :: (Ord a) => [a] -> [a] kMergeSort l = kMerge $ blockify blocksize l
This is a version of hGetContents which resets its handle position between reads, so is safe to use with interleaved handle seeking.
hGetContentsWithCursor :: Handle -> Integer -> IO B.ByteString hGetContentsWithCursor = hGetContentsWithCursorN defaultChunkSize
hGetContentsWithCursorN :: Int -> Handle -> Integer -> IO B.ByteString hGetContentsWithCursorN k h start = (lazyRead start) >>= return . LPS where lazyRead start = unsafeInterleaveIO $ loop start
loop start = do hSeek h AbsoluteSeek start ps <- P.hGetNonBlocking h k --TODO: I think this should distinguish EOF from no data available -- the otherlying POSIX call makes this distincion, returning either -- 0 or EAGAIN if P.null ps then do eof <- hIsEOF h if eof then return [] else hWaitForInput h (-1) >> (loop start) else do pos <- hTell h pss <- lazyRead pos return (ps : pss)
defaultChunkSize :: Int defaultChunkSize = 32 * k - overhead where k = 1024 overhead = 2 * sizeOf (undefined :: Int)

Ben wrote:
a haskell newbie here, searching for comments and wisdom on my code.
i had a project to try to implement "external sort" in haskell as a learning exercise. (external sort is sorting a list that is too large to fit in main memory, by sorting in chunks, spooling to disk, and then merging. more properly there probably should be multiple stages, but for simplicity i'm doing a one-stage external sort.)
i'd appreciate any testing, comments and suggestions from the haskell gods out there.
I'm not a god but I like it very much :) Especially because it's laziness in action.
blocks <- mapM (\x -> do {bs <- hGetContentsWithCursor h x; return $ Bin.decode bs}) idx
(Minuscule cosmetics: blocks <- mapM ((liftM Bin.decode) . hGetContentsWithCursor h) idx )
Merging chunks. K-way merge (and in fact external sort in general) is detailed in Knuth, where he recommends tournament trees. The easiest thing is to probably use one of Okasaki's heaps. I'll use splay heaps, because I don't know any better.
It would be better if I changed Ord for blocks to only check the first element.
kMerge :: (Ord a) => [[a]] -> [a] kMerge [] = [] kMerge l = let h = Splay.fromSeq l in kM (Splay.minElem h) (Splay.deleteMin h) where kM :: (Ord a) => [a] -> Splay.Heap [a] -> [a] kM l h | h == Splay.empty = l | otherwise = let next = Splay.minElem h (f, b) = span (\x -> x <= head next) l in f ++ (kM next (if null b then Splay.deleteMin h else (Splay.insert b $ Splay.deleteMin h)))
kMergeSort :: (Ord a) => [a] -> [a] kMergeSort l = kMerge $ blockify blocksize l
Oh, I would have expected a lazy mergesort here. Internally, this will work like a tournament heap. See also http://article.gmane.org/gmane.comp.lang.haskell.cafe/24180 Regards, apfelmus

midfield:
hi folks --
a haskell newbie here, searching for comments and wisdom on my code.
i had a project to try to implement "external sort" in haskell as a learning exercise. (external sort is sorting a list that is too large to fit in main memory, by sorting in chunks, spooling to disk, and then merging. more properly there probably should be multiple stages, but for simplicity i'm doing a one-stage external sort.)
the trick is the number of files can quickly grow very large, so it is best to use one large file and seek inside it around. however as one can imagine the order-of-IO-operations becomes a bit tricky, if you're seeking file handles around underneath Data.ByteString.Lazy's nose. but late this night after not thinking about it for a while i had a brainstorm: rewrite hGetContents to keep the handle position in the right place! it's all about judicious use of unsafeInterleaveIO.....
it seems to be rather fast, strangely faster than the usual "sort" at times. it also seems to have nice memory characteristics, though not perfect. it's hard to test because the normal "sort" function takes too much RAM on large lists, making my computer swap like mad.
I have to agree with Mr. Apfelmus here. This is lovely code. It is exactly what the ByteString team hoped people would be able to write ByteStrings: "Zen of Haskell" code, where you win by working at a high level, rather than a low level. Thanks! I've inserted some small comments though the source:
module ExternalSort where
Sort a list of Ords "offline." We're doing this to be able to sort things without taking up too much memory (for example sorting lists too large to fit in RAM.) Laziness is imperative, as is the order-of-operations.
import Control.Monad import Data.List import qualified Data.Binary as Bin import qualified Data.ByteString.Lazy as B import qualified Data.ByteString as P (hGetNonBlocking, null) import Data.ByteString.Base (LazyByteString(LPS)) import Foreign.Storable (sizeOf) import System.IO (openFile, hClose, hSeek, hTell, hIsEOF, hWaitForInput, Handle, IOMode(ReadMode, WriteMode), SeekMode(AbsoluteSeek)) import System.IO.Unsafe (unsafeInterleaveIO)
import qualified Data.Edison.Seq.ListSeq as LS import qualified Data.Edison.Coll.SplayHeap as Splay
Conceptually, we sort a list in blocks, spool blocks to disk, then merge back. However for IO performance it is better to read off chunks of elements off the sorted blocks from disk instead of elements-at-a-time.
It would be better if these were in KBytes instead of # of elements.
blocksize :: Int blocksize = 10000
Turn a list into a list of chunks.
slice :: Int -> [a] -> [[a]] slice _ [] = [] slice size l = (take size l) : (slice size $ drop size l)
That's unnecessary parenthesis, and I'd probably use splitAt here: myslice :: Int -> [a] -> [[a]] myslice _ [] = [] myslice n xs = a : myslice n b where (a,b) = splitAt n xs And just to check: *M> :m + Test.QuickCheck *M Test.QuickCheck> quickCheck (\n (xs :: [Int]) -> n > 0 ==> slice n xs == myslice n xs) OK, passed 100 tests.
Turn a list into a list of blocks, each of which is sorted.
blockify :: (Ord a) => Int -> [a] -> [[a]] blockify bsize l = map sort $ slice bsize l
Possibly you could drop the 'l' parameter: blockify n = map sort . slice n
Serialize a block, returning the (absolute) position of the start.
dumpBlock :: (Ord a, Bin.Binary a) => Handle -> [a] -> IO Integer dumpBlock h b = do start <- hTell h B.hPut h $ Bin.encode b return start
The actual sorting function. We blockify the list, turning it into a list of sorted blocks, and spool to disk, keeping track of offsets. We then read back the blocks (lazily!), and merge them.
externalSort [] = do return [] externalSort l = do h <- openFile "ExternalSort.bin" WriteMode idx <- mapM (\x -> dumpBlock h x) (blockify blocksize l)
idx <- mapM (dumpBlock h) (blockify blocksize l)
hClose h h <- openFile "ExternalSort.bin" ReadMode blocks <- mapM (\x -> do {bs <- hGetContentsWithCursor h x; return $ Bin.decode bs}) idx
Possibly forM idx $ \x -> decode `fmap` hGetContentsWithCursor h x
return (kMerge $ blocks)
Merging chunks. K-way merge (and in fact external sort in general) is detailed in Knuth, where he recommends tournament trees. The easiest thing is to probably use one of Okasaki's heaps. I'll use splay heaps, because I don't know any better.
It would be better if I changed Ord for blocks to only check the first element.
kMerge :: (Ord a) => [[a]] -> [a] kMerge [] = [] kMerge l = let h = Splay.fromSeq l in kM (Splay.minElem h) (Splay.deleteMin h) where kM :: (Ord a) => [a] -> Splay.Heap [a] -> [a] kM l h | h == Splay.empty = l | otherwise = let next = Splay.minElem h (f, b) = span (\x -> x <= head next) l in f ++ (kM next (if null b then Splay.deleteMin h else (Splay.insert b $ Splay.deleteMin h)))
kMergeSort :: (Ord a) => [a] -> [a] kMergeSort l = kMerge $ blockify blocksize l
This is a version of hGetContents which resets its handle position between reads, so is safe to use with interleaved handle seeking.
hGetContentsWithCursor :: Handle -> Integer -> IO B.ByteString hGetContentsWithCursor = hGetContentsWithCursorN defaultChunkSize
hGetContentsWithCursorN :: Int -> Handle -> Integer -> IO B.ByteString hGetContentsWithCursorN k h start = (lazyRead start) >>= return . LPS where lazyRead start = unsafeInterleaveIO $ loop start
loop start = do hSeek h AbsoluteSeek start ps <- P.hGetNonBlocking h k --TODO: I think this should distinguish EOF from no data available -- the otherlying POSIX call makes this distincion, returning either -- 0 or EAGAIN if P.null ps then do eof <- hIsEOF h if eof then return [] else hWaitForInput h (-1) >> (loop start) else do pos <- hTell h pss <- lazyRead pos return (ps : pss)
Very nice!
defaultChunkSize :: Int defaultChunkSize = 32 * k - overhead where k = 1024 overhead = 2 * sizeOf (undefined :: Int)
We'll export this value in bytestring 1.0. I like this code. Would you consider cabalising it, and uploading it to hackage.haskell.org, so we don't lose it? Perhaps just call it hsort or something? Cheers, Don

hi --
thanks for the useful comments! i will definitely go through them
carefully. unfortunately for this code (but fortunately for me) i
defend my dissertation on monday so i'm a little distracted right
now.....
i'm more than happy to donate this code or whatever improvements
happen to it. actually, hGetContentsWithCursor seems like a candidate
for inclusion with Data.ByteStrings or Data.Binary or something -- it
seems like it might find other uses. (i think you liked that bit of
code because i ripped it off of you guys! it's very short hamming
distance from the original.) anyhow, all that will have to wait a
couple weeks or so. also i've never cabalized anything so i may come
begging for help.
at some point i thought i saw how to do recursive external sort, to
keep memory usage truly constant, but with my current lack of sleep i
have lost that illusion. i'm also curious about the performance
characteristics of this vs Prelude sort vs the version using the
tournament mergesort apfelmus suggested. i need to find a computer
with a lot more RAM than my weakling laptop. finally, it would be
good to be able to have the blocksize controlled by Kb of RAM rather
than # of elements, not sure how to get that information.
ultimately this was part of my project to write lucene for haskell. i
think with this out of the way, plus all the Data.Binary / ByteString
goodness, it shouldn't take too long. keep writing good libraries for
me!
thanks and take care, Ben
On 7/17/07, Donald Bruce Stewart
midfield:
hi folks --
a haskell newbie here, searching for comments and wisdom on my code.
i had a project to try to implement "external sort" in haskell as a learning exercise. (external sort is sorting a list that is too large to fit in main memory, by sorting in chunks, spooling to disk, and then merging. more properly there probably should be multiple stages, but for simplicity i'm doing a one-stage external sort.)
the trick is the number of files can quickly grow very large, so it is best to use one large file and seek inside it around. however as one can imagine the order-of-IO-operations becomes a bit tricky, if you're seeking file handles around underneath Data.ByteString.Lazy's nose. but late this night after not thinking about it for a while i had a brainstorm: rewrite hGetContents to keep the handle position in the right place! it's all about judicious use of unsafeInterleaveIO.....
it seems to be rather fast, strangely faster than the usual "sort" at times. it also seems to have nice memory characteristics, though not perfect. it's hard to test because the normal "sort" function takes too much RAM on large lists, making my computer swap like mad.
I have to agree with Mr. Apfelmus here. This is lovely code. It is exactly what the ByteString team hoped people would be able to write ByteStrings: "Zen of Haskell" code, where you win by working at a high level, rather than a low level.
Thanks!
I've inserted some small comments though the source:
module ExternalSort where
Sort a list of Ords "offline." We're doing this to be able to sort things without taking up too much memory (for example sorting lists too large to fit in RAM.) Laziness is imperative, as is the order-of-operations.
import Control.Monad import Data.List import qualified Data.Binary as Bin import qualified Data.ByteString.Lazy as B import qualified Data.ByteString as P (hGetNonBlocking, null) import Data.ByteString.Base (LazyByteString(LPS)) import Foreign.Storable (sizeOf) import System.IO (openFile, hClose, hSeek, hTell, hIsEOF, hWaitForInput, Handle, IOMode(ReadMode, WriteMode), SeekMode(AbsoluteSeek)) import System.IO.Unsafe (unsafeInterleaveIO)
import qualified Data.Edison.Seq.ListSeq as LS import qualified Data.Edison.Coll.SplayHeap as Splay
Conceptually, we sort a list in blocks, spool blocks to disk, then merge back. However for IO performance it is better to read off chunks of elements off the sorted blocks from disk instead of elements-at-a-time.
It would be better if these were in KBytes instead of # of elements.
blocksize :: Int blocksize = 10000
Turn a list into a list of chunks.
slice :: Int -> [a] -> [[a]] slice _ [] = [] slice size l = (take size l) : (slice size $ drop size l)
That's unnecessary parenthesis, and I'd probably use splitAt here:
myslice :: Int -> [a] -> [[a]] myslice _ [] = [] myslice n xs = a : myslice n b where (a,b) = splitAt n xs
And just to check:
*M> :m + Test.QuickCheck *M Test.QuickCheck> quickCheck (\n (xs :: [Int]) -> n > 0 ==> slice n xs == myslice n xs) OK, passed 100 tests.
Turn a list into a list of blocks, each of which is sorted.
blockify :: (Ord a) => Int -> [a] -> [[a]] blockify bsize l = map sort $ slice bsize l
Possibly you could drop the 'l' parameter:
blockify n = map sort . slice n
Serialize a block, returning the (absolute) position of the start.
dumpBlock :: (Ord a, Bin.Binary a) => Handle -> [a] -> IO Integer dumpBlock h b = do start <- hTell h B.hPut h $ Bin.encode b return start
The actual sorting function. We blockify the list, turning it into a list of sorted blocks, and spool to disk, keeping track of offsets. We then read back the blocks (lazily!), and merge them.
externalSort [] = do return [] externalSort l = do h <- openFile "ExternalSort.bin" WriteMode idx <- mapM (\x -> dumpBlock h x) (blockify blocksize l)
idx <- mapM (dumpBlock h) (blockify blocksize l)
hClose h h <- openFile "ExternalSort.bin" ReadMode blocks <- mapM (\x -> do {bs <- hGetContentsWithCursor h x; return $ Bin.decode bs}) idx
Possibly
forM idx $ \x -> decode `fmap` hGetContentsWithCursor h x
return (kMerge $ blocks)
Merging chunks. K-way merge (and in fact external sort in general) is detailed in Knuth, where he recommends tournament trees. The easiest thing is to probably use one of Okasaki's heaps. I'll use splay heaps, because I don't know any better.
It would be better if I changed Ord for blocks to only check the first element.
kMerge :: (Ord a) => [[a]] -> [a] kMerge [] = [] kMerge l = let h = Splay.fromSeq l in kM (Splay.minElem h) (Splay.deleteMin h) where kM :: (Ord a) => [a] -> Splay.Heap [a] -> [a] kM l h | h == Splay.empty = l | otherwise = let next = Splay.minElem h (f, b) = span (\x -> x <= head next) l in f ++ (kM next (if null b then Splay.deleteMin h else (Splay.insert b $ Splay.deleteMin h)))
kMergeSort :: (Ord a) => [a] -> [a] kMergeSort l = kMerge $ blockify blocksize l
This is a version of hGetContents which resets its handle position between reads, so is safe to use with interleaved handle seeking.
hGetContentsWithCursor :: Handle -> Integer -> IO B.ByteString hGetContentsWithCursor = hGetContentsWithCursorN defaultChunkSize
hGetContentsWithCursorN :: Int -> Handle -> Integer -> IO B.ByteString hGetContentsWithCursorN k h start = (lazyRead start) >>= return . LPS where lazyRead start = unsafeInterleaveIO $ loop start
loop start = do hSeek h AbsoluteSeek start ps <- P.hGetNonBlocking h k --TODO: I think this should distinguish EOF from no data available -- the otherlying POSIX call makes this distincion, returning either -- 0 or EAGAIN if P.null ps then do eof <- hIsEOF h if eof then return [] else hWaitForInput h (-1) >> (loop start) else do pos <- hTell h pss <- lazyRead pos return (ps : pss)
Very nice!
defaultChunkSize :: Int defaultChunkSize = 32 * k - overhead where k = 1024 overhead = 2 * sizeOf (undefined :: Int)
We'll export this value in bytestring 1.0.
I like this code. Would you consider cabalising it, and uploading it to hackage.haskell.org, so we don't lose it? Perhaps just call it hsort or something?
Cheers, Don

midfield:
hi --
thanks for the useful comments! i will definitely go through them carefully. unfortunately for this code (but fortunately for me) i defend my dissertation on monday so i'm a little distracted right now.....
i'm more than happy to donate this code or whatever improvements happen to it. actually, hGetContentsWithCursor seems like a candidate for inclusion with Data.ByteStrings or Data.Binary or something -- it seems like it might find other uses. (i think you liked that bit of code because i ripped it off of you guys! it's very short hamming
Can't fault that style ;)
distance from the original.) anyhow, all that will have to wait a couple weeks or so. also i've never cabalized anything so i may come begging for help.
We have a tutorial for that, luckily: http://haskell.org/haskellwiki/How_to_write_a_Haskell_program And a tool to automate it, mkcabal, so should be fairly straightforward.
at some point i thought i saw how to do recursive external sort, to keep memory usage truly constant, but with my current lack of sleep i have lost that illusion. i'm also curious about the performance characteristics of this vs Prelude sort vs the version using the tournament mergesort apfelmus suggested. i need to find a computer with a lot more RAM than my weakling laptop. finally, it would be good to be able to have the blocksize controlled by Kb of RAM rather than # of elements, not sure how to get that information.
ultimately this was part of my project to write lucene for haskell. i think with this out of the way, plus all the Data.Binary / ByteString goodness, it shouldn't take too long. keep writing good libraries for me!
Great. Good to see things working. -- Don

I kinda-sorta half-cabalized it at
darcs get http://darcsdump.dreamhosters.com/external-sort (untested
via cabal install but mostly done)
As soon as my project gets approved I'll put it up on hackage.
If Ben wants it under his account at hackage of course I'll defer to him.
Thomas.
2007/7/18 Donald Bruce Stewart
midfield:
hi --
thanks for the useful comments! i will definitely go through them carefully. unfortunately for this code (but fortunately for me) i defend my dissertation on monday so i'm a little distracted right now.....
i'm more than happy to donate this code or whatever improvements happen to it. actually, hGetContentsWithCursor seems like a candidate for inclusion with Data.ByteStrings or Data.Binary or something -- it seems like it might find other uses. (i think you liked that bit of code because i ripped it off of you guys! it's very short hamming
Can't fault that style ;)
distance from the original.) anyhow, all that will have to wait a couple weeks or so. also i've never cabalized anything so i may come begging for help.
We have a tutorial for that, luckily:
http://haskell.org/haskellwiki/How_to_write_a_Haskell_program
And a tool to automate it, mkcabal, so should be fairly straightforward.
at some point i thought i saw how to do recursive external sort, to keep memory usage truly constant, but with my current lack of sleep i have lost that illusion. i'm also curious about the performance characteristics of this vs Prelude sort vs the version using the tournament mergesort apfelmus suggested. i need to find a computer with a lot more RAM than my weakling laptop. finally, it would be good to be able to have the blocksize controlled by Kb of RAM rather than # of elements, not sure how to get that information.
ultimately this was part of my project to write lucene for haskell. i think with this out of the way, plus all the Data.Binary / ByteString goodness, it shouldn't take too long. keep writing good libraries for me!
Great. Good to see things working.
-- Don _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

On Tue, Jul 17, 2007 at 04:16:29AM -0500, Ben wrote:
hi folks --
a haskell newbie here, searching for comments and wisdom on my code.
i had a project to try to implement "external sort" in haskell as a learning exercise. (external sort is sorting a list that is too large to fit in main memory, by sorting in chunks, spooling to disk, and then merging. more properly there probably should be multiple stages, but for simplicity i'm doing a one-stage external sort.)
If you have practical (as opposed to pedagogical) reasons for this, you should try using 'sort' from GNU coreutils (standard sort on linuxen, shouldn't be terribly hard to make work on other systems if you know C). The inner loop is coded with pretty bad constant factors, but it *is* an external sort, and has happily sorted 20GB files on my (300MB core) system; only taking an hour. Stefan
participants (5)
-
apfelmus
-
Ben
-
dons@cse.unsw.edu.au
-
Stefan O'Rear
-
Thomas Hartman