Fast sorting with Bytestring

Hi, I played a bit around with the nice bytestring package. At some point I implemented a simple sorting program, because I needed line-sorting a file with a custom line-compare function. I was a bit surprised, that the resulting code is very fast. A lot of faster than sorting via GNU sort (with the standard line-compare function). Then I got suspicious and tested standard GNU sort against a trivial standard sort implementation in Haskell using bytestring. The Haskell implementation looks like:
module Main where
import qualified Data.ByteString.Lazy.Char8 as B import Data.List
main = do c <- B.getContents let l = B.lines c let r = sort l B.putStr $ B.unlines r
Sorting a file of '[]'-strings I get: time ./sort < brackets > /dev/null # ~ 0.01 s time sort < brackets > /dev/null # ~ 14 s Ok, strane ... Well, let's test with some 'normal' text: time ./sort < bible > /dev/null # ~ 0.4 s time sort < bible > /dev/null # ~ 0.56 s Ok, not that different. But with Haskell you often expect to get very slow code compared to an implementation in C. And I am surprised, that the Haskell is fast _and_ nice to read - because for example the ultra fast 'wc -l' Haskell implementation from the Haskell-Wiki uses some insider-knowledge about bytestring, and looks to a beginner not that intuitive, I guess. ./sort is the shown Haskell implementation. I used ghc 6.8.2, installed bytestring 0.9.1.0 as a user-local package (don't now I this superseeds the global bytestring package), compiled via ghc -O2. The tests are run at a Pentium M 1.3 GHz computer. As GNU sort I tested the version from coreutils 6.9. You can get the test files from: http://www.techfak.uni-bielefeld.de/~gsauthof/stuff/brackets.gz http://www.techfak.uni-bielefeld.de/~gsauthof/stuff/bible.gz (obviously, you have to gunzip them after downloading ...) Of course, the naive Haskell implementation doesn't care about locales (i.e. collating locale sequences), but this shouldn't explain the observed differences sorting the brackets file. Best regards Georg Sauthoff -- Fortune : 'Real programmers don't comment their code. It was hard to write, it should be hard to understand.' ;)

On Wed, Jun 18, 2008 at 08:19:10PM +0200, Georg Sauthoff wrote:
Hi,
I played a bit around with the nice bytestring package. At some point I implemented a simple sorting program, because I needed line-sorting a file with a custom line-compare function. I was a bit surprised, that the resulting code is very fast. A lot of faster than sorting via GNU sort (with the standard line-compare function).
Then I got suspicious and tested standard GNU sort against a trivial standard sort implementation in Haskell using bytestring. The Haskell implementation looks like:
[snip]
Ok, strane ... Well, let's test with some 'normal' text:
time ./sort < bible > /dev/null # ~ 0.4 s time sort < bible > /dev/null # ~ 0.56 s
Ok, not that different. But with Haskell you often expect to get very slow code compared to an implementation in C.
And I am surprised, that the Haskell is fast _and_ nice to read - because for example the ultra fast 'wc -l' Haskell implementation from the Haskell-Wiki uses some insider-knowledge about bytestring, and looks to a beginner not that intuitive, I guess.
./sort is the shown Haskell implementation. I used ghc 6.8.2, installed bytestring 0.9.1.0 as a user-local package (don't now I this superseeds the global bytestring package), compiled via ghc -O2. The tests are run at a Pentium M 1.3 GHz computer. As GNU sort I tested the version from coreutils 6.9.
You can get the test files from: http://www.techfak.uni-bielefeld.de/~gsauthof/stuff/brackets.gz http://www.techfak.uni-bielefeld.de/~gsauthof/stuff/bible.gz (obviously, you have to gunzip them after downloading ...)
Of course, the naive Haskell implementation doesn't care about locales (i.e. collating locale sequences), but this shouldn't explain the observed differences sorting the brackets file.
GNU 'sort' uses an external sort algorithm. You can, with 200M of memory, give it a 50G input file, and it will work. This might explain the difference.. Stefan

On Wed, Jun 18, 2008 at 11:23:00AM -0700, Stefan O'Rear wrote:
GNU 'sort' uses an external sort algorithm. You can, with 200M of memory, give it a 50G input file, and it will work. This might explain the difference..
But the input files are both < 10 mb ... If I create a 'big_bible' file like 'cat bible bible | dd of=big_bible count=1 bs=SIZE_OF_BRACKETS' then the timings are: time sort < big_bible > /dev/null # ~ 1.2 s time ./sort < big_bible > /dev/null # ~ 0.8 s I.e. the difference here is like a low constant factor, and not like a factor of 1000 with the brackets. Best regards Georg Sauthoff -- Fortune : 'Real programmers don't comment their code. It was hard to write, it should be hard to understand.' ;)

Stefan O'Rear
Ok, strane ... Well, let's test with some 'normal' text:
time ./sort < bible > /dev/null # ~ 0.4 s time sort < bible > /dev/null # ~ 0.56 s
Ok, not that different. But with Haskell you often expect to get very slow code compared to an implementation in C.
GNU 'sort' uses an external sort algorithm. You can, with 200M of memory, give it a 50G input file, and it will work. This might explain the difference..
GNU sort also handles locale-based comparison, which I suspect explains more of the difference. Try with LC_ALL=C. -k -- If I haven't seen further, it is by standing in the footprints of giants

On Wed, Jun 18, 2008 at 08:56:43PM +0200, Ketil Malde wrote:
Stefan O'Rear
writes: GNU 'sort' uses an external sort algorithm. You can, with 200M of memory, give it a 50G input file, and it will work. This might explain the difference..
GNU sort also handles locale-based comparison, which I suspect explains more of the difference. Try with LC_ALL=C.
Well, for input 'bible' with locale en_US.utf8 GNU sort is ~ 6 times slower than with locale C. But for input 'brackets' it is ~ 433 times slower ... Best regards Georg Sauthoff -- Fortune : 'Real programmers don't comment their code. It was hard to write, it should be hard to understand.' ;)
participants (3)
-
gsauthof@TechFak.Uni-Bielefeld.DE
-
Ketil Malde
-
Stefan O'Rear