Fast sorting with Bytestring

18 Jun 2008

      Hi,

I played a bit around with the nice bytestring package. At some point I
implemented a simple sorting program, because I needed line-sorting a file
with a custom line-compare function. I was a bit surprised, that the
resulting code is very fast. A lot of faster than sorting via GNU sort
(with the standard line-compare function).

Then I got suspicious and tested standard GNU sort against a trivial
standard sort implementation in Haskell using bytestring. The Haskell
implementation looks like:
...
module Main
where
...
import qualified Data.ByteString.Lazy.Char8 as B
import Data.List
...
main = do
         c <- B.getContents
         let l = B.lines c
         let r = sort l
         B.putStr $ B.unlines r
Sorting a file of '[]'-strings I get:

time ./sort < brackets > /dev/null  # ~  0.01 s
time sort < brackets > /dev/null    # ~ 14    s

Ok, strane ... Well, let's test with some 'normal' text:

time ./sort < bible > /dev/null       # ~  0.4   s
time sort < bible > /dev/null         # ~  0.56  s

Ok, not that different. But with Haskell you often expect to get very
slow code compared to an implementation in C.

And I am surprised, that the Haskell is fast _and_ nice to read - because
for example the ultra fast 'wc -l' Haskell implementation from the
Haskell-Wiki uses some insider-knowledge about bytestring, and looks to
a beginner not that intuitive, I guess.

./sort is the shown Haskell implementation. I used ghc 6.8.2, installed
bytestring 0.9.1.0 as a user-local package (don't now I this superseeds
the global bytestring package), compiled via ghc -O2. The tests are run at a
Pentium M 1.3 GHz computer. As GNU sort I tested the version from
coreutils 6.9.

You can get the test files from:
http://www.techfak.uni-bielefeld.de/~gsauthof/stuff/brackets.gz
http://www.techfak.uni-bielefeld.de/~gsauthof/stuff/bible.gz
(obviously, you have to gunzip them after downloading ...)

Of course, the naive Haskell implementation doesn't care about locales
(i.e. collating locale sequences), but this shouldn't explain the
observed differences sorting the brackets file.

Best regards
Georg Sauthoff
-- 
Fortune : 'Real programmers don't comment their code.
           It was hard to write, it should be hard to understand.' ;)

gsauthof＠TechFak.Uni-Bielefeld.DE

Stefan O'Rear

gsauthof＠TechFak.Uni-Bielefeld.DE

Ketil Malde

gsauthof＠TechFak.Uni-Bielefeld.DE

tags

participants (3)