
David Roundy wrote:
On Fri, Jun 22, 2007 at 09:26:40PM +0100, Andrew Coppin wrote:
OK, so I *started* writing a request for help, but... well I answered my own question! See the bottom...
....
module BWT3 (bwt) where

import Data.List
import qualified Data.ByteString as Raw

rotate :: Int -> Raw.ByteString -> Int -> Raw.ByteString
rotate l xs n = (Raw.drop (l-n) xs) `Raw.append` (Raw.take (l-n) xs)

bwt xs =
  let
    l  = Raw.length xs
    ys = rotate l xs
  in
    Raw.pack $
    map (Raw.last . rotate l xs) $
    sortBy (\n m -> compare (ys n) (ys m)) [0..(l-1)]
The trouble is that Raw.append is an O(N) operation, making the computation O(N^2) where it ought to be O(N log N).
Well, it actually takes O(N^2 log N) time: N log N comparisons, each taking O(N) time. With O(1) append, each comparison still has a worst-case bound of O(N) (take the degenerate input (repeat 'a')), but it will probably be much faster in practice. The log N factor can be squeezed to a constant by using a simple trie, but to be even faster, you need a clever suffix tree/array. IIRC, O(N) is the best worst-case bound for the Burrows-Wheeler transform, although such algorithms are said to be not so good in practice. Note that one usually adds an "end of string" character $ in the Burrows-Wheeler transform for compression, so that sorting rotated strings becomes sorting suffixes.

Concerning the algorithm at hand, you can clearly avoid calculating Raw.append over and over:

  bwt :: Raw.ByteString -> Raw.ByteString
  bwt xs = Raw.pack . map Raw.last . sort $ rotations
    where
      n         = Raw.length xs
      rotations = take n . map (Raw.take n) . Raw.tails $ xs `Raw.append` xs

assuming that Raw.take n is O(1).
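For comparison, here is a rough sketch of the same "append once" idea grafted onto the original index-based sortBy version instead; the helper names doubled, rotation and lastOf are only for illustration, they don't appear in the thread:

  import Data.List (sortBy)
  import Data.Ord (comparing)
  import qualified Data.ByteString as Raw

  bwt :: Raw.ByteString -> Raw.ByteString
  bwt xs = Raw.pack . map lastOf . sortBy (comparing rotation) $ [0 .. n-1]
    where
      n          = Raw.length xs
      doubled    = xs `Raw.append` xs              -- computed once, up front
      rotation i = Raw.take n (Raw.drop i doubled) -- rotation starting at i, O(1) slice
      lastOf i   = Raw.index doubled (i + n - 1)   -- last byte of rotation i

Each comparison may still inspect up to N bytes, but no new ByteStrings are copied per comparison, because Raw.take and Raw.drop on a strict ByteString are O(1) slices of the doubled string.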
Woah... What the hell? I just switched to Data.ByteString.Lazy and WHAM! Vast speed increases... Jeepers, I can transform 52 KB so fast I can't even get to Task Manager fast enough to *check* the RAM usage! Blimey...
OK, just tried the 145 KB test file that Mr C++ used. That took 2 seconds + 43 MB RAM. Ouch.
In this case append is an O(1) operation. But you're still getting killed on prefactors, because you're generating a list of size N and then sorting it. Lists are just not nice data structures to sort, nor are they nice to have for large N.
To get better speed and memory use, I think you'd want to avoid the intermediate list in favor of some sort of strict array, but that'd be ugly.
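A rough sketch of what that might look like, assuming the vector and vector-algorithms packages for an unboxed array of rotation start indices and an in-place sort (neither package is mentioned in the thread; this is just one way to spell the idea):

  import Data.Ord (comparing)
  import qualified Data.ByteString as Raw
  import qualified Data.Vector.Unboxed as V
  import qualified Data.Vector.Algorithms.Intro as Intro

  bwt :: Raw.ByteString -> Raw.ByteString
  bwt xs = Raw.pack . map lastOf . V.toList $ sortedStarts
    where
      n            = Raw.length xs
      doubled      = xs `Raw.append` xs
      rotation i   = Raw.take n (Raw.drop i doubled)  -- O(1) slice of the doubled string
      lastOf i     = Raw.index doubled (i + n - 1)    -- last byte of rotation i
      -- unboxed vector of start indices, sorted in place by comparing rotations
      sortedStarts = V.modify (Intro.sortBy (comparing rotation)) (V.enumFromN 0 n)

The intermediate structure is then a strict unboxed vector of Ints rather than a Haskell list of rotations, although the final map back to a list before Raw.pack remains.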
Eh? You can't really avoid the list for sorting, in the sense that before trying an array/in-place sort, you'd better use a trie.

Regards,
apfelmus
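The trie isn't spelled out in the thread; as a rough illustration of the idea for plain Strings (the names and representation here are only one possible choice), sorting via a simple trie could look like:

  import qualified Data.Map as Map
  import Data.Map (Map)

  -- each node counts how many inserted strings end here, plus children keyed by next character
  data Trie = Trie Int (Map Char Trie)

  emptyT :: Trie
  emptyT = Trie 0 Map.empty

  insertT :: String -> Trie -> Trie
  insertT []     (Trie c m) = Trie (c + 1) m
  insertT (x:xs) (Trie c m) =
      Trie c (Map.insertWith (\_new old -> insertT xs old) x (insertT xs emptyT) m)

  -- an in-order traversal yields the inserted strings in sorted order
  toListT :: Trie -> [String]
  toListT (Trie c m) =
      replicate c "" ++ [ x : rest | (x, t) <- Map.toAscList m, rest <- toListT t ]

  sortViaTrie :: [String] -> [String]
  sortViaTrie = toListT . foldr insertT emptyT

Sorting the rotations with sortViaTrie instead of sort examines each character of each rotation only a bounded number of times (plus the Map lookup over the alphabet), so the log N factor from pairwise comparisons disappears; adapting it to ByteStrings would mean branching on Word8 instead of Char.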