Friday Afternoon Play: The Burrows Wheeler Transformation

So, it's Friday afternoon (well, here anyway), and presumably everyone is thinking of leaving for the weekend. Is anyone interested in playing a little game? I'm going to make the first move, and then just wait and see if anyone makes the next. I'm certainly not promising to make any further moves myself, but I'm not promising I won't, either. Below you'll see a naïve implementation of the Burrows Wheeler transformation and its inverse. That's my move. (And don't complain too much about style and such -- I'm half asleep.) Further moves are made by transforming the programme so that it becomes more efficient. The first target is not raw speed, but to adjust it until the computational complexity gets to be what it should be. I'm hoping for elegant moves here. After that, folk might start replacing the lists with more efficient datastructures and so on, but still /elegantly/. The object of the game isn't ultimate speed -- I've no particular wish to see a Haskell version that's as fast as a good C version, especially not if it ends up /looking/ like a C version. Really the objective is to think about Haskell programming -- and probably come up with some observations such as "there really ought to be a class of indexable structures, so that we don't have to change (!!) to (!) all over the place". And to have fun, anyhow... * * * Demonstration of the principle of the Burrows Wheeler transform. The implementations are simplistic (and have worse complexity than is possible).
module Burrows_Wheeler_Transform where import List
This is surely the wrong module to have to find (><) from:
import Data.Graph.Inductive.Query.Monad((><))
The forward transformation. Input a list, output the transformed list and possibly an int (Nothing if there were no elements in the input -- perhaps the result type should be Maybe (NonEmptyList t,Int))
slow_bwt:: Ord t => [t] -> ([t], Maybe Int)
This version works by computing a list of all the rotations of the input and sorting it. The transformation is then the last element of each line of the sorted rotations. Tag the data with an index so that it's possible to find where the original form of the unput ends up after sorting. This is essentially the specification of the transform coded up directly.
slow_bwt l = (map fst tagged_bwt, index_of_original) where tagged_bwt = map (last>
that looks like worst case n²×log n to me (where n is the length of the string): if every element is the same, the n×log n comparisons the sort does will each look at every element, and that's really more than is necessary. The inverse transform. We should have that @ slow_inv_bwt . slow_bwt == id @ Observe that sorting the input gives the first column of the sorted rotations (and the input is the last). So after one sort we have X ... c Y ... b Z ... a . . . . . . . . . Since they are rotations, rotating each row of this gives pairs that belong to the original list: cX ... by ... aZ ... . . . . . . sorting these gives us the first two columns of the sorted rotations, and again we already know the last column, which can be rotated into the first and so on.
slow_inv_bwt ([],_) = [] slow_inv_bwt (l,Just index_of_original) = iterate catsort (replicate len [])
After (length input) iterations we have the original sorted rotations,
!! len
and we have the index of where the untransformed string is, so we can just pick it back out.
!! index_of_original
where catsort s = sort $ map (uncurry (:)) $ l `zip`s len = length l
This version does rather fewer sorts:
mark2_inv_bwt ([],_) = [] mark2_inv_bwt (l,Just n) = (transpose $ take (length l) $ iterate (permuteWith perm) l)!!(perm!!n) where perm = map snd $ sort $ l `zip` [0..]
Should I simply have left it out and hoped that it would turn up after a few moves? I'm not going to explain it, anyway.
rotations l = init $ zipWith (++) (tails l) (inits l)
permuteWith ns l = map (l!!) ns
-- Jón Fairbairn Jon.Fairbairn@cl.cam.ac.uk

Hello Jón, sadly, nobody wants to play with you. Most likely, it's because the next move is too hard: for a faster forward transform, you need (something as tricky as) suffix trees. See also Robert Giegerich, Stefan Kurtz: "A Comparison of Imperative and Purely Functional Suffix Tree Constructions" http://citeseer.ist.psu.edu/giegerich95comparison.html Regards, apfelmus Jón Fairbairn wrote:
So, it's Friday afternoon (well, here anyway), and presumably everyone is thinking of leaving for the weekend. Is anyone interested in playing a little game? I'm going to make the first move, and then just wait and see if anyone makes the next. I'm certainly not promising to make any further moves myself, but I'm not promising I won't, either.
Below you'll see a naïve implementation of the Burrows Wheeler transformation and its inverse. That's my move. (And don't complain too much about style and such -- I'm half asleep.) Further moves are made by transforming the programme so that it becomes more efficient. The first target is not raw speed, but to adjust it until the computational complexity gets to be what it should be. I'm hoping for elegant moves here. After that, folk might start replacing the lists with more efficient datastructures and so on, but still /elegantly/.
The object of the game isn't ultimate speed -- I've no particular wish to see a Haskell version that's as fast as a good C version, especially not if it ends up /looking/ like a C version. Really the objective is to think about Haskell programming -- and probably come up with some observations such as "there really ought to be a class of indexable structures, so that we don't have to change (!!) to (!) all over the place". And to have fun, anyhow...
* * *
Demonstration of the principle of the Burrows Wheeler transform. The implementations are simplistic (and have worse complexity than is possible).
module Burrows_Wheeler_Transform where import List
This is surely the wrong module to have to find (><) from:
import Data.Graph.Inductive.Query.Monad((><))
The forward transformation. Input a list, output the transformed list and possibly an int (Nothing if there were no elements in the input -- perhaps the result type should be Maybe (NonEmptyList t,Int))
slow_bwt:: Ord t => [t] -> ([t], Maybe Int)
This version works by computing a list of all the rotations of the input and sorting it. The transformation is then the last element of each line of the sorted rotations. Tag the data with an index so that it's possible to find where the original form of the unput ends up after sorting. This is essentially the specification of the transform coded up directly.
slow_bwt l = (map fst tagged_bwt, index_of_original) where tagged_bwt = map (last>
that looks like worst case n²×log n to me (where n is the length of the string): if every element is the same, the n×log n comparisons the sort does will each look at every element, and that's really more than is necessary.
The inverse transform. We should have that
@ slow_inv_bwt . slow_bwt == id @
Observe that sorting the input gives the first column of the sorted rotations (and the input is the last). So after one sort we have
X ... c Y ... b Z ... a .. . . .. . . .. . .
Since they are rotations, rotating each row of this gives pairs that belong to the original list:
cX ... by ... aZ ... .. . .. . .. .
sorting these gives us the first two columns of the sorted rotations, and again we already know the last column, which can be rotated into the first and so on.
slow_inv_bwt ([],_) = [] slow_inv_bwt (l,Just index_of_original) = iterate catsort (replicate len [])
After (length input) iterations we have the original sorted rotations,
!! len
and we have the index of where the untransformed string is, so we can just pick it back out.
!! index_of_original
where catsort s = sort $ map (uncurry (:)) $ l `zip`s len = length l
This version does rather fewer sorts:
mark2_inv_bwt ([],_) = [] mark2_inv_bwt (l,Just n) = (transpose $ take (length l) $ iterate (permuteWith perm) l)!!(perm!!n) where perm = map snd $ sort $ l `zip` [0..]
Should I simply have left it out and hoped that it would turn up after a few moves? I'm not going to explain it, anyway.
rotations l = init $ zipWith (++) (tails l) (inits l)
permuteWith ns l = map (l!!) ns

apfelmus@quantentunnel.de wrote:
sadly, nobody wants to play with you. Okay, here's a small possible contribution. Most likely, it's because the next move is too hard: for a faster forward transform, you need (something as tricky as) suffix trees.
rotations l = init $ zipWith (++) (tails l) (inits l)
I haven't done the benchmarks, but I'm a bit disturbed by the number of
To make it fast, I guess you'll need a suffix array. (And in a sense,
that is what we're making when we sort the list of rotations, but
of course the time complexity is way worse than the linear time
SA algorithms.)
list traversals. It seems to me that instead of sorting the rotations and
taking the last element, it would be more efficient to sort the
rotations by the tails,
and taking the head? Especially if the string contains little
redundancy, the
sorting wouldn't need to traverse very deeply. Here's a (rather ugly)
attempt,
which seems to work:
\begin{code}
slow_bwt2:: Ord t => [t] -> ([t], Maybe Int)
slow_bwt2 l
= (map fst tagged_bwt, index_of_original)
where tagged_bwt = map (head>

On Tue, Nov 21, 2006 at 09:53:09AM +0100, Ketil Malde wrote:
sortBy ((. tail . fst) . compare . tail . fst)
this would actually be a nice thing to have in the standard libraries: sortOn :: (Ord b) => (a -> b) -> [a] -> [a] sortOn f = sortBy $ (.f) . compare . f kibitzing (: matthias

Matthias Fischmann wrote:
sortBy ((. tail . fst) . compare . tail . fst)
this would actually be a nice thing to have in the standard libraries:
sortOn :: (Ord b) => (a -> b) -> [a] -> [a] sortOn f = sortBy $ (.f) . compare . f I think the recommended way is to define
on f g = \x y -> f (g x) (g y) That way, you can do sortBy (compare `on` snd) [(1,2),(2,1)] => [(2,1),(1,2)] -k

On Tue, 21 Nov 2006, Matthias Fischmann wrote:
On Tue, Nov 21, 2006 at 09:53:09AM +0100, Ketil Malde wrote:
sortBy ((. tail . fst) . compare . tail . fst)
this would actually be a nice thing to have in the standard libraries:
sortOn :: (Ord b) => (a -> b) -> [a] -> [a] sortOn f = sortBy $ (.f) . compare . f
http://www.haskell.org/pipermail/libraries/2006-November/006169.html

Could we please only change the subject of threads when it's
necessary? It confuses my mail reader which then confuses me. Of
course it's not your fault I choose to use an easily confused mail
reader, but at the same time I don't understand the point of changing
the subject of the thread. And not just this thread; I see it happen
fairly often but this thread seems to have done it a lot in a short
period of time making it a good example.
Jason
On 11/21/06, Henning Thielemann
On Tue, 21 Nov 2006, Matthias Fischmann wrote:
On Tue, Nov 21, 2006 at 09:53:09AM +0100, Ketil Malde wrote:
sortBy ((. tail . fst) . compare . tail . fst)
this would actually be a nice thing to have in the standard libraries:
sortOn :: (Ord b) => (a -> b) -> [a] -> [a] sortOn f = sortBy $ (.f) . compare . f
http://www.haskell.org/pipermail/libraries/2006-November/006169.html _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
participants (6)
-
apfelmus@quantentunnel.de
-
Henning Thielemann
-
Jason Dagit
-
Jón Fairbairn
-
Ketil Malde
-
Matthias Fischmann