Need for speed: the Burrows-Wheeler Transform

OK, so I *started* writing a request for help, but... well I answered my own question! See the bottom... I was reading Wikipedia, and I found this: http://en.wikipedia.org/wiki/Burrows-Wheeler_transform I decided to sit down and see what that looks like in Haskell. I came up with this: module BWT where import Data.List rotate1 :: [x] -> [x] rotate1 [] = [] rotate1 xs = last xs : init xs rotate :: [x] -> [[x]] rotate xs = take (length xs) (iterate rotate1 xs) bwt :: (Ord x) => [x] -> [x] bwt = map last . sort . rotate step :: (Ord x) => [x] -> [[x]] -> [[x]] step xs = zipWith (:) xs . sort inv_bwt :: (Ord x) => x -> [x] -> [x] inv_bwt mark xs = head . filter (\xs -> head xs == mark) . head . drop ((length xs) - 1) . iterate (step xs) . map (\x -> [x]) $ xs My my, isn't that SO much shorter? I love Haskell! :-D Unfortunately, the resident C++ expert fails to grasp the concept of "example code", and insists on comparing the efficiency of this program to the C one on the website. Fact is, he's translated the presented C into C++, and it can apparently transform a 145 KB file in 8 seconds using only 3 MB of RAM. The code above, however, took about 11 seconds to transform 4 KB of text, and that required about 60 MB of RAM. (I tried larger, but the OS killed the process for comsuming too much RAM.) Well anyway, the code was written for simplicity, not efficiency. I've tried to explain this, but apparently that is beyond his comprehension. So anyway, it looks like we have a race on. >:-D The first thing I did was the optimisation mentioned on Wikipedia: you don't *need* to build a list of lists. You can just throw pointers around. So I arrived at this: module BWT2 (bwt) where import Data.List rotate :: Int -> [x] -> Int -> [x] rotate l xs n = (drop (l-n) xs) ++ (take (l-n) xs) bwt xs = let l = length xs ys = rotate l xs in map (last . rotate l xs) $ sortBy (\n m -> compare (ys n) (ys m)) [0..(l-1)] This is indeed *much* faster. With this, I can transform 52 KB of text in 9 minutes + 60 MB RAM. The previous version seemed to have quadratic memory usage, whereas this one seems to be linear. 52 KB would have taken many months with the first version! Still, 9 minutes (for a file 3 times smaller) is nowhere near 8 seconds. So we must try harder... For my next trick, ByteStrings! (Never used them before BTW... this is my first try!) module BWT3 (bwt) where import Data.List import qualified Data.ByteString as Raw rotate :: Int -> Raw.ByteString -> Int -> Raw.ByteString rotate l xs n = (Raw.drop (l-n) xs) `Raw.append` (Raw.take (l-n) xs) bwt xs = let l = Raw.length xs ys = rotate l xs in Raw.pack $ map (Raw.last . rotate l xs) $ sortBy (\n m -> compare (ys n) (ys m)) [0..(l-1)] Now I can transform 52 KB in 54 seconds + 30 MB RAM. Still nowhere near C++, but a big improvement none the less. Woah... What the hell? I just switched to Data.ByteString.Lazy and WHAM! Vast speed increases... Jeepers, I can transform 52 KB so fast I can't even get to Task Manager fast enough to *check* the RAM usage! Blimey... OK, just tried the 145 KB test file that Mr C++ used. That took 2 seconds + 43 MB RAM. Ouch.

On Fri, Jun 22, 2007 at 09:26:40PM +0100, Andrew Coppin wrote:
OK, so I *started* writing a request for help, but... well I answered my own question! See the bottom... ... Unfortunately, the resident C++ expert fails to grasp the concept of "example code", and insists on comparing the efficiency of this program to the C one on the website. ... Fact is, he's translated the presented C into C++, and it can apparently transform a 145 KB file in 8 seconds using only 3 MB of RAM. The code ... Woah... What the hell? I just switched to Data.ByteString.Lazy and WHAM! Vast speed increases... Jeepers, I can transform 52 KB so fast I can't even get to Task Manager fast enough to *check* the RAM usage! Blimey...
OK, just tried the 145 KB test file that Mr C++ used. That took 2 seconds + 43 MB RAM. Ouch.
Mr. C++ apparently isn't a very good C++ programmer, since his best effort absolutely *pales* in comparison to Julian Seward's BWT: stefan@stefans:/usr/local/src/hpaste$ head -c 135000 /usr/share/dict/words | (time bzip2 -vvv) > /dev/null (stdin): block 1: crc = 0x25a18961, combined CRC = 0x25a18961, size = 135000 0 work, 135000 block, ratio 0.00 135000 in block, 107256 after MTF & 1-2 coding, 61+2 syms in use initial group 6, [0 .. 0], has 20930 syms (19.5%) initial group 5, [1 .. 1], has 4949 syms ( 4.6%) initial group 4, [2 .. 2], has 20579 syms (19.2%) initial group 3, [3 .. 4], has 17301 syms (16.1%) initial group 2, [5 .. 10], has 24247 syms (22.6%) initial group 1, [11 .. 62], has 19250 syms (17.9%) pass 1: size is 127140, grp uses are 339 550 192 440 12 613 pass 2: size is 51693, grp uses are 321 440 288 316 139 642 pass 3: size is 51358, grp uses are 329 387 376 304 122 628 pass 4: size is 51302, grp uses are 298 421 397 304 125 601 bytes: mapping 21, selectors 433, code lengths 110, codes 51297 final combined CRC = 0x25a18961 2.602:1, 3.075 bits/byte, 61.57% saved, 135000 in, 51887 out. real 0m0.165s user 0m0.044s sys 0m0.012s Yup, does slightly more work (huffman coding) in 1/200 the time :) (Note, on my system .Lazy BWT3 takes 5.3s on the same input) Stefan

Stefan O'Rear wrote:
Mr. C++ apparently isn't a very good C++ programmer, since his best effort absolutely *pales* in comparison to Julian Seward's BWT:
stefan@stefans:/usr/local/src/hpaste$ head -c 135000 /usr/share/dict/words | (time bzip2 -vvv) > /dev/null (stdin): block 1: crc = 0x25a18961, combined CRC = 0x25a18961, size = 135000 0 work, 135000 block, ratio 0.00 135000 in block, 107256 after MTF & 1-2 coding, 61+2 syms in use initial group 6, [0 .. 0], has 20930 syms (19.5%) initial group 5, [1 .. 1], has 4949 syms ( 4.6%) initial group 4, [2 .. 2], has 20579 syms (19.2%) initial group 3, [3 .. 4], has 17301 syms (16.1%) initial group 2, [5 .. 10], has 24247 syms (22.6%) initial group 1, [11 .. 62], has 19250 syms (17.9%) pass 1: size is 127140, grp uses are 339 550 192 440 12 613 pass 2: size is 51693, grp uses are 321 440 288 316 139 642 pass 3: size is 51358, grp uses are 329 387 376 304 122 628 pass 4: size is 51302, grp uses are 298 421 397 304 125 601 bytes: mapping 21, selectors 433, code lengths 110, codes 51297 final combined CRC = 0x25a18961 2.602:1, 3.075 bits/byte, 61.57% saved, 135000 in, 51887 out.
real 0m0.165s user 0m0.044s sys 0m0.012s
Yup, does slightly more work (huffman coding) in 1/200 the time :)
(Note, on my system .Lazy BWT3 takes 5.3s on the same input)
...OK...so how do I make Haskell go faster still? Presumably by transforming the code into an ugly mess that nobody can read any more...?

...OK...so how do I make Haskell go faster still?
Presumably by transforming the code into an ugly mess that nobody can read any more...?

Hello Andrew, Saturday, June 23, 2007, 11:21:26 AM, you wrote:
...OK...so how do I make Haskell go faster still?
Presumably by transforming the code into an ugly mess that nobody can read any more...?
bwt transformation is very good researched area, so probably you will not get decent performance (megabytes per second) without lot of work. and of course, no Haskell at all. take look at http://darchiver.narod.ru/dark/Archon3fs.zip -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

Bulat Ziganshin wrote:
Hello Andrew,
bwt transformation is very good researched area, so probably you will not get decent performance (megabytes per second) without lot of work.
Hey, I'm just glad I managed to get within striking distance of Mr C++. So much for Haskell being "inherently less performant". :-P
and of course, no Haskell at all. take look at http://darchiver.narod.ru/dark/Archon3fs.zip
Mmm, ok...

Hello Andrew, Saturday, June 23, 2007, 2:45:01 PM, you wrote:
Hey, I'm just glad I managed to get within striking distance of Mr C++. So much for Haskell being "inherently less performant". :-P
my little analysis says that it's probably due to different sort() implementations, so this says nothing about general case -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

Bulat Ziganshin wrote:
Hello Andrew,
Saturday, June 23, 2007, 2:45:01 PM, you wrote:
Hey, I'm just glad I managed to get within striking distance of Mr C++. So much for Haskell being "inherently less performant". :-P
my little analysis says that it's probably due to different sort() implementations, so this says nothing about general case
The point being that lots of people look at Haskell and go "oh, that's very cute for writing trivial example code, but it can never be fast; for that you must use C or C++". Well, I altered the code, and it's *still* very short and very readable, and it's just as fast as the (3 pages long) C++ version. >:-D

On Saturday 23 June 2007 13:02:54 Andrew Coppin wrote:
Well, I altered the code, and it's *still* very short and very readable, and it's just as fast as the (3 pages long) C++ version. >:-D
Indeed. The performance of modern functional programming languages never ceases to amaze me. INRIA typically claim performance within 2x of C for OCaml but there are very few programs where OCaml isn't within 50% now, at least on AMDs. Most of the pedagogical examples of functional programming (e.g. n-queens) are too academic for general consumption but there are plenty of excellent examples. Burrows-Wheeler is a great one. Ray tracing is another. Sudoku solving is fairy good, albeit unusually well suited to arrays. I also like to compare the performance of term-level interpreters or rewriters implemented in different languages. I think it would be interesting to compare Scheme/Mathematica interpreters written in OCaml, SML (MLton) and Haskell, for example. I've benchmarked some interpreters in OCaml and SML (MLton) and MLton's whole-program optimizations are a huge benefit here, making it several times faster than OCaml. I'd like to know how well Haskell would do. -- Dr Jon D Harrop, Flying Frog Consultancy Ltd. The OCaml Journal http://www.ffconsultancy.com/products/ocaml_journal/?e

Hello Andrew, Saturday, June 23, 2007, 4:02:54 PM, you wrote:
The point being that lots of people look at Haskell and go "oh, that's very cute for writing trivial example code, but it can never be fast; for that you must use C or C++".
and that's true :) as i said, your C++ code is very far from perfect. by comparing with bad code you can find that even Logo is faster than asm -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

On Fri, Jun 22, 2007 at 09:26:40PM +0100, Andrew Coppin wrote:
OK, so I *started* writing a request for help, but... well I answered my own question! See the bottom...
...
module BWT3 (bwt) where
import Data.List import qualified Data.ByteString as Raw
rotate :: Int -> Raw.ByteString -> Int -> Raw.ByteString rotate l xs n = (Raw.drop (l-n) xs) `Raw.append` (Raw.take (l-n) xs)
bwt xs = let l = Raw.length xs ys = rotate l xs in Raw.pack $ map (Raw.last . rotate l xs) $ sortBy (\n m -> compare (ys n) (ys m)) [0..(l-1)]
Now I can transform 52 KB in 54 seconds + 30 MB RAM. Still nowhere near C++, but a big improvement none the less.
The trouble is that Raw.append is an O(N) operation, making the computation O(N^2) where it ought to be O(NlogN).
Woah... What the hell? I just switched to Data.ByteString.Lazy and WHAM! Vast speed increases... Jeepers, I can transform 52 KB so fast I can't even get to Task Manager fast enough to *check* the RAM usage! Blimey...
OK, just tried the 145 KB test file that Mr C++ used. That took 2 seconds + 43 MB RAM. Ouch.
In this case append is an O(1) operation. But you're still getting killed on prefactors, because you're generating a list of size N and then sorting it. Lists are just not nice data structures to sort, nor are they nice to have for large N. To get better speed and memory use, I think you'd want to avoid the intermediate list in favor of some sort of strict array, but that'd be ugly. -- David Roundy Department of Physics Oregon State University

David Roundy wrote:
On Fri, Jun 22, 2007 at 09:26:40PM +0100, Andrew Coppin wrote:
OK, so I *started* writing a request for help, but... well I answered my own question! See the bottom...
....
module BWT3 (bwt) where
import Data.List import qualified Data.ByteString as Raw
rotate :: Int -> Raw.ByteString -> Int -> Raw.ByteString rotate l xs n = (Raw.drop (l-n) xs) `Raw.append` (Raw.take (l-n) xs)
bwt xs = let l = Raw.length xs ys = rotate l xs in Raw.pack $ map (Raw.last . rotate l xs) $ sortBy (\n m -> compare (ys n) (ys m)) [0..(l-1)]
The trouble is that Raw.append is an O(N) operation, making the computation O(N^2) where it ought to be O(NlogN).
Well, it actually takes O(N^2 log N) time: N log N comparisons each taking O(N) time. With O(1) append, each comparison still has a worst-case bound of O(N) (take the degenerate input (repeat 'a')) but it will probably be much faster in practice. The log N factor can be squeezed to a constant by using a simple trie, but to be even faster, you need a clever suffix tree/array. IIRC, O(N) is the best worst case bound for the burrows wheeler transform although they're said to be not so good in practice. Note that the one usually adds an "end of string" character $ in the Burrows-Wheeler transform for compression such that sorting rotated strings becomes sorting suffices. Concerning the algorithm at hand, you can clearly avoid calculating Raw.append over and over: bwt :: Raw.ByteString -> Raw.ByteString bwt xs = Raw.pack . map (Raw.last) . sort $ rotations where n = length xs rotations = take n . map (take n) . tails $ xs `Raw.append` xs assuming that take n is O(1).
Woah... What the hell? I just switched to Data.ByteString.Lazy and WHAM! Vast speed increases... Jeepers, I can transform 52 KB so fast I can't even get to Task Manager fast enough to *check* the RAM usage! Blimey...
OK, just tried the 145 KB test file that Mr C++ used. That took 2 seconds + 43 MB RAM. Ouch.
In this case append is an O(1) operation. But you're still getting killed on prefactors, because you're generating a list of size N and then sorting it. Lists are just not nice data structures to sort, nor are they nice to have for large N.
To get better speed and memory use, I think you'd want to avoid the intermediate list in favor of some sort of strict array, but that'd be ugly.
Eh? You can't really avoid the list for sorting, in the sense that before trying an array/in-place sort, you'd better use a trie. Regards, apfelmus

apfelmus wrote:
Note that the one usually adds an "end of string" character $ in the Burrows-Wheeler transform for compression such that sorting rotated strings becomes sorting suffices.
Yeah, I noticed that the output from by program can never actually be reverted to its original form. ;-) Maybe it I alter the code to stick a 0-byte in there or something... (Hmm, I wonder how Mr C++ managed it? Heh. If I could read C++, maybe I'd know...)
Concerning the algorithm at hand, you can clearly avoid calculating Raw.append over and over:
bwt :: Raw.ByteString -> Raw.ByteString bwt xs = Raw.pack . map (Raw.last) . sort $ rotations where n = length xs rotations = take n . map (take n) . tails $ xs `Raw.append` xs
assuming that take n is O(1).
...which is amusing because that's what the *first* implementation did. ;-) I was trying to avoid O(n^2) RAM usage. :-}

Andrew Coppin wrote:
apfelmus wrote:
Note that the one usually adds an "end of string" character $ in the Burrows-Wheeler transform for compression such that sorting rotated strings becomes sorting suffices.
Yeah, I noticed that the output from by program can never actually be reverted to its original form.
Well it can, but that's a different story told in Richard S. Bird and Shin-Cheng Mu. Inverting the Burrows-Wheeler transform. http://web.comlab.ox.ac.uk/oucl/work/richard.bird/publications.html #DBLP:journals/jfp/BirdM04 Oh, and you had a function inv_bwt, right?
Concerning the algorithm at hand, you can clearly avoid calculating Raw.append over and over:
bwt :: Raw.ByteString -> Raw.ByteString bwt xs = Raw.pack . map (Raw.last) . sort $ rotations where n = length xs rotations = take n . map (take n) . tails $ xs `Raw.append` xs
assuming that take n is O(1).
I was trying to avoid O(n^2) RAM usage. :-}
Note that for ByteStrings, this takes only O(n) RAM because the substrings are shared. But for lists, this would take O(n^2) RAM since (take n) cannot share hole sublists. An O(n) choice for lists that doesn't recalculate ++ all the time would be bwt :: Ord a => [a] -> [a] bwt xs = map last . sortBy (compare `on` (take n)) $ rotations where n = length xs rotations = take n . tails $ xs ++ xs with the well-known on :: (a -> a -> c) -> (b -> a) -> (b -> b -> c) on g f x y = g (f x) (f y) Regards, apfelmus

apfelmus wrote:
Andrew Coppin wrote:
Yeah, I noticed that the output from by program can never actually be reverted to its original form.
Well it can, but that's a different story told in
Richard S. Bird and Shin-Cheng Mu. Inverting the Burrows-Wheeler transform. http://web.comlab.ox.ac.uk/oucl/work/richard.bird/publications.html #DBLP:journals/jfp/BirdM04
Oh, and you had a function inv_bwt, right?
Implementation #1 had an inverse - but it works by finding an unused character and prepending that to the input before doing the forward BWT. ;-) For the other implementations, there is no inverse at all.
I was trying to avoid O(n^2) RAM usage. :-}
Note that for ByteStrings, this takes only O(n) RAM because the substrings are shared. But for lists, this would take O(n^2) RAM since (take n) cannot share hole sublists.
Presumably that's only true for *strict* ByteStrings? (I still don't actually understand how lazy bytestrings are any different to normal lists - or why they produced such a massive performance increase...)
An O(n) choice for lists that doesn't recalculate ++ all the time would be
bwt :: Ord a => [a] -> [a] bwt xs = map last . sortBy (compare `on` (take n)) $ rotations where n = length xs rotations = take n . tails $ xs ++ xs
with the well-known
on :: (a -> a -> c) -> (b -> a) -> (b -> b -> c) on g f x y = g (f x) (f y)
Interesting... But how does it perform? ;-)

Andrew Coppin wrote:
apfelmus wrote:
Note that the one usually adds an "end of string" character $ in the Burrows-Wheeler transform for compression such that sorting rotated strings becomes sorting suffices.
Yeah, I noticed that the output from by program can never actually be reverted to its original form. ;-) Maybe it I alter the code to stick a 0-byte in there or something...
To recover the original string you either need to store the offset to the first character or add a sentinel character to mark the string end. One advantage of the sentinel character approach is that it's sufficient to sort just the suffixes of the text. A disadvantage is that you need an additional character. The code below reserves \0 for this purpose.
Code: >>>
"bwt" implements a variation of the Burrows-Wheeler transform, using \0 as a sentinel character for simplicity. The sentinel has to be smaller than all other characters in the string.
bwt xs = let suffixes = [(a,as) | a:as <- tails ('\0':xs)] in map fst . sortBy (\(_,a) (_,b) -> a `compare` b) $ suffixes
"rbwt" implements the corresponding inverse BWT. It's a fun knot tying exercise.
rbwt xs = let res = sortBy (\(a:as) (b:bs) -> a `compare` b) (zipWith' (:) xs res) in tail . map snd . zip xs $ head res
"zipWith'" is a variant of zipWith that asserts that the third argument has the same shape as the second one.
zipWith' f [] _ = [] zipWith' f (x:xs) ~(y:ys) = f x y : zipWith' f xs ys
<<< End Code <<< Both transforms use linear memory (but quite a lot, admittedly). regards, Bertram

I enjoy code like this that requires laziness. My modified version of your code is below... Bertram Felgenhauer wrote:
Code: >>>
"bwt" implements a variation of the Burrows-Wheeler transform, using \0 as a sentinel character for simplicity. The sentinel has to be smaller than all other characters in the string.
bwt xs = let suffixes = [(a,as) | a:as <- tails ('\0':xs)] in map fst . sortBy (\(_,a) (_,b) -> a `compare` b) $ suffixes
"rbwt" implements the corresponding inverse BWT. It's a fun knot tying exercise.
rbwt xs = let res = sortBy (\(a:as) (b:bs) -> a `compare` b) (zipWith' (:) xs res) in tail . map snd . zip xs $ head res
"zipWith'" is a variant of zipWith that asserts that the third argument has the same shape as the second one.
zipWith' f [] _ = [] zipWith' f (x:xs) ~(y:ys) = f x y : zipWith' f xs ys
<<< End Code <<
I did not like the look of (map snd . zip xs) since it looks like a no-op (that constructs a useless (,) which may or may not be elided by a sufficiently smart compiler). But it is using the fact that xs is finite and (head res) is not to do "take (length xs) $ head res" without the extra traversal and math. But one can abuse (zipWith' (flip const)) for a 'rbwt' appeals to me more:
import Data.List
f `on` g = \x y -> (g x) `f` (g y)
zipWith' f [] _ = [] zipWith' f (x:xs) ~(y:ys) = f x y : zipWith' f xs ys
bwt = map head . sortBy (compare `on` tail) . init . tails . ('\0':)
rbwt xs = tail $ zipWith' (flip const) xs $ head res where res = sortBy (compare `on` head) (zipWith' (:) xs res)
While I was removing (,) from the 'rbwt' I went ahead and removed it from 'bwt' as well. This was, of course, pointless... Thanks for the fun example, Chris

On 6/23/07, Bertram Felgenhauer
"rbwt" implements the corresponding inverse BWT. It's a fun knot tying exercise.
rbwt xs = let res = sortBy (\(a:as) (b:bs) -> a `compare` b) (zipWith' (:) xs res) in tail . map snd . zip xs $ head res
Indeed it's very fun =). Although I must admit that, even after staring at the code for some time, I still don't see how it works. Is it too much to ask for a brief explanation? I see that zipWith' creates all the lazy list without requiring any knowledge of the second list, which means that the list to be sorted is created nicely. But when sortBy needs to compare, say, the first with the second items, it forces the thunk of the lazy list. But that thunk depends on the sorted list itself! What am I missing? ;) Thanks in advance, -- Felipe.

Felipe Almeida Lessa wrote:
On 6/23/07, Bertram Felgenhauer
wrote: "rbwt" implements the corresponding inverse BWT. It's a fun knot tying exercise.
rbwt xs = let res = sortBy (\(a:as) (b:bs) -> a `compare` b) (zipWith' (:) xs res) in tail . map snd . zip xs $ head res
Indeed it's very fun =). Although I must admit that, even after staring at the code for some time, I still don't see how it works. Is it too much to ask for a brief explanation?
I see that zipWith' creates all the lazy list without requiring any knowledge of the second list, which means that the list to be sorted is created nicely. But when sortBy needs to compare, say, the first with the second items, it forces the thunk of the lazy list. But that thunk depends on the sorted list itself!
What am I missing? ;)
Thanks in advance,
I just read and understood it. I'll walk through my version:
import Data.List
f `on` g = \x y -> (g x) `f` (g y)
zipWith' f [] _ = [] zipWith' f (x:xs) ~(y:ys) = f x y : zipWith' f xs ys
bwt = map head . sortBy (compare `on` tail) . init . tails . ('\0':)
rbwt xs = tail $ zipWith' (flip const) xs $ head res where res = sortBy (compare `on` head) (zipWith' (:) xs res)
The key is that sort and sortBy are very very lazy. 'res' is a finite list of infinite strings. Consider sorting such a finite list of infinite lists without the recursive definition:
example = let a = [1,3..] b = [1,2..] c = [2,4..] d = [1,4..] sorted = sort [a,b,c,d] firstFiveOfEach = map (take 5) sorted in mapM_ print firstFiveOfEach
Which in ghci produces:
*Main> example [1,2,3,4,5] [1,3,5,7,9] [1,4,7,10,13] [2,4,6,8,10]
Now for a sidetrack:
example = let a = [1,3..] b = [1,2..] c = [2,4..] d = [1,4..] sorted = sort [a,b,c,d,c] firstFiveOfEach = map (take 5) sorted in mapM_ print firstFiveOfEach
Produces the first three elements before 'c' ....
*Main> example [1,2,3,4,5] [1,3,5,7,9] [1,4,7,10,13]
...and then hangs because "compare c c" never terminates. This lazy sorting is also why "head . sort $ foo" is an O(N) way to get the minimum of a list "foo" in Haskell. So the code for rbwt depends on this lazy sorting behavior. The recursive nature of the definition is the difficult to (directly) invent part. The result of (zipWith' (:) xs ___) is a list the same length as 'xs' and where each item is a list with the head item taken from 'xs'. So it has "primed the pump" of the recursive definition by giving the first Char in each of the strings. Compare with the much simpler 'ones' and 'odds': ones = 1 : ones odds = 1 : map (2+) odds What about the tails of all these lists? They come from res. And res comes from sorting the results of zipWith'. But the sorting is done only by looking at the head element, which is the one that comes from xs. That is the "sortBy (\(a:as) (b:bs) -> a `compare` b)" or "sortBy (compare `on` head)". So the sorting routine does not care about the ___ in (zipWith' (:) xs ___) at all, since it only examines the head which comes from xs.

Felipe Almeida Lessa wrote:
On 6/23/07, Bertram Felgenhauer
wrote: "rbwt" implements the corresponding inverse BWT. It's a fun knot tying exercise.
rbwt xs = let res = sortBy (\(a:as) (b:bs) -> a `compare` b) (zipWith' (:) xs res) in tail . map snd . zip xs $ head res
Indeed it's very fun =). Although I must admit that, even after staring at the code for some time, I still don't see how it works. Is it too much to ask for a brief explanation?
I see that zipWith' creates all the lazy list without requiring any knowledge of the second list, which means that the list to be sorted is created nicely. But when sortBy needs to compare, say, the first with the second items, it forces the thunk of the lazy list. But that thunk depends on the sorted list itself!
What am I missing? ;)
Mainly the precise effect of the irrefutable pattern in zipWith', I think:
zipWith' f [] _ = [] zipWith' f (x:xs) ~(y:ys) = f x y : zipWith' f xs ys
The ~(y:ys) does not pattern match immediately; instead it asserts that the argument will have a (:) constructor later, when either y or ys are forced. The last line is equivalent to
zipWith' f (x:xs) ys' = let (y:ys) = ys' in f x y : zipWith' f xs ys
which may be easier to understand. The end effect is that
map head $ zipWith' (:) [1..3] undefined
returns [1..3], never touching the "undefined" value at all. Trying
zipWith' [1..] [[]]
is also instructive: [[1],[2*** Exception: bwt.hs:(18,0)-(19,51): Irrefutable pattern failed for pattern (y : ys) Note the point where the error happens. The second thing is that the comparison in "rbwt" only compares the first character from the result lists - that is, it really only sorts the list xs, annotated with some additional information that is not available yet (isn't lazy evaluation great?). Once the sorting is done, the ~(y:ys) pattern matches of "zipWith'" all succeed. The effect is that a single circular list is formed from the characters in xs, with each rotation of the list occuring exactly once. In fact we get all rotations of the original text, in sorted order. This is by no means obvious, but is how the inverse BWT works. Hope that helps, Bertram P.S. Thanks Chris for your improvements :)

On Fri, 22 Jun 2007, Andrew Coppin wrote:
Woah... What the hell? I just switched to Data.ByteString.Lazy and WHAM! Vast speed increases... Jeepers, I can transform 52 KB so fast I can't even get to Task Manager fast enough to *check* the RAM usage! Blimey...
OK, just tried the 145 KB test file that Mr C++ used. That took 2 seconds + 43 MB RAM. Ouch.
A note re RAM usage: it behaves differently in a GCed environment, you might want to see if it runs with a smaller max heap size. Obviously you'll spend more time GCing. -- flippa@flippac.org "My religion says so" explains your beliefs. But it doesn't explain why I should hold them as well, let alone be restricted by them.

Andrew (>):
I was reading Wikipedia, and I found this:
Maybe it would be a nice idea to add back one of the implementations of the transform to the Wikipedia article... just to show how a Haskell version can be shorter, more readable and about as fast as a C/C++ version. // Carl
participants (12)
-
Andrew Coppin
-
apfelmus
-
Bertram Felgenhauer
-
Bulat Ziganshin
-
Carl Mäsak
-
Chris Kuklewicz
-
David Roundy
-
dons@cse.unsw.edu.au
-
Felipe Almeida Lessa
-
Jon Harrop
-
Philippa Cowderoy
-
Stefan O'Rear