
Hi, Toby Hutton wrote:
Say I want to put the words 'foo', 'bar' and 'baz' into a binary tree. The heuristic requires I split the words into letters and sort them: 'aabbfoorz'. The heuristic then may decide, based on the sorted letters, that 'bar' and 'foo' should go in the left child and 'baz' goes in the right. Typically we'd then simply recurse and for example, the left child's words would be re-sorted into 'abfoor' and the heuristic is reapplied.
If we assume that sorting is relatively expensive, we can avoid the re-sort for the children by unmerging the parent's sorted list of letters. Two sublists of a sorted list should already be sorted. If we know which word each letter belongs to it would be more efficient to tag the letters with 'left' or 'right' as the words are classified. Then we can iterate down the sorted letter list and produce new sorted sublists rather simply.
So it's not actually that complicated, and I can imagine exactly how it could be done in C but I really don't know how to approach this in Haskell.
What about just storing with each character a "reference" to the word it orignally comes from?
import Data.List (sortBy)
data Tagged = Tag String Char
tag :: Tagged -> String tag (Tag x _) = x
compareTagged :: Tagged -> Tagged -> Ordering (Tag _ x) `compareTagged` (Tag _ y) = x `compare` y
tagWord :: String -> [Tagged] tagWord word = map (Tag word) word
unmerge :: [a] -> (a -> Bool) -> ([a], [a]) unmerge xs p = foldr f ([], []) xs where f x (ls, rs) | p x = (x:ls, rs) | otherwise = (ls, x:rs)
data Tree = Empty | Leaf String | Node Tree Tree
tree :: ([Tagged] -> String -> Bool) -> [String] -> Tree tree heuristic words = tree' words sorted where
-- we only need to sort once ... sorted = sortBy compareTagged . concat . map tagWord $ words
tree' [] _ = Empty tree' [word] _ = Leaf word
tree' words sorted = let predicate = heuristic sorted (leftWords, rightWords) = unmerge words predicate -- ... because we reuse the ordered list by unmerging it (leftSorted, rightSorted) = unmerge sorted (predicate . tag) in Node (tree' leftWords leftSorted) (tree' rightWords rightSorted)
Tillmann