[Haskell-cafe] Re: Slower with ByteStrings?

29 May 2007

      Mirko Rahn wrote:
...
...
...
from the letters of that word.  A letter can be used at most as many
times as it appears in the input word.  So, "letter" can only match
words with 0, 1, or 2 t's in them.
...
frequencies = map (\x -> (head x, length x)) . group . sort
superset xs = \ys -> let y = frequencies ys in
    length y == lx &&
    and (zipWith (\(c,i) (d,j) -> c == d && i >= j) x y)
  where
    x  = frequencies xs
    lx = length x
As far as I understand the spec, this algorithm is not correct:
superset "ubuntu" "tun" == False
Is at least one 'b' necessary, yes or no?
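
To make the objection concrete, here is a small sketch that reproduces the quoted frequency-based definition and evaluates it on the example. The name supersetFreq and the String-specialised type signatures are assumptions added here so the snippet compiles on its own; the logic is meant to match the quoted code.

```haskell
import Data.List (group, sort)

-- Letter counts of a word, ordered by letter.
frequencies :: String -> [(Char, Int)]
frequencies = map (\x -> (head x, length x)) . group . sort

-- The quoted definition under a hypothetical name.
-- It demands that both words have the same set of distinct letters.
supersetFreq :: String -> String -> Bool
supersetFreq xs = \ys -> let y = frequencies ys in
    length y == lx &&
    and (zipWith (\(c,i) (d,j) -> c == d && i >= j) x y)
  where
    x  = frequencies xs
    lx = length x

-- frequencies "ubuntu" == [('b',1),('n',1),('t',1),('u',3)]
-- frequencies "tun"    == [('n',1),('t',1),('u',1)]
-- The lists have 4 and 3 entries, so supersetFreq "ubuntu" "tun" == False.
```

The length test is what rejects "tun": a candidate word passes only if it uses every distinct letter of the input at least once.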
Oops, you are indeed right, the answer should be "no". I thought I'd
get away without primitive recursion, but here's a correct version:

  superset xs = superset' x . sort
    where
    x = sort xs

    _      `superset'` []     = True
    []     `superset'` _      = False
    (x:xs) `superset'` (y:ys)
        | x == y    = xs `superset'` ys
        | x <  y    = xs `superset'` (y:ys)
        | otherwise = False
...
If the answer is no, the
following algorithm solves the problem and is faster than the one above:
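import Control.Monad (foldM, mzero)

-- Remove one occurrence of y from a list; mzero (no result) if there is none.
-- The signature fixes the list monad so that `null` below is unambiguous.
del :: Eq a => a -> [a] -> [[a]]
del y = del_acc []
    where del_acc _ []              = mzero
          del_acc v (x:xs) | x == y = return (v++xs)
          del_acc v (x:xs)          = del_acc (x:v) xs
super u = not . null . foldM (flip del) u
main = interact $ unlines . filter ("ubuntu" `super`) . lines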
The algorithm is correct, but it's not faster: xs `super` ys takes
O(n*m) time, whereas superset takes O(n * log n + m * log m) time given a
proper sorting algorithm. Here, n = length xs and m = length ys.

Actually, both algorithms are essentially the same, except that the
sorting allows some equality tests to be dropped.

(Note that memoizing x = sort xs over different ys speeds things up a
bit for the intended application. This way, (sort "ubuntu") is only
computed once and the running time over many ys approaches O(n + m*log m).)
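
A minimal sketch of that sharing, assuming the sorted-merge definition of superset from above. The helper names go, matchesUbuntu and filterWords are hypothetical. Because x is bound in the where clause of a function that takes only xs, the partial application superset "ubuntu" should keep a single thunk for sort "ubuntu" that is reused for every ys.

```haskell
import Data.List (sort)

-- Multiset inclusion test on sorted lists.
superset :: Ord a => [a] -> [a] -> Bool
superset xs = go x . sort
  where
    x = sort xs   -- shared by all calls made through one partial application

    go _      []     = True       -- every letter of ys was matched
    go []     _      = False      -- ys has letters left, xs has none
    go (p:ps) (q:qs)
      | p == q    = go ps qs      -- use p to pay for q
      | p <  q    = go ps (q:qs)  -- p is not needed, skip it
      | otherwise = False         -- q cannot occur later in the sorted xs

-- Partial application: the input word is sorted once, on first use.
matchesUbuntu :: String -> Bool
matchesUbuntu = superset "ubuntu"

filterWords :: [String] -> [String]
filterWords = filter matchesUbuntu

-- filterWords ["tun", "bunt", "tt", "unbutu", "nut", "ubuntuu"]
--   == ["tun", "bunt", "unbutu", "nut"]
```

Writing superset xs ys = ... with both arguments on the left-hand side would typically recompute sort xs on every call, which is why the definition returns a function instead.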

Regards,
apfelmus

PS: Some exercises for the interested reader:
1) Still, the algorithm super has an advantage over superset. Which one?
2) Put xs into a good data structure and achieve O(m * log n) time for
multiple ys.
3) Is this running time always better than the aforementioned O(n +
m*log m)? What about very large m > n?
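
For readers who want a starting point for exercise 2, here is one possible direction, offered as a sketch rather than as the intended answer. It assumes Data.Map.Strict from the containers package as the "good data structure"; the names superMap, counts and use are hypothetical. The counts of xs are built once, and each letter of ys then costs one lookup and one insert, both logarithmic in the number of distinct letters of xs.

```haskell
import Control.Monad (foldM)
import Data.Maybe (isJust)
import qualified Data.Map.Strict as Map

-- Multiset inclusion via a map from letter to remaining count.
superMap :: Ord a => [a] -> [a] -> Bool
superMap xs = isJust . foldM use counts
  where
    -- Built once per partial application, in O(n * log n).
    counts = Map.fromListWith (+) [(x, 1 :: Int) | x <- xs]

    -- Spend one occurrence of y, or fail if none is left.
    use m y = case Map.lookup y m of
      Just k | k > 0 -> Just (Map.insert y (k - 1) m)
      _              -> Nothing

-- superMap "ubuntu" "tun" == True
-- superMap "letter" "ttt" == False
```

The fold runs in the Maybe monad, so it stops at the first letter that cannot be paid for. Exercise 3 is left open.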