[Haskell-cafe] Re: Space Efficiency When Sorting a List of Many Lists

9 Jan 2010

      Peter Green wrote:
...
Heinrich, thanks for some great help and food for thought!
My pleasure. :)
...
...
Thanks for describing the background of this problem in detail! I was
mainly asking because I'm always looking for interesting Haskell topics
that can be turned into a tutorial of sorts, and this problem makes a
great example.
Another interesting problem is starting from a file of single wagers and
trying to compress them  (i.e. inverse of 'explosion'. I believe this
problem is analogous to Set Cover and therefore NP-Complete. Heuristics
are the order of the day here.
Interesting indeed! What kind of heuristics do you employ there? It
seems quite difficult to me, somehow.
...
...
[...]
   main = interact (unlines . toCSV . sortExplode . fromCSV . lines)
Thank you *very* much for this code! I will try to get my head around
it. I understand the broad outline you posted elsewhere, but will take
me a while to fully grasp the above as I'm only up to ~p200 in Real
World Haskell :).
As for performance of your code above on my file of compressed wagers
which expands to 3.7M single wagers:
(My original version posted here and using vanilla sort)
541 MB total memory in use (5 MB lost due to fragmentation)
INIT  time    0.00s  (  0.01s elapsed)
  MUT   time   13.98s  ( 15.82s elapsed)
  GC    time    8.69s  (  9.64s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time   22.67s  ( 25.47s elapsed)
(Your much improved version)
10 MB total memory in use (1 MB lost due to fragmentation)
  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    7.61s  (  9.38s elapsed)
  GC    time    3.48s  (  3.58s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time   11.08s  ( 12.95s elapsed)
Very impressive and thanks again!
Category theory saves the day. ;)

The code is based on three observations:
1) Sorting and exploding can be interleaved to create version that can
stream results in an on-line fashion.
2) There is a standard formulation of this pattern in terms of folds and
unfolds (= catamorphisms and anamorphisms).
3) Lazy evaluation can be used to make this look like an off-line algorithm.

To my surprise, I don't have good references for these, but maybe the
following can help a bit. Something similar to 1) can be found in

  Graham Hutton. The countdown problem.
  http://www.cs.nott.ac.uk/~gmh/bib.html#countdown

and

  Richard Bird. A program to solve Sudoku.
  http://www.cs.tufts.edu/~nr/comp150fp/archive/richard-bird/sudoku.pdf

The standard reference for 2) is

  Meijer et al. Functional programming with bananas,
    lenses, envelopes and barbed wire.
  http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.41.125

and 3) is folklore, maybe

  John Hughes. Why functional programming matters.
  http://www.cs.chalmers.se/~rjmh/Papers/whyfp.html

could serve as an introduction to lazy evaluation.
...
...
Note that there's also the possibility of not expanding the compressed
wagers at all, and perform set operations directly. For instance, it is
straightforward to intersect two such sets of size n in O(n^2) time.
Since n ~ 5000 , n^2 is about the same ballpark as the exploded single
wager combinations.
Not sure I quite understand you here. In my mind, set elements *are*
single combinations. It is possible for two quite different-looking
files of compressed wagers to contain exactly the same single wager
elements. So I'm not sure how to compare without some notion of
explosion to single combinations.
What I mean is that there are two ways to represent sets of wagers:

* A list of single combinations ->  [Row a]
* A list of compressed wagers   ->  [Row [a]]

To perform set operations, you first convert the latter into the former.
But the idea is that there might be a way of performing set operations
without doing any conversion.

Namely, the compressed wagers are cartesian products which behave well
under intersection: the intersection of two compressed wagers is again a
compressed wager

   intersectC :: Eq a => Row [a] -> Row [a] -> Row [a]
   intersectC []        []        = []
   intersectC (xs:xrow) (ys:yrow) =
       intersect xs ys : intersectC xrow yrow

In formulas:

     intersect (cartesian x) (cartesian y)
   = cartesian (intersectC x y)

for compressed wagers  x, y :: Row [a]  with

   intersect = list intersection, for instance Data.List.intersect
   cartesian = converts a compressed wager to a list of
               single combinations

This allows you to intersect two lists of compressed wagers directly

   intersect' :: [Row [a]] -> [Row [a]] -> [Row [a]]
   intersect' xs ys = filter (not . null) $
                      [intersectC x y | x<-xs, y<-ys]

Unfortunately, the result will have quadratic size; but if you have
significantly less compressed wagers than single combinations, this may
be worth a try. Not to mention that you might be able to compress the
result again.

Mathematically, this exploits the equations

   (A × B) ∩ (C × D) = (A ∩ C) × (B ∩ D)

   "The intersection of two rectangles is another rectangle"

and

   (A ∪ B) ∩ (C ∪ D) = (A ∩ C) ∪ (A ∩ D)
                       ∪ (B ∩ C) ∪ (B ∩ D)

where A,B,C,D are sets.

Regards,
Heinrich Apfelmus

--
http://apfelmus.nfshost.com