Re: Space Efficiency When Sorting a List of Many Lists

3. Re: Space Efficiency When Sorting a List of Many Lists (Heinrich Apfelmus)
Another interesting problem is starting from a file of single wagers and trying to compress them (i.e. inverse of 'explosion'. I believe this problem is analogous to Set Cover and therefore NP-Complete. Heuristics are the order of the day here.
Interesting indeed! What kind of heuristics do you employ there? It seems quite difficult to me, somehow.
one starts to see various interesting hypercube type structures in
One approach I *did* consider early on was to define a kind of 'Pseudo Hamming Distance': i.e. define PHD such that any two single combination wagers differing in one leg's entrant number have PSD = 1 e.g. 1, 2, 3, 4 --A 1, 1, 3, 4 --B 5, 1, 3, 4 --C A and B have PHD = 1 B and C have PHD = 1 A and C have PHD = 2, but PSD > 1 is not of interest here. Construct an adjacency list for all PHD = 1 -> an undirected graph where nodes are combinations and edges link combinations with PSD = 1- the graphviz plots which are analogous to grouped wagers. Easy enough to see this by working backwards from (say) 1+2+2/4+5+6/7+8+9/10+11+12, exploding it by CP, constructing the PHD=1 adjacency list, turning this into a DOT file and firing up graphviz. These graphviz pictures look very lovely to the human eye. Unfortunately, computational extraction of such features in graph spaghetti land, let-alone the issue of extracting maximal features seems to be infeasible before the arrival of Quantum Computers, Maxwell's Demons in every car engine, and Unicorns in the back garden. I also considered a set cover approach, but got a little dispirited when I sat down to compute the number of potential covering sets - CP of power sets of race field sizes! Essentially wager generation in multi-leg races works like this: (1) We have estimated win probabilities for the entrants in each of the race legs. Assume that I have a black box which produces these estimated probabilities. (2) We generate the CP of the entrant numbers in the race fields: Leg 1 Entants x Leg 2 Entrants x.. For each combination generated by the CP, we apply some formula to the probabilities associated with combination entrant numbers to arrive at a decision about whether or not that particular combination represents a viable wager. (3) If a combination does represent a viable wager, we then need to ensure that we have sized its wager amount in a correct and optimal manner. This gives output looking like: 1, 1, 2, 3 - $1.10 1, 1, 2, 4 - $1.20 . . 15, 15, 15, 12, - $0.35 Unfortunately these are hard to compress well - and that is ignoring issues of what to do about differing wager sizes when we can match two combinations for merging together. So one possible hack is to insert step 1(a) where we k-means cluster the win probabilities in each leg, such that we end up with multiple entrant numbers per cluster mean (which is a probability). We then proceed through steps 2 and 3, but instead of generating single combinations, we are now directly producing wagers which look like: 1+6+12/3+4+7/13+14 - $1.20 In this case, k-means clustering will have clustered 1,6,12 together with cluster mean/probability of (say) 0.1, etc. This a bit of a hack, as clearly it will result in the wagers not covering exactly the same single combinations as if we had generated pure single wagers without k-means clustering. Hence my interest in doing set comparisons! Another brute force greedy approach is to start with single combinations and do something like this: def merge_combinations(combinations): """Greedily merge all combinations on one field, then next field, etc. Parameter combinations is structured so: [[[1], [2], [3], [4], [5], [6], [7]], [[1], [2], [3], [4], [5], [6], [8]]] The input combinations are modified *in place*. After calling this method, combinations would become: [[[1], [2], [3], [4], [5], [6], [7,8]]] """ def rotate_left(combi): combi.append(combi.pop(0)) for do_it_seven_times in xrange(7): combinations.sort() k = 1 while k < len(combinations): if combinations[k-1][:6] == combinations[k][:6]: combinations[k-1][6].extend(combinations.pop(k)[6]) combinations[k-1][6].sort() else: k += 1 map(rotate_left, combinations) def merge_demo(): from pprint import pprint combis = [[[1], [1], [1], [2], [2], [12], [1]], [[1], [1], [1], [2], [6], [3], [1]], [[1], [1], [1], [2], [6], [6], [1]], [[1], [1], [1], [2], [6], [7], [1]], [[1], [1], [1], [2], [6], [7], [2]], [[1], [1], [1], [2], [6], [12], [1]], [[1], [1], [1], [2], [8], [7], [1]], [[1], [1], [1], [2], [8], [12], [1]], [[1], [1], [1], [2], [9], [3], [1]], [[1], [1], [1], [2], [9], [7], [1]], [[1], [1], [1], [2], [9], [12], [1]], [[1], [1], [1], [3], [2], [7], [1]], [[1], [1], [1], [3], [2], [12], [1]]] pprint(sorted(combis)) # 13 print len(combis) merge_combinations(combis) pprint(sorted(combis)) # -> # [[[1], [1], [1], [2], [6], [6], [1]], # [[1], [1], [1], [2], [6], [7], [1, 2]], # [[1], [1], [1], [2], [6, 8, 9], [12], [1]], # [[1], [1], [1], [2], [6, 9], [3], [1]], # [[1], [1], [1], [2], [8, 9], [7], [1]], # [[1], [1], [1], [2, 3], [2], [12], [1]], # [[1], [1], [1], [3], [2], [7], [1]]] # 7 print len(combis) print print combis = [[[13], [6, 10, 13], [3, 6], [4], [2, 8], [7], [1]], [[13], [6, 10, 13], [3, 6], [4], [9], [7], [1]], [[13], [6, 10, 13], [3, 6], [4], [6], [12], [1]], [[13], [6, 10, 13], [3, 6], [4], [6], [7], [1]], [[13], [6, 10, 13], [5, 10], [10], [2, 8], [7], [1]], [[13], [6, 10, 13], [5, 10], [10], [9], [7], [1]], [[2, 9], [11], [7, 13], [6, 12], [9], [12], [3]], [[2, 9], [11], [7, 13], [6, 12], [9], [12], [2]], [[2, 9], [11], [7, 13], [6, 12], [9], [12], [1]], [[2, 9], [11], [7, 13], [6, 12], [9], [7], [5, 8, 14]], [[2, 9], [11], [7, 13], [6, 12], [9], [7], [7]], [[2, 9], [11], [7, 13], [6, 12], [9], [7], [4]]] pprint(combis) # 12 print len(combis) merge_combinations(combis) # -> # [[[13], [6, 10, 13], [3, 6], [4], [2, 6, 8, 9], [7], [1]], # [[13], [6, 10, 13], [3, 6], [4], [6], [12], [1]], # [[13], [6, 10, 13], [5, 10], [10], [2, 8, 9], [7], [1]], # [[2, 9], [11], [7, 13], [6, 12], [9], [12], [1, 2, 3]], # [[2, 9], [11], [7, 13], [6, 12], [9], [7], [4, 5, 7, 8, 14]]] pprint(combis) # 5 print len(combis) I wrote a version of this in PLT Scheme also a while back. IIRC, I had to flip everything around and perform actions at the head of lists instead of at the tail of Python lists (more like arrays really). Will have to try my hand at a Haskell version later and see how it goes - although I have the suspicion that feeding such a routine 1M single combinations is going to run into all sorts of space issues. Unfortunately this approach never seems to result in more than a 10 x compression factor. Can do much better with the k-means approach - at the cost of missing some theoretically desirable combinations and betting on some spurious combinations. As always, would be very interested to hear if you have any ideas/ insights on first seeing this kind of problem. I'm almost certainly too fixed in my thinking due to long association with the wagering problem domain and probably cannot see the wood for the trees!
What I mean is that there are two ways to represent sets of wagers:
* A list of single combinations -> [Row a] * A list of compressed wagers -> [Row [a]]
To perform set operations, you first convert the latter into the former. But the idea is that there might be a way of performing set operations without doing any conversion.
Namely, the compressed wagers are cartesian products which behave well under intersection: the intersection of two compressed wagers is again a compressed wager
intersectC :: Eq a => Row [a] -> Row [a] -> Row [a] intersectC [] [] = [] intersectC (xs:xrow) (ys:yrow) = intersect xs ys : intersectC xrow yrow
In formulas:
intersect (cartesian x) (cartesian y) = cartesian (intersectC x y)
for compressed wagers x, y :: Row [a] with
intersect = list intersection, for instance Data.List.intersect cartesian = converts a compressed wager to a list of single combinations
This allows you to intersect two lists of compressed wagers directly
intersect' :: [Row [a]] -> [Row [a]] -> [Row [a]] intersect' xs ys = filter (not . null) $ [intersectC x y | x<-xs, y<-ys] Unfortunately, the result will have quadratic size; but if you have significantly less compressed wagers than single combinations, this may be worth a try. Not to mention that you might be able to compress the result again.
Thank you for the explanation. This approach is going to be quite useful for most reasonable sized input sets. I'm thinking of running this as part of a suite of web services in something like a JEOS instance with relatively limited RAM - so there are advantages to avoiding exploding to singles and using 100+MB with a Map. import Data.List (intersect) import Data.List.Split (splitOn) import System.Environment (getArgs) type Row a = [a] type CombisGroup = Row [Int] -- e.g. ["8+12,11,7+13,10","1+2+3,1+9,3+6,4"] --> -- [[[8,12],[11],[7,13],[10]],[[1,2,3],[1,9],[3,6],[4]]] fromCSV :: [String] -> [CombisGroup] fromCSV = map parseOneLine where parseOneLine = map parseGroup . splitOn "," where parseGroup = map read . splitOn "+" parseInput :: String -> [CombisGroup] parseInput = fromCSV . lines singleCombisCount :: [CombisGroup] -> Int singleCombisCount = sum . map cpCardinality -- e.g. CP xss yss having zero common elements: -- intersectC [[1,2],[1,3,4],[5]] [[1,2],[4,5,6],[3]] --> [[1,2],[4],[]] -- e.g. CP xss yss having two common elements [[1,4,5],[2,4,5]] : -- intersectC [[1,2],[3,4],[5]] [[1,2],[4,5,6],[3,5]] --> [[1,2],[4], [5]] intersectC :: Eq a => Row [a] -> Row [a] -> Row [a] intersectC [] [] = [] intersectC (xs:xrow) (ys:yrow) = intersect xs ys : intersectC xrow yrow -- e.g. no common elements: -- intersect' [[[1,2],[1,3,4],[5]],[[1,2],[7,11],[5]]] -- [[[1,2],[4,5,6],[3]],[[11],[7,11],[3,5]]] --> [] -- e.g. two common elements: -- intersect' [[[1,2],[1,3,9],[5]],[[1,2],[3,4],[5]]] -- [[[1,2],[4,5,6],[3,5]],[[11],[4,5,6],[3,5]]] --> -- [[[1,2],[4],[5]]] --> CP --> [[1,4,5], [2,4,5]] intersect' :: Eq a => [Row [a]] -> [Row [a]] -> [Row [a]] intersect' xs ys = filter (notElem []) [intersectC x y | x <- xs, y <- ys] -- e.g. [[8,12],[11],[7,13],[10]] --> 4 cpCardinality :: Row [a] -> Int cpCardinality = product . map length main = do [arg1, arg2] <- getArgs compressedA <- return . parseInput =<< readFile arg1 compressedB <- return . parseInput =<< readFile arg2 putStrLn $ "A source: " ++ arg1 putStrLn $ "B source: " ++ arg2 putStrLn $ "|A Intersect B|: " ++ show (singleCombisCount $ intersect' compressedA compressedB)

Peter Green wrote:
Another interesting problem is starting from a file of single wagers and trying to compress them (i.e. inverse of 'explosion'. I believe this problem is analogous to Set Cover and therefore NP-Complete. Heuristics are the order of the day here.
Interesting indeed! What kind of heuristics do you employ there? It seems quite difficult to me, somehow.
Essentially wager generation in multi-leg races works like this:
(1) We have estimated win probabilities for the entrants in each of the race legs. Assume that I have a black box which produces these estimated probabilities.
(2) We generate the CP of the entrant numbers in the race fields: Leg 1 Entants x Leg 2 Entrants x.. For each combination generated by the CP, we apply some formula to the probabilities associated with combination entrant numbers to arrive at a decision about whether or not that particular combination represents a viable wager.
(3) If a combination does represent a viable wager, we then need to ensure that we have sized its wager amount in a correct and optimal manner.
This gives output looking like:
1, 1, 2, 3 - $1.10 1, 1, 2, 4 - $1.20 .. .. 15, 15, 15, 12, - $0.35
Unfortunately these are hard to compress well - and that is ignoring issues of what to do about differing wager sizes when we can match two combinations for merging together.
So one possible hack is to insert step 1(a) where we k-means cluster the win probabilities in each leg, such that we end up with multiple entrant numbers per cluster mean (which is a probability). We then proceed through steps 2 and 3, but instead of generating single combinations, we are now directly producing wagers which look like:
1+6+12/3+4+7/13+14 - $1.20
In this case, k-means clustering will have clustered 1,6,12 together with cluster mean/probability of (say) 0.1, etc.
This a bit of a hack, as clearly it will result in the wagers not covering exactly the same single combinations as if we had generated pure single wagers without k-means clustering. Hence my interest in doing set comparisons!
Another brute force greedy approach is to start with single combinations and do something like this: [...]
Unfortunately this approach never seems to result in more than a 10 x compression factor. Can do much better with the k-means approach - at the cost of missing some theoretically desirable combinations and betting on some spurious combinations.
As always, would be very interested to hear if you have any ideas/insights on first seeing this kind of problem. I'm almost certainly too fixed in my thinking due to long association with the wagering problem domain and probably cannot see the wood for the trees!
I think that you're on the right track with trying to compress the output before actually generating it, i.e. compressing after step (1) instead of after step (3). My reasoning is that I don't see a good reason why, a priori, a *random* list of single wager combinations could be compressed well into *cartesian products*. I mean, if you have data without apparent structure, the best you can do is to pipe it through gzip , which has nothing to do with cartesian products at all. So, the fact that your "compression" works at all is a strong hint that your data does have a lot of structure, which you saw in the graphviz visualizations and which you are wisely exploiting at creation time with step (1a). (The greedy algorithm you mentioned is doomed to failure because it doesn't respect symmetries. The optimal result is invariant under * interchanging columns * renaming all numbers in one column but the greedy algorithm gives completely different results when you apply these operations to the input.) To achieve better compression, you have to open the black boxes in (1) and (2). The good news is that you don't need to open them completely; instead you just want to know properties like for example * Prob( entrant A wins in leg 1) = Prob( entrant A wins in leg 2) * Prob( entrant A wins) is independent of other participants * if {A,B} is a viable wager for leg 1 , so are {A} and {B} ("monotonicity") * Prob( A, B, C ) = Prob(A in leg 1) * Prob(B in leg 2) * Prob(C in leg 3) ("independence") * etc... In other words, the idea is that the probability for race outcome with multiple legs is, say, the product of the probabilities of the single legs, which then leads to these hypercube structures in your list of wagers. (I don't know which properties you need exactly because I don't understand the details about the probabilities, choosing wagers and pricing them.) Armed with such properties, there are systematic ways to exploit them for efficient algorithms. See for example the case study De Moor and Gibbons. Bridging the algorithm gap: A linear-time functional program for paragraph formatting. http://tinyurl.com/ydt5n5z (This is a .ps.gz file!) Understanding this might also lead to a better representation of compressed wagers that also includes wager sizes. In a sense, you're storing a recipe for creating a list of single wagers instead of the list itself, and [Row [a]] is just one possible "recipe format". Regards, Heinrich Apfelmus -- http://apfelmus.nfshost.com
participants (2)
-
Heinrich Apfelmus
-
Peter Green