I have text files containing 'compressed lists' Compressed lists
look like this:
8+12,11,7+13,10
1+2+3,1+9,3+6,4
.
.
Sublists are comma-delimited, and sublist elements are separated by
'+' character. Sublists contain integers in the range 1..20.
I parse these to look like so:
[[[8,12],[11],[7,13],[10]],
[[1,2,3],[1,9],[3,6],[4]],
.
.
]
I need to explode these and produce a lex-sorted list of exploded lists:
[[1,1,3,4],[1,1,6,4],[1,9,3,4],[1,9,6,4],[2,1,3,4],[2,1,6,4],[2,9,3,4],
[2,9,6,4],[3,1,3,4],[3,1,6,4],[3,9,3,4],[3,9,6,4],
[8,11,7,10],[8,11,13,10],[12,11,7,10],[12,11,13,10]]
I then output this data as comma-delimited lists:
1,1,3,4
1,1,6,4
.
.
12,11,13,10
It is a property of the underlying problem 'structure' that none of these comma-delimited lists in the exploded data is repeated. i.e. we can never see two instances of (say) 1,1,3,4.
{ Begin Aside on Why I Would Want to do Such a Silly Thing:
'Compressed lists' are in fact compressed wager combinations for multi-leg exotic horse racing events. 'Exploded lists' are the single wager combinations covered by the grouped combinations.
Why use compressed combinations? Well, it's a lot easier to submit 5,000 compressed tickets (whether physically or electronically) than 1M single tickets for the individual combinations we have somehow managed to represent by the 5,000 compressed tickets.
However, single combinations are the currency of the realm. These are the underlying wagers and compression is merely a convenience/necessity to enable single wagers to be placed.
It's not uncommon to generate large numbers of single combinations: Imagine 6 races with 15 runners in each race, and a wager requiring one to select first place getters in all 6 legs. The number of potential outcomes is 15^6 = 11,390,625. One might decide (somehow) that it makes sense in a positive expectation way to wager (say) 500K of these potential outcomes.
Getting from theoretically desirable single combinations to optimally compressed wagers is actually NP-Complete and another story for another day. Suffice it to say that one might use some nasty hacks and heuristics to arrive at compressed wagers - but in the process doing violence to one's precise coverage of the theoretically desirable single combinations. That's one reason why I need to recover (by 'exploding') the *actual* wagered single combinations from the compressed/ wagers.
I need to explode these compressed wagers back to single combinations because, I need to (eventually) do set comparisons on collections of single combinations. i.e. given File A and File B of compressed wagers, I need to be able to answer questions like:
(1) How many single combinations are common to File A and File B
(2) How many single combinations in File A are missing from File B
.
.
etc. i.e a few of the common set comparison operations.
And the dumbest, quickest and dirtiest and most idiot-proof way of doing this as a baseline approach before I start using maps, tries, etc... is to explode Files A and B into lex sorted single combinations order and then use diff -y with judicious grepping for '>' and <''
End Aside }
I'm probably going to go with an external sort for starters to keep memory usage down. For most use-cases, I can get away with my current in-memory sort - just have to beware edge cases.
However, later on, I'm keen to do everything with set theoretic operations and avoid writing large files of single combinations to disk.
So, I guess the things I'd like to get a feel for are:
(1) Is it realistic to expect to do in-memory set comparisons on sets with ~1M elements where each element is a (however-encoded) list of (say) 6 integers? I mean realistic in terms of execution time and space usage, of course.
(2) What would be a good space-efficient encoding for these single combinations? I know that there will never be more than 8 integers in a combination, and none of these values will be < 1 or > 20. So perhaps I should map them to ByteString library Word8 strings? Presumably sorting a big list of ByteStrings is going to be faster than sorting a big list of lists of int?