
To: Casey Hawthorne
So, as I understand it, you have a very large sparse table, thousands of rows and hundreds of columns, where each column is of type String, Int, or Double and each cell holds either a value of that type or nothing.
Then you want to shuffle the rows to maximize the number of columns whose first 100 rows contain at least one number (Int or Double), given a list of preferred column names, since there is no guarantee that every number column will have at least one number in its first 100 rows after shuffling.
I'm wondering about building one hash keyed on the rows and one keyed on the columns. The column hash would hold, for each column, the number of Ints or Doubles in that column (the Strings aren't needed) and the rows they are in.
The row hash would hold, for each row, the number of Ints and Doubles in that row and which columns they are in.
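To make the two hashes concrete, here is a minimal sketch in Python. The table layout (a dict of row index to a dict of column name to value, with empty cells simply absent) and the names build_indexes, col_rows and row_cols are my own assumptions for illustration, not anything from your code. The counts are just the sizes of the stored sets.

```python
from collections import defaultdict

# Hypothetical sparse table: {row_index: {column_name: value}}.
# Empty cells are simply missing keys.
table = {
    0: {"name": "a", "height": 1.5},
    1: {"name": "b"},
    2: {"height": 2.0, "count": 3},
}

def is_number(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)

def build_indexes(table):
    col_rows = defaultdict(set)  # column name -> rows holding a number in it
    row_cols = defaultdict(set)  # row index -> columns holding a number in it
    for row, cells in table.items():
        for col, value in cells.items():
            if is_number(value):  # Strings are skipped entirely
                col_rows[col].add(row)
                row_cols[row].add(col)
    return dict(col_rows), dict(row_cols)

col_rows, row_cols = build_indexes(table)
```

len(col_rows[c]) is then the number of Ints or Doubles in column c, and len(row_cols.get(r, ())) is the same count for row r.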
Then scan the row hash and sort it into descending order by that count, tagging each row with its new position rather than actually moving it.
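A sketch of the sort-by-tagging step, again in Python. The row_cols data here is made up, and breaking ties by row index is my own choice so that the result is deterministic. The table itself is never touched; only a list of row indices and a position tag per row are produced.

```python
# Hypothetical row hash: row index -> set of columns holding a number in that row.
row_cols = {0: {"height"}, 1: set(), 2: {"height", "count"}, 3: {"count"}}

def rank_rows(row_cols):
    # Sort row indices by how many numbers the row holds, largest first.
    # Ties are broken by row index (an arbitrary choice for determinism).
    return sorted(row_cols, key=lambda r: (-len(row_cols[r]), r))

order = rank_rows(row_cols)

# The "tag": each row's position in the new ordering.
position = {row: pos for pos, row in enumerate(order)}
```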
Then I think you're ready for simulated annealing.
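For what it's worth, a rough sketch of how the annealing step might look in Python, starting from the tagged ordering. Everything here is an assumption on my part: the score (number of preferred columns with at least one number in the first window_size rows), the move (swap one row inside the window with one outside), the geometric cooling schedule, and all parameter values. Treat it as a starting point, not a tuned solution.

```python
import math
import random

def coverage(window, row_cols, preferred):
    # Number of preferred columns with at least one number in the window.
    covered = set()
    for row in window:
        covered |= row_cols.get(row, set())
    return len(covered & preferred)

def anneal(order, row_cols, preferred, window_size=100, steps=2000,
           start_temp=1.0, cooling=0.995, seed=0):
    rng = random.Random(seed)
    order = list(order)
    k = min(window_size, len(order))
    score = coverage(order[:k], row_cols, preferred)
    if k == 0 or k == len(order):
        return order, score  # nothing to swap in or out
    best_order, best_score = list(order), score
    temp = start_temp
    for _ in range(steps):
        i = rng.randrange(k)               # a row inside the window
        j = rng.randrange(k, len(order))   # a row outside the window
        order[i], order[j] = order[j], order[i]
        new_score = coverage(order[:k], row_cols, preferred)
        delta = new_score - score
        # Always accept improvements; accept a worse state with
        # probability exp(delta / temp).
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            score = new_score
            if score > best_score:
                best_order, best_score = list(order), score
        else:
            order[i], order[j] = order[j], order[i]  # undo the swap
        temp = max(temp * cooling, 1e-9)
    return best_order, best_score

# Tiny made-up example with a window of 3 rows instead of 100.
row_cols = {0: {"a"}, 1: {"b"}, 2: {"c"}, 3: set(), 4: set()}
preferred = {"a", "b", "c"}
start = [3, 4, 0, 1, 2]
best_order, best_score = anneal(start, row_cols, preferred, window_size=3)
```

Recomputing the coverage from scratch on every step is wasteful for thousands of rows; the column hash could be used to update the score incrementally after each swap.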
-- Regards, Casey