
12 Mar
2008
12 Mar
'08
3:55 p.m.
Duncan Coutts
To get something really compact we could use an index composed of three unboxed Int arrays.
To get something *really* compact, we could build a (kind of) suffix array. That is, we do a lexical sort of the lines, and store the sorted offsets of the lines in an array. You can then look up any index by binary search using this array. That's 10-15 characters plus one offset (4 bytes) per line, ~1.3 times the original file. (There are also compressed suffix array structures, but I don't think those will gain you anything if you don't store all the positions.) -k -- If I haven't seen further, it is by standing in the footprints of giants