
Hi,
I'm about to rework an old program, and I am pondering what data
structure to use. Basically, I want to store an index of fixed-length
words (q-grams) with positions, i.e. given a word, I can quickly find
its position in the data.
The data set may be large (gigabytes would be nice), and I also want
the option of storing a sparse set of the words in the data
(i.e. every other word, every third, etc).
I *think* I want to use a Map from words to positions, words packed to
fit in Int(32|64|eger) (depending on word size), and positions as
Int(32|64). I've used a similar approach with FiniteMaps previously,
and it worked nicely, but for smaller data sets.
However:
Using a 'Map key [pos]' will eat a lot of space for the lists.
'Map key (UArray Int pos)' will be more efficient, but every insertion
and deletion will destroy and create a new array -- and I would like
the Map to live outside a monad (does there even exist an (IO|ST)Map?)
I've considered lists of small(ish -- cache line compatible, probably)
arrays. Another possibility is a suffix array, and a Map from words
to regions in it. How will a hashtable work in this setting?
Any thoughts?
-kzm
PS: it seems from a recent thread in comp.lang.functional (see
participants (1)
-
Ketil Malde