
Iain Alexander wrote:
Another suggestion:
Put your strings in an ordered binary tree (other data structures might also work here).
Make your Atom an encoding of the structure of the tree (resp. other structure). This is logically a sequence of bits, 0 for left (less than), 1 for right (greater than) - if you terminate this variable-length sequence with 1, I think it all works out OK. You then encode this sequence of bits in a Word32 (or some other convenient item), with implicit trailing zeros.
root ~> 1[000...] root.left ~> 01[000...] root.right ~> 11[000...] ...
There are clearly some questions such as how unbalanced is the tree likely to get (since you can't conveniently rebalance it), and what to do if your Atom "overflows", but this might give you something to work with.
Thanks for the suggestion. However I think that just using a binary tree would be equivalent to just using the representation of the original string as a sequence of bits (except for "early out" eg representing 0b00001111 as just 1111 etc), and any other systematic encoding of strings that preserved all the info necessary for lexicographic ordering (that still works when new atoms are created later) would also end up using at least the same number of bit comparisons on average (since a new information preserving representation is just a permutation of the original at the bit level - if some strings are represented with fewer bits others would need more bits so it would all balance out). The only way I can see of getting truly O(1) lexicographic comparisons (by this I mean O(log P) where P is the maximum number of atoms that can exist simultaneously at any point during program execution) would be to have a "floating" representation (accessed via an IORef) where the concrete values of existing atoms are changed as necessary behind the scenes as new atoms are created, but as Chris pointed out this would require a locking operation if the atoms were used in a multi-threaded program, so all in all, it may well be much faster to just use ByteStrings especially if the Atoms are just to be used to represent fairly short strings such as module names or natural language words etc. Still I think this would be an interesting project - Prolog style Atoms with O(1) lexicographic ordering and O(n + f k) creation time where k is the size of the existing population and f is some function which minimizes the amortized time... Regards, Brian. -- Logic empowers us and Love gives us purpose. Yet still phantoms restless for eras long past, congealed in the present in unthought forms, strive mightily unseen to destroy us. http://www.metamilk.com