RFC: Data.StringMap

Hello, I've been playing about with a bit of applied AVLism to implement (hopefully) efficient String maps. I've posted what I've done so far here.. http://homepages.nildram.co.uk/~ahey/HLibs/Data.StringMap/ It's early days so it's a bit limited and hardly tested at all. I just want to ask peoples opinions on this before I go further. Like.. Does anybody object to me usurping the Data.StringMap slot in the hierarchical namespace? It seems to be vacant at the moment but maybe someone has other ideas about how this should be filled. Is anybody working on possible alternatives? If so it would be good to do some performance comparisons. I think the data structure I've used will turn out quite efficient, but there are other possibilities like using packed strings and/or hashing. Any suggestions or feature requests? Also, I'd be grateful someone could tell me what the data structure I've used is actually called :-) Thankyou -- Adrian Hey

On Thursday 18 Aug 2005 8:27 am, Axel Simon wrote:
On Thu, 2005-08-18 at 08:25 +0100, Adrian Hey wrote:
Also, I'd be grateful someone could tell me what the data structure I've used is actually called :-)
AFAIK, what you've implemented is called a trie, pronounced "try" from retrieval.
Thanks, I guessed it is some form of trie but I'm a bit confused about the precise definitions of various very similar data structures. I was looking at.. http://www.cs.mcgill.ca/~cs251/OldCourses/1997/topic7/ I'm not entirly clear about the difference between "tries", "compact tries" (depends what you call compact AFAICS) and "PATRICIA tries", and is a PATRICIA trie the same thing as a "PATRICIA tree"? The TString type is basically a linked list which may or may not be terminated with an N way fork (in the form of an AVL tree, N >=2), but then strings are linked lists anyway in Haskell. So is this compact or not I wonder? It's compact in the sense that it doesn't require an AVL tree at every character (most of which would be singletons in practice). I guess you could achieve the same thing with this definition.. data TString v = TString {-# UNPACK #-} !Char String v !(AVL (TString v)) .. which might look more compact, but isn't really AFAICS (well not while Strings are linked lists of boxed Chars). Replacing the String with a packed string might be better, but I'm not sure how efficient those are in reality (and if I was to go that route it might be a lot simpler and more efficient to use a completely different approach, such as hashing and a pure AVL tree say). Actually, now I think of it I will try that and compare performance of the two for simple inserts and lookups. (If it's fast enough it will certainly be much simpler to implement). Regards -- Adrian Hey

On Thu, 2005-08-18 at 12:17 +0100, Adrian Hey wrote:
On Thursday 18 Aug 2005 8:27 am, Axel Simon wrote:
On Thu, 2005-08-18 at 08:25 +0100, Adrian Hey wrote:
Also, I'd be grateful someone could tell me what the data structure I've used is actually called :-)
AFAIK, what you've implemented is called a trie, pronounced "try" from retrieval.
Thanks, I guessed it is some form of trie but I'm a bit confused about the precise definitions of various very similar data structures. I was looking at.. http://www.cs.mcgill.ca/~cs251/OldCourses/1997/topic7/
Ah! I thought it was a benign/naive question, but it really was a trap and I fell for it...
I'm not entirly clear about the difference between "tries", "compact tries" (depends what you call compact AFAICS) and "PATRICIA tries", and is a PATRICIA trie the same thing as a "PATRICIA tree"?
The TString type is basically a linked list which may or may not be terminated with an N way fork (in the form of an AVL tree, N >=2), but then strings are linked lists anyway in Haskell. So is this compact or not I wonder?
I guess in that sense there are no compact tries in Haskell. Unless you use PackedString.
It's compact in the sense that it doesn't require an AVL tree at every character (most of which would be singletons in practice). I guess you could achieve the same thing with this definition..
data TString v = TString {-# UNPACK #-} !Char String v !(AVL (TString v))
.. which might look more compact, but isn't really AFAICS (well not while Strings are linked lists of boxed Chars).
There is still a difference: If you look up a word, the suffix of the word will never be evaluated. In particular, the chain of TString nodes containing a character each will never be built. This could avoid potential space leaks. Furthermore, a compiler might do some optimisation magic on lists, which doesn't apply to your data type.
Replacing the String with a packed string might be better, but I'm not sure how efficient those are in reality (and if I was to go that route it might be a lot simpler and more efficient to use a completely different approach, such as hashing and a pure AVL tree say).
PackedStrings are certainly not the right thing to put into a general- purpose data structure. The problem is that if you request one character from a thunk that contains a conversion to PackedString, the whole packed string will be built before this single character can be extracted again. Hence the library would have a rather weird evaluation behaviour. I once fiddled with Olaf Chitil's pretty printing library and tried to improve it by using PackedStrings. It blew right up into my face since it forced stuff which was nicely delayed before, rendering all benefits of the concise representation useless. My 5p, Axel.

On Thursday 18 Aug 2005 1:50 pm, Axel Simon wrote:
Ah! I thought it was a benign/naive question, but it really was a trap and I fell for it...
? I think that remark shows extreme paranoia and quite disingenuous. It was a perfectly reasonable question, or so I thought. But I think you're probably right about just calling it a Trie, without qualification. The more I look at the various different(ly named) Tries I've found, the more they look the same to me. Regards -- Adrian Hey

On Fri, 2005-08-19 at 08:38 +0100, Adrian Hey wrote:
On Thursday 18 Aug 2005 1:50 pm, Axel Simon wrote:
Ah! I thought it was a benign/naive question, but it really was a trap and I fell for it...
?
I think that remark shows extreme paranoia and quite disingenuous. It was a perfectly reasonable question, or so I thought.
I'm sure it wasn't meant seriously. Axel wasn't seriously suggesting that you had set up a trap. It was intended as a light hearted comment, that the question was more subtle than it intially looked. Don't worry! :-)
But I think you're probably right about just calling it a Trie, without qualification. The more I look at the various different(ly named) Tries I've found, the more they look the same to me.
Regards -- Adrian Hey
_______________________________________________ Libraries mailing list Libraries@haskell.org http://www.haskell.org/mailman/listinfo/libraries

Adrian Hey wrote:
I'm not entirly clear about the difference between "tries", "compact tries" (depends what you call compact AFAICS) and "PATRICIA tries", and is a PATRICIA trie the same thing as a "PATRICIA tree"?
A trie and a PATRICIA tree is not the same thing. A PATRICIA tree is also known as a Radix tree, which discloses the way it works a lot more since it basicly stores bitstrings but skips over places where the strings are common. Knuth, ''The art of computer programming'' Volume 3 has a little 1.5 page treatise on PATRICIA trees and does also refer to further information. I guess the trie/tree confusion can be neglected and taken as being ''the same''. A good way of looking at tries is due to Chris Okasaki, I think. He views them as a (SML-) functor from finite maps to finite maps turning a FiniteMap Char a into FiniteMap [Char] a. ''Purely functional data structures'' is the book.

On Friday 19 Aug 2005 9:49 am, Jesper Louis Andersen wrote:
A trie and a PATRICIA tree is not the same thing. A PATRICIA tree is also known as a Radix tree, which discloses the way it works a lot more since it basicly stores bitstrings but skips over places where the strings are common. Knuth, ''The art of computer programming'' Volume 3 has a little 1.5 page treatise on PATRICIA trees and does also refer to further information.
Thanks for that reminder (I've got a copy but it never occured to me to look:-) I've just had a quick look, I'll have to look some more to grok why this is significant (Knuth seems to have explained the solution in some detail without first explaining the problem :-). Maybe I have to read the whole chapter to find that out. At.. http://www.nist.gov/dads/HTML/patriciatree.html I find the definition.. "A compact representation of a trie where any node which is an only child is merged with its parent" So put another way, every node must have at least 2 (or 0) children. Now if for "node" you read "TString terminated with Fork (or Term)", then it seems like what I've used could be called a PATRICIA tree, albeit one where the nodes are represented as linked lists. But I dunno, maybe some would say that a linked list representation means they don't qualify as being nodes at all. Or maybe the above definition is just wrong or over simplified (but it would seem to be consistent with the trees(tries) used in Data.IntMap, which are PATRICIA apparently). Anyway, this is probably all getting a little OT for this mailing list so I apologise for this. Regards -- Adrian Hey
participants (4)
-
Adrian Hey
-
Axel Simon
-
Duncan Coutts
-
Jesper Louis Andersen