
-------------------------------------------- -- Announcing: bytestring-trie 0.1.0 (beta) -------------------------------------------- An efficient finite map from (byte)strings to values. The implementation is based on big-endian patricia trees, like Data.IntMap. We first trie on the Word8 elements of a Data.ByteString, sharing string prefixes where possible, and then trie on the big-endian bit representation of those elements. Patricia trees have efficient algorithms for union and other merging operations, but they're also quick for lookups and insertions. -------------------------------------------- -- Future Extensions -------------------------------------------- * I've done spot checking to make sure it works, but haven't constructed an extensive test suite. Does anyone know of a good data set already out there for testing corner cases? I'm sure other dictionary writers have come up with one. * A Word8 is much smaller than the architecture's natural Word size. Therefore it'd be more efficient for the trie on bits to read off a whole Word at a time. This work is in progress, but I need help from people with 64-bit and big-endian machines in order to verify the code works on those architectures. * Using ByteStrings allows for trieing on any packed byte representation of a value, but they're not Strings. Integration with Data.ByteString.Char8, utf8-string, and utf8-light should happen. * The current implementation also only accepts strict ByteStrings. When chopping up strings to share prefixes we share the smaller string. For very long strings when many deletions are expected, this can still hold onto more memory than necessary. Accepting lazy ByteStrings or adding heuristics for when to copy prefixes instead of sharing will fix this. -------------------------------------------- -- Links -------------------------------------------- Homepage: http://code.haskell.org/~wren/ Hackage: http://hackage.haskell.org/cgi-bin/hackage-scripts/package/bytestring-trie Darcs: http://code.haskell.org/~wren/bytestring-trie/ Haddock (Darcs version): http://code.haskell.org/~wren/bytestring-trie/dist/doc/html/bytestring-trie/ -- Live well, ~wren

Thanks for working on this! A nice efficient bytestring-trie is the sort of data structure we should have had in Haskell for some time now, and I'm sure I'll be giving it a great deal of use. Regards, Sterl. On Dec 20, 2008, at 1:06 AM, wren ng thornton wrote:
-------------------------------------------- -- Announcing: bytestring-trie 0.1.0 (beta) --------------------------------------------
An efficient finite map from (byte)strings to values.
The implementation is based on big-endian patricia trees, like Data.IntMap. We first trie on the Word8 elements of a Data.ByteString, sharing string prefixes where possible, and then trie on the big-endian bit representation of those elements. Patricia trees have efficient algorithms for union and other merging operations, but they're also quick for lookups and insertions.
-------------------------------------------- -- Future Extensions --------------------------------------------
* I've done spot checking to make sure it works, but haven't constructed an extensive test suite. Does anyone know of a good data set already out there for testing corner cases? I'm sure other dictionary writers have come up with one.
* A Word8 is much smaller than the architecture's natural Word size. Therefore it'd be more efficient for the trie on bits to read off a whole Word at a time. This work is in progress, but I need help from people with 64-bit and big-endian machines in order to verify the code works on those architectures.
* Using ByteStrings allows for trieing on any packed byte representation of a value, but they're not Strings. Integration with Data.ByteString.Char8, utf8-string, and utf8-light should happen.
* The current implementation also only accepts strict ByteStrings. When chopping up strings to share prefixes we share the smaller string. For very long strings when many deletions are expected, this can still hold onto more memory than necessary. Accepting lazy ByteStrings or adding heuristics for when to copy prefixes instead of sharing will fix this.
-------------------------------------------- -- Links --------------------------------------------
Homepage: http://code.haskell.org/~wren/
Hackage: http://hackage.haskell.org/cgi-bin/hackage-scripts/package/ bytestring-trie
Darcs: http://code.haskell.org/~wren/bytestring-trie/
Haddock (Darcs version): http://code.haskell.org/~wren/bytestring-trie/dist/doc/html/ bytestring-trie/
-- Live well, ~wren
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

-------------------------------------------- -- bytestring-trie 0.1.1 (bugfix) -------------------------------------------- An efficient finite map from (byte)strings to values. The implementation is based on big-endian patricia trees, like Data.IntMap. We first trie on the Word8 elements of a Data.ByteString, sharing string prefixes where possible, and then trie on the big-endian bit representation of those elements. Patricia trees have efficient algorithms for union and other merging operations, but they're also quick for lookups and insertions. -------------------------------------------- -- Changes -------------------------------------------- * Fixed a bug in lookupBy_ pointed out by Maxime Henrion. The bug affects all "lookup-like" functions when a prefix of the query matches only part of a shared prefix in the trie (e.g. looking for "fi" in a trie containing ["foo","bar","baz"], but not when looking up "fo", "food", or "bag"). * By popular demand Trie now has a Binary instance. This adds a new dependency on the binary package. The dependency shouldn't be onerous to anyone, but let me know if it is. -------------------------------------------- -- Links -------------------------------------------- Homepage: http://code.haskell.org/~wren/ Hackage: http://hackage.haskell.org/cgi-bin/hackage-scripts/package/bytestring-trie Darcs: http://code.haskell.org/~wren/bytestring-trie/ Haddock (Darcs version): http://code.haskell.org/~wren/bytestring-trie/dist/doc/html/bytestring-trie/ -- Live well, ~wren

Question and suggestion: looking at http://hackage.haskell.org/packages/archive/bytestring-trie/0.1.1/doc/html/s... I am questioning your choice of foldr in fromList:
-- | Convert association list into a trie. On key conflict, values -- earlier in the list shadow later ones. fromList :: [(KeyString,a)] -> Trie a fromList = foldr (uncurry insert) empty
-- | /O(1)/, The empty trie. {-# INLINE empty #-} empty :: Trie a empty = Empty
-- | Insert a new key. If the key is already present, overrides the -- old value {-# INLINE insert #-} insert :: KeyString -> a -> Trie a -> Trie a insert = alterBy (\_ x _ -> Just x)
-- | Generic function to alter a trie by one element with a function -- to resolve conflicts (or non-conflicts). alterBy :: (KeyString -> a -> Maybe a -> Maybe a) -> KeyString -> a -> Trie a -> Trie a alterBy f_ q_ x_ | S.null q_ = mergeBy (\x y -> f_ q_ x (Just y)) (singleton q_ x_) | otherwise = go q_ where
-- | /O(1)/, Is the trie empty? {-# INLINE null #-} null :: Trie a -> Bool null Empty = True null _ = False
So it looks like the reduction is fromList - uncurry insert - alterBy - null. Let me use insert in place of uncurry insert: fromList ( (a,1) : ( (b,2) : ( (c,3) : [] ) ) ) (a,1) `insert` ( (b,2) `insert` ( (c,3) `insert` Empty ) ) ) So fromList forces the whole call chain above to be traversed until it hits the Empty. For a large input list this will force the whole list to be allocated before proceeding AND the call chain might overflow the allowed stack size in ghc. For a large trie (which is a likely use case) this is a poor situation. If you use foldl' then the input list is only forced one element at a time. A small change to the lambda that insert passes to adjustBy will retain the same semantics of earlier key wins (which are an especially good idea in the foldl' case). Cheers, Chris
participants (3)
-
ChrisK
-
Sterling Clover
-
wren ng thornton