
Hi - I seem to remember from some discussion long ago that there is a Data.Atom module somewhere (hopefully with a BSD licence), but a search on Hoogle turned up no results so maybe I'm mistaken. What I'm looking for is an Atom type and the following operations: fromString :: String -> Atom toString :: Atom -> String instance Eq Atom instance Ord Atom -- this is where things get difficult! It is particularly important that the ordering comparison is extremely fast, preferably independent of the length of the original String, and that it preserves the lexicographic ordering between the original Strings. I can see that an "unsafe" global ref to a Trie of Char with Unique as the "value" of a node would allow me to implement fromString, toString, and instance Eq Atom, but I've got no idea how to implement instance Ord Atom so that the order is independent of the order in which Atoms are created and exactly the same as the lexicographic ordering of the String without being O(n) where n is the min of the lengths of the Atoms being compared. I'm also hoping that Atoms which are no longer in use would manage to magically vanish by themselves. In a Trie implementation, this would mean maintaining an invariant that leaf nodes are always held by a weak link and internal nodes by a strong link. Any ideas (or pointers to a nice downloadable module that already does all this :-) )? Thanks, Brian. -- Logic empowers us and Love gives us purpose. Yet still phantoms restless for eras long past, congealed in the present in unthought forms, strive mightily unseen to destroy us. http://www.metamilk.com

Hello Brian, Sunday, July 2, 2006, 10:58:29 PM, you wrote:
fromString :: String -> Atom toString :: Atom -> String
instance Eq Atom instance Ord Atom -- this is where things get difficult!
i think that ByteString is a very strong candidate to Atom. `memicmp` is very fast operation, unless you plan to use really large strings with the same beginnings -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

On Sun, 2006-07-02 at 23:08 +0400, Bulat Ziganshin wrote:
Hello Brian,
Sunday, July 2, 2006, 10:58:29 PM, you wrote:
fromString :: String -> Atom toString :: Atom -> String
instance Eq Atom instance Ord Atom -- this is where things get difficult!
i think that ByteString is a very strong candidate to Atom. `memicmp` is very fast operation, unless you plan to use really large strings with the same beginnings
and also equal lengths. The nice thing about the ByteString representation is that we can use the length to short-cut inequality and use equality of the pointers to short-cut equality: eq :: ByteString -> ByteString -> Bool eq a@(PS p s l) b@(PS p' s' l') | l /= l' = False -- short cut on length | p == p' && s == s' = True -- short cut for the same string | otherwise = compareBytes a b == EQ So the worst case is long strings that compare equal but have different memory blocks. Duncan

Hello Duncan, Monday, July 3, 2006, 12:50:14 AM, you wrote:
instance Ord Atom -- this is where things get difficult!
and also equal lengths.
The nice thing about the ByteString representation is that we can use the length to short-cut inequality and use equality of the pointers to short-cut equality:
see first line of quote :) -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

Bulat Ziganshin wrote:
Hello Brian,
Sunday, July 2, 2006, 10:58:29 PM, you wrote:
fromString :: String -> Atom toString :: Atom -> String
instance Eq Atom instance Ord Atom -- this is where things get difficult!
i think that ByteString is a very strong candidate to Atom. `memicmp` is very fast operation, unless you plan to use really large strings with the same beginnings
Thanks - I think I'll use ByteString as the implementation of Atom for the moment, Brian. -- Logic empowers us and Love gives us purpose. Yet still phantoms restless for eras long past, congealed in the present in unthought forms, strive mightily unseen to destroy us. http://www.metamilk.com

On 02/07/06, Brian Hulley
I can see that an "unsafe" global ref to a Trie of Char with Unique as the "value" of a node would allow me to implement fromString, toString, and instance Eq Atom, but I've got no idea how to implement instance Ord Atom so that the order is independent of the order in which Atoms are created and exactly the same as the lexicographic ordering of the String without being O(n) where n is the min of the lengths of the Atoms being compared.
Isn't compare on strings O(n) anyway? I suppose it would actually be O(d), where d is the index of the first difference, but that's O(n) worst-case. Even equality (a special case of compare) on strings (well, [Char], to be more precise) involves traversing the list and so is O(n) worst-case, no? -- -David House, dmhouse@gmail.com

David House wrote:
On 02/07/06, Brian Hulley
wrote: I can see that an "unsafe" global ref to a Trie of Char with Unique as the "value" of a node would allow me to implement fromString, toString, and instance Eq Atom, but I've got no idea how to implement instance Ord Atom so that the order is independent of the order in which Atoms are created and exactly the same as the lexicographic ordering of the String without being O(n) where n is the min of the lengths of the Atoms being compared.
Isn't compare on strings O(n) anyway? I suppose it would actually be O(d), where d is the index of the first difference, but that's O(n) worst-case. Even equality (a special case of compare) on strings (well, [Char], to be more precise) involves traversing the list and so is O(n) worst-case, no?
If an Atom was just represented by an Int, then construction would be O(n) where n is the length of the String (using a Trie with Int as the data associated with each node), and == and /= would be O(1). If the Int's could somehow be chosen clairvoyantly then < would also be O(1). However since we can't see into the future (to know which new Atoms will be created) afaics the problem of correctly choosing these Int's is completely impossible! Of course an Integer could be used (since it can be of any length) but it's likely that any systematic way of assigning them would simply encode the string in the Integer leading back to O(n) so in terms of speed for < there would be no advantage over ByteString's. So perhaps my original spec is impossible to implement, though it is an open question whether some very clever encoding (with corresponding implementation of <) could be found which would lead to a better average performance (whatever that means). An alternative design for an atom module could be: create :: MonadIO m => String -> m Atom toString :: Atom -> String instance Eq Atom -- O(1) instance Ord Atom -- O(1) but depends on creation order but here the < would not be lexicographic, so although it would be useful for implementing symbol tables, environments etc it's not ideal for GUI use (eg when displaying a tree of modules where everything should be listed alphabetically). As an aside, if the monad was removed then the result of atom "a" < atom "b" (atom :: String -> Atom) could not be determined by analysis of the program. It would depend on the evaluation order chosen by the compiler, but in a sense this doesn't matter because whatever the result is, it would be the same at any future time during the same run of the program so the use of Atoms as keys would still be safe. But is this still "functional"? Regards, Brian. -- Logic empowers us and Love gives us purpose. Yet still phantoms restless for eras long past, congealed in the present in unthought forms, strive mightily unseen to destroy us. http://www.metamilk.com

Brian Hulley wrote:
So perhaps my original spec is impossible to implement, though it is an open question whether some very clever encoding (with corresponding implementation of <) could be found which would lead to a better average performance (whatever that means).
An alternative design for an atom module could be:
create :: MonadIO m => String -> m Atom toString :: Atom -> String
instance Eq Atom -- O(1) instance Ord Atom -- O(1) but depends on creation order
but here the < would not be lexicographic, so although it would be useful for implementing symbol tables, environments etc it's not ideal for GUI use (eg when displaying a tree of modules where everything should be listed alphabetically).
Eureka! There *is* a way to get O(1) lexicographic comparisons of Atoms :-) although at the expense of O(n + k) but possibly amortized O(n + ?) creation time where n is the length of the string and k is the number of Atoms in existence at any given time. The previous reason for needing to be clairvoyant in the choice of Unique (or Integer or Int) to represent an Atom was to allow expressions such as atom "a" < atom "b" because when the code has determined the representation for the first atom the creation of the second atom mustn't be allowed to change it. However with monadic creation, such expressions can't be written, and therefore it would be possible to represent atoms as follows: newtype Atom = Atom (IORef Int) -- or IORef Integer and when new atoms are created, we could simply update the Int's referenced by other atoms to ensure that the relative ordering is still lexicographic. If we assume that it's very unlikely we'd need more than 2^31 atoms to exist at any given time (if this assumption is wrong we could use Integer instead of Int) and that usually we'd have a lot less, the Int's could be allocated with gaps between them so that we'd only occasionally need to adjust the Int's of the atoms to the left or right when a new atom is inserted in the table, and in all cases it's extremely unlikely that we'd need to adjust the representation of all the other Atoms (hence amortized O(n) hopefully). The comparison would then be: compare (Atom l) (Atom r) = unsafePerformIO $ do li <- readIORef l ri <- readIORef r return (compare li ri) which is safe as long as creation of atoms is not allowed inside unsafePerformIO (it would be nice if there was a way to tell the typechecker that a specific action is not allowed in unsafe IO) Still there would be some tricky details to work out eg how to ensure that the average number of Integer's that need to be updated on creation of each new atom is as small as possible, so I'll leave this as an exercise for the reader!!! :-) Essentially it depends on how well the following can be implemented: data OrderingToken = ??? instance Eq OrderingToken instance Ord OrderingToken create :: MonadIO m => m OrderingToken -- x < y |- createBetween x y == z where x < z and z < y createBetween :: MonadIO m => OrderingToken -> OrderingToken -> m (OrderingToken) It would be interesting to investigate what the theoretical tradeoffs would be between the complexities of the above functions (assuming we want (<) to be O(1)) and whether or not they require to be monadic. Anyway apologies for rambling on, Brian. -- Logic empowers us and Love gives us purpose. Yet still phantoms restless for eras long past, congealed in the present in unthought forms, strive mightily unseen to destroy us. http://www.metamilk.com

Brian Hulley wrote:
The comparison would then be:
compare (Atom l) (Atom r) = unsafePerformIO $ do li <- readIORef l ri <- readIORef r return (compare li ri)
which is safe as long as creation of atoms is not allowed inside unsafePerformIO (it would be nice if there was a way to tell the typechecker that a specific action is not allowed in unsafe IO)
That is highly unsafe with respect to multiple threads of evaluation. You could be changing the li and ri value in between reading them. Locking will be needed, assuming there is any sane way to manage locks inside unsafePerformIO.

G'day all.
Quoting Brian Hulley
So perhaps my original spec is impossible to implement, though it is an open question whether some very clever encoding (with corresponding implementation of <) could be found which would lead to a better average performance (whatever that means).
For what it's worth: - It's not such a dumb idea to separate equality testing from less-than testing. A (Unique,ByteString) pair makes the former fast and the latter pretty fast when it's needed. - Arithmetic coding preserves lexicographic ordering. If you have a good statistical model for what sort of strings might be used as atoms, it may be possible to construct a representation of atoms which fit in machine 32- or 64-bit word most of the time. (This is similar to the way that GHC represents Integers; use a machine word when it fits.) - Hash table implementations rely on using a cheap test first (comparing hash codes), going to a full comparison only if the cheap test doesn't rule out what you're looking for.
As an aside, if the monad was removed then the result of atom "a" < atom "b" (atom :: String -> Atom) could not be determined by analysis of the program. It would depend on the evaluation order chosen by the compiler, but in a sense this doesn't matter because whatever the result is, it would be the same at any future time during the same run of the program so the use of Atoms as keys would still be safe. But is this still "functional"?
Questions of this nature produce much disagreement. Once upon a time, program arguments and environment variables were passed to a Haskell program using a similar mechanism. I think we got rid of this because it's not "functional" if you consider programs that fork. Cheers, Andrew Bromage

Hi
module somewhere (hopefully with a BSD licence), but a search on Hoogle turned up no results so maybe I'm mistaken.
Just so people know, not finding a result in Hoogle is not a good indicator for something not existing. Hoogle doesn't index things for various reasons: * Not in the standard GHC libraries * I got a parse error when trying to produce the HTML generated by Haddock (this is about 5% of libraries) * They have a large interface which overlaps with other names, and I decided to delete them (OpenGL, Win32 - I think, I may have put these back) * Not on Windows, i.e. Posix * A bug in Hoogle - more common that you'd think :) Of course, I'm working on fixing all these issues and more, but for the moment just use Hoogle to find things, not find the absence of things. I saw this done on Haskell IRC the other day as well. Remember, Hoogle is still unreleased ever :) Thanks Neil

G'day all.
Quoting Brian Hulley
I can see that an "unsafe" global ref to a Trie of Char with Unique as the "value" of a node would allow me to implement fromString, toString, and instance Eq Atom, but I've got no idea how to implement instance Ord Atom so that the order is independent of the order in which Atoms are created and exactly the same as the lexicographic ordering of the String without being O(n) where n is the min of the lengths of the Atoms being compared.
Unique would give you an instance of Ord Atom, it just wouldn't give you lexicographic ordering. Whether you need lexicographic ordering or not largely depends on what you want atoms for. In X11, for example, atoms basically form a user-extensible enumeration type. X11 atoms do support ordering, but the ordering is arbitrary by design. The reason is that it is very important that atoms are fixed- size to make manipulation and incorporation into network protocol messages easier; that's their whole reason for existing. Lisp/Prolog atoms, on the other hand, do need lexicographic ordering. The point is that atoms which use the standard ordering that comes with Uniques isn't wrong, though it may not quite be what you want.
I'm also hoping that Atoms which are no longer in use would manage to magically vanish by themselves. In a Trie implementation, this would mean maintaining an invariant that leaf nodes are always held by a weak link and internal nodes by a strong link.
This would, of course, be inappropriate behaviour if you used arbitrary ordering for your atoms. Cheers, Andrew Bromage

Another suggestion: Put your strings in an ordered binary tree (other data structures might also work here). Make your Atom an encoding of the structure of the tree (resp. other structure). This is logically a sequence of bits, 0 for left (less than), 1 for right (greater than) - if you terminate this variable-length sequence with 1, I think it all works out OK. You then encode this sequence of bits in a Word32 (or some other convenient item), with implicit trailing zeros. root ~> 1[000...] root.left ~> 01[000...] root.right ~> 11[000...] ... There are clearly some questions such as how unbalanced is the tree likely to get (since you can't conveniently rebalance it), and what to do if your Atom "overflows", but this might give you something to work with. -- Iain Alexander ia@stryx.demon.co.uk

Iain Alexander wrote:
Another suggestion:
Put your strings in an ordered binary tree (other data structures might also work here).
Make your Atom an encoding of the structure of the tree (resp. other structure). This is logically a sequence of bits, 0 for left (less than), 1 for right (greater than) - if you terminate this variable-length sequence with 1, I think it all works out OK. You then encode this sequence of bits in a Word32 (or some other convenient item), with implicit trailing zeros.
root ~> 1[000...] root.left ~> 01[000...] root.right ~> 11[000...] ...
There are clearly some questions such as how unbalanced is the tree likely to get (since you can't conveniently rebalance it), and what to do if your Atom "overflows", but this might give you something to work with.
Thanks for the suggestion. However I think that just using a binary tree would be equivalent to just using the representation of the original string as a sequence of bits (except for "early out" eg representing 0b00001111 as just 1111 etc), and any other systematic encoding of strings that preserved all the info necessary for lexicographic ordering (that still works when new atoms are created later) would also end up using at least the same number of bit comparisons on average (since a new information preserving representation is just a permutation of the original at the bit level - if some strings are represented with fewer bits others would need more bits so it would all balance out). The only way I can see of getting truly O(1) lexicographic comparisons (by this I mean O(log P) where P is the maximum number of atoms that can exist simultaneously at any point during program execution) would be to have a "floating" representation (accessed via an IORef) where the concrete values of existing atoms are changed as necessary behind the scenes as new atoms are created, but as Chris pointed out this would require a locking operation if the atoms were used in a multi-threaded program, so all in all, it may well be much faster to just use ByteStrings especially if the Atoms are just to be used to represent fairly short strings such as module names or natural language words etc. Still I think this would be an interesting project - Prolog style Atoms with O(1) lexicographic ordering and O(n + f k) creation time where k is the size of the existing population and f is some function which minimizes the amortized time... Regards, Brian. -- Logic empowers us and Love gives us purpose. Yet still phantoms restless for eras long past, congealed in the present in unthought forms, strive mightily unseen to destroy us. http://www.metamilk.com

Brian Hulley wrote:
Iain Alexander wrote:
Another suggestion:
Put your strings in an ordered binary tree (other data structures might also work here).
Make your Atom an encoding of the structure of the tree (resp. other structure). This is logically a sequence of bits, 0 for left (less than), 1 for right (greater than) - if you terminate this variable-length sequence with 1, I think it all works out OK. You then encode this sequence of bits in a Word32 (or some other convenient item), with implicit trailing zeros.
root ~> 1[000...] root.left ~> 01[000...] root.right ~> 11[000...] ...
There are clearly some questions such as how unbalanced is the tree likely to get (since you can't conveniently rebalance it), and what to do if your Atom "overflows", but this might give you something to work with.
Thanks for the suggestion. However I think that just using a binary tree would be equivalent to just using the representation of the original string as a sequence of bits (except for "early out" eg representing 0b00001111 as just 1111 etc), and any other systematic encoding of strings that preserved all the info necessary for lexicographic ordering (that still works when new atoms are created later) would also end up using at least the same number of bit comparisons on average (since a new information preserving representation is just a permutation of the original at the bit level - if some strings are represented with fewer bits others would need more bits so it would all balance out).
The main thing is that speeding up Ord is a "memo" problem, not a "clever encoding of atoms" problem. Your fromString function seems like an IO action. It cannot be a pure function, except for identity or a hash function. (Though hashing to Int will at least speed up equality testing.). If you use unsafePerformIO to make it seem like a pure function then you need locking even if you think you have a single CPU and thread, since the runtime may assume it can evaluate expressions in weird ways. This is the "unsafe at any speed" issue in Haskell. You might turn strings s1 and s2 into atoms a1 and a2 cheaply by returning the next highest unused Word32. Keep an array of Atom->String, (see http://haskell.org/haskellwiki/Library/ArrayRef#Using_dynamic_.28resizable.2... for dynamic arrays). Call the # of atoms "N_A". If you need fromString to return the same atom if given the same string then you have to keep a tree and search down through log(N_A) strings comparisons just to assign a new atom. Otherwise this is O(1). To compare atoms a1 and a2 you have to compare the original strings s1 and s2. This original compare is O(min(length(s1),length(s2))). If you cache this then you make future comparisons faster, either O(1) or O(log(N_A)). Perhaps: (compare a1 a2) could look up the answer in an "immutable" 2D array with index (a1,a2) for which the lazy value will then be computed once (and then cached). The actual comparison would lookup up the original s1 and s2 via the global array (or map) and do the comparison once. This memo table is the explicit space/time trade off. The Atom->String array and/or the 2D memo-array could be replaced with something else, for the usual trade offs. And the only thread & locking issues come from generating the next atom without risking double-assignment. -- Chris
participants (8)
-
ajb@spamcop.net
-
Brian Hulley
-
Bulat Ziganshin
-
Chris Kuklewicz
-
David House
-
Duncan Coutts
-
Iain Alexander
-
Neil Mitchell