
Conal Elliott wrote:
Hi Wren,
I considered the idea of hashing, but not *perfect* hashing. I don't know how to hash perfectly with something like expressions, which have infinitely many values.
An imperfect hash can work. You'll need a memo table with a source of unique symbols (e.g. storing the next unused integer) in order to, effectively, make the "hash" function perfect[1]. If you have a source of unique symbols then you can also use a trie, Data.Map, or similar in lieu of a hash map. In a language with pointers (or stable names), the pointer is often used as the "hash" in conjunction with using the memo table as an intern table for smart constructors. Thus, client code can never observe that structurally equal expressions could have different hashes. [1] hash :: HashState m => Expr -> m Hash hash e = case lookupHS e of Just h -> return h Nothing -> do h <- nextH insertHS e h return h
Since it's stateful, that means the smart constructors may need to be in an appropriate monad/applicative for passing the memo table around (some hash functions may not need to store the table explicitly).
Hm -- stateful? Unless I'm misunderstanding, a stateful & monadic/applicative approach would break the simple functional interface I'm going for. Could well be I haven't formed a mental picture that matches yours.
Er, it's only stateful for the versions above that use pointers or a source of unique symbols (since they need to maintain a memo table). If you can come up with a perfect hash function[2], then there's no need to create/store the memo table at all, since it can be reconstructed on the fly. Since perfect hashing often isn't feasible, the stateful approximations to a perfect hash function are generally used. Sorry if I was unclear. If you don't mind unsafePerformIO (or similar hacks) then you can hide the state from the type system by using the same table for the whole program. Generally for hash cons'ing you want your tables to be as large as they can be (to maximize sharing) so this shouldn't be problematic. However, for languages with scoping it can be beneficial to use separate tables to recognize when expressions need to be recomputed; so the global store might want to be something like a stack of memo tables with fall-through lookup. I believe Applicative is powerful enough to capture the sort of state passing needed since the client code can't ever make decisions based on the state. So with smart constructors (to package up the <*> etc) I'd think you should be able to have an EDSL that looks nice, just with a more complicated type. Perhaps the issues are with mixing pure Haskell functions into the EDSL? ... The real trick behind hash cons'ing is everywhere substituting the "hash" in for the sub-expression, effectively flattening all expressions into a single ply. Thus, expression constructors "cons the hashes" rather than cons'ing expressions. It's similar in spirit to trie'ing, but from the bottom up in the same way that dynamic programming is done. The reason for wanting to do the hashing in smart constructors, as opposed to at the end, is to maximize the benefit of dynamic programming. If all client-visible expressions are represented by hashes, then any structure sharing in the Haskell layer is sharing the hash representation, thus you don't need to traverse the shared substructure multiple times. (If you hand construct equal expressions without sharing, then you'll have to traverse each expression to prove that they're equal, but you can use that proof (the hashes) thenceforth). For host languages with destructive updates (like Smalltalk's "become"), you can rewrite the subterms as you traverse them, so doing it at the end isn't too bad. If you only expose smart constructors then your Expr type can "recurse" as whatever Hash type. If you do the hashing at the end, then you'll need to define a catamorphism on Expr. ... This is probably similar to what you're doing in Pan, Vertigo, and Pajama (I haven't read it). The general technique is elegant in its simplicity, and it's not uncommon. Though, like most dynamic programming tricks, it seems not to be as widely known as I would assume, so I thought I'd mention it in case you've missed it. [2] Into any domain of terms that can quickly answer (==), namely flat terms like Integer. Using a bounded type like Int can give better performance guarantees, but there's only so many of them. -- Live well, ~wren