Fingerprinting Haskell Objects

Hello everybody, I have a little question I wanted to run by the folks here. I've run into it several times over the past few years and would love to lock down a good answer. What's the best way to "fingerprint" a Haskell object into, say, ByteString, so that this fingerprint can be used as the "lookup key" in a database (for example) and be trusted that it will remain constant over time even as the underlying libraries evolve? Here's a simple example: - Say I'm building a manual index on top of a key-value store (redis, dynamodb, etc.) - I want my keys to be arbitrary tuples (or similar records) that may contain various fields in them - I would like to avoid ad-hoc, hand-written MyTuple -> ByteString and ByteString -> MyTuple conversions. However, Generic derivations, template-haskell, etc. are acceptable - Notice how your fingerprint, which is used as a lookup key in the database, has to remain stationary. If it changes even by a single bit over time for the same MyTuple, the key-value store will NOT be able to find the index associated with MyTuple at this later time Here are some ideas (and related concepts) I've considered and used over the years: - Hand-write a "Prism' MyTuple ByteString". This works, but is tedious and error-prone. - Use Serialize/Binary and trust that the encode/decode pair will produce results consistently in 5 years (dangerous territory!) - Use SafeCopy, which is great for ensuring timeless decoding of the *value* in the index, but can we be sure that fingerprint (MyTuple -> ByteString) conversion is persistent? What if SafeCopy authors one day decide to encode tuples differently? They would write the migrations to transparently handle legacy code for *values*, but not for *keys*. Also notice here how migrations help with the ByteString -> MyTuple leg, but do not ensure MyTuple -> ByteString produces the same ByteString over time. - Hashable would've been nice, but there is NO guarantee of persistent results, even across multiple runs of the same code What would be your preferred solution? Thank you, Oz

Assuming the Generic instance is a stable interface, I would create a
traversal of that, feeding directly into a Blake2b-implementation (a fast
SHA3 finalist, tweaked).
This gives you a cryptographically strong fingerprint, space usage is
flexible (extract as many bytes as you want), is fast (~1GB/s), and with
low complexity/external dependencies.
Alexander
On Tue, Oct 7, 2014 at 10:30 PM, Ozgun Ataman
Hello everybody,
I have a little question I wanted to run by the folks here. I've run into it several times over the past few years and would love to lock down a good answer.
What's the best way to "fingerprint" a Haskell object into, say, ByteString, so that this fingerprint can be used as the "lookup key" in a database (for example) and be trusted that it will remain constant over time even as the underlying libraries evolve?
Here's a simple example:
- Say I'm building a manual index on top of a key-value store (redis, dynamodb, etc.)
- I want my keys to be arbitrary tuples (or similar records) that may contain various fields in them
- I would like to avoid ad-hoc, hand-written MyTuple -> ByteString and ByteString -> MyTuple conversions. However, Generic derivations, template-haskell, etc. are acceptable
- Notice how your fingerprint, which is used as a lookup key in the database, has to remain stationary. If it changes even by a single bit over time for the same MyTuple, the key-value store will NOT be able to find the index associated with MyTuple at this later time
Here are some ideas (and related concepts) I've considered and used over the years:
- Hand-write a "Prism' MyTuple ByteString". This works, but is tedious and error-prone.
- Use Serialize/Binary and trust that the encode/decode pair will produce results consistently in 5 years (dangerous territory!)
- Use SafeCopy, which is great for ensuring timeless decoding of the *value* in the index, but can we be sure that fingerprint (MyTuple -> ByteString) conversion is persistent? What if SafeCopy authors one day decide to encode tuples differently? They would write the migrations to transparently handle legacy code for *values*, but not for *keys*. Also notice here how migrations help with the ByteString -> MyTuple leg, but do not ensure MyTuple -> ByteString produces the same ByteString over time.
- Hashable would've been nice, but there is NO guarantee of persistent results, even across multiple runs of the same code
What would be your preferred solution?
Thank you, Oz
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

I’m not sure hashing is what is desired due to the ByteString -> MyTuple conversion that was mentioned.
–
Kyle Marek-Spartz
On Oct 7, 2014, 4:15:55 PM, Alexander Kjeldaas

I think the two problems should be separated. MyTuple -> ByteString should have the properties that the encoding is stable and 1:1. ByteString -> MyTuple should have the property that old data is readable, but it does not need to be 1:1. So any library that supports backwards compatibility can be used for the ByteString -> MyTuple conversion, including SafeCopy. If SafeCopy implements a new, faster encoding, that does not affect this conversion. For a key/value store, the hash of the object can be used as key, and the SafeCopy serialization can be stored in the value together with whatever else is required. Alexander On Wed, Oct 8, 2014 at 12:21 AM, Kyle Marek-Spartz < kyle.marek.spartz@gmail.com> wrote:
I’m not sure hashing is what is desired due to the ByteString -> MyTuple conversion that was mentioned.
– Kyle Marek-Spartz
On Oct 7, 2014, 4:15:55 PM, Alexander Kjeldaas < alexander.kjeldaas@gmail.com> wrote: ------------------------------
Assuming the Generic instance is a stable interface, I would create a traversal of that, feeding directly into a Blake2b-implementation (a fast SHA3 finalist, tweaked).
This gives you a cryptographically strong fingerprint, space usage is flexible (extract as many bytes as you want), is fast (~1GB/s), and with low complexity/external dependencies.
Alexander

On Tue, Oct 7, 2014 at 5:15 PM, Alexander Kjeldaas < alexander.kjeldaas@gmail.com> wrote:
Assuming the Generic instance is a stable interface, I would create a traversal of that, feeding directly into a Blake2b-implementation (a fast SHA3 finalist, tweaked).
This gives you a cryptographically strong fingerprint, space usage is flexible (extract as many bytes as you want), is fast (~1GB/s), and with low complexity/external dependencies.
Alexander
Thank you for the reply. My own thinking has been along similar lines of finding a stable serialization for the record. Hashing is certainly a nice way to compress the result - one I've used myself in the past. However, can we really consider Generic a stable interface?
participants (3)
-
Alexander Kjeldaas
-
Kyle Marek-Spartz
-
Ozgun Ataman