Adding Content-Addressable Storage to GHC

Hi all, Is there any effort or designs ongoing to add CAS (content-addressable storage) to GHC, as in Unison? < https://www.unisonweb.org/docs/tour/> == The idea == The summary of the idea is simply that top-level declarations can be addressed by a hash of their contents. Recursive definitions are transformed into the worker/wrapper to eliminate the self-referencing issue of hashing. == Why I want this == There are lots of advantages to this, but the one that excites me the most is that we can move to running tests, especially property tests, at compile-time. The main downside to running tests at compile-time, as seen done with template-haskell is that you will re-run tests every time the module is recompiled, making your dev cycle slower. However, if your tests are keyed upon CAS hashes, then those hashes are only invalidated when individual declarations actually change. This means the re-running of tests becomes granular at the declaration-level. When a single test completes, either successfully or not, you can cache the result and lookup the result next time, using e.g. the SHA512 of the expression evaluated. Therefore you could change a single function in a library and it would only re-run the tests that are actually affected, rather than running all the tests in the whole module, and rather than the more typical approach which is running ALL tests in a test suite just because one thing changed. If you can couple tests with code then you can avoid the decoupling of code from the tests. == Implementation approaches == There are various ways to implement this with varying degrees of satisfaction: 1. Use TH: reify declarations, inspect the AST, and produce a SHA512. Use ambient values such as the GHC version, instances in scope, extensions, ghc options, etc. With TH, I'm confident that you can only achieve an imperfect hash because I doubt that all information is available to TH. Names that come from external packages could be treated as CAS'd at the scope of the package's installed hash. Ideally, you could have granularity into other packages. But it's not a necessity if you just want caching for your current development package. 2. Use a source plugin. A source plugin is already capable of accessing all GHC context information, so this might lead to more of a perfect hash. 3. Add it to GHC directly. Exposing a `expressionSHA512 :: Exp -> ByteString` could be one imaginary way to access this information. With such a function you could implement caching of fine-grained tests. A related discussion is the deterministic builds: https://gitlab.haskell.org/ghc/ghc/wikis/deterministic-builds Anyone else exploring this? Cheers, Chris

I am not exploring, but watching with great interest. And may not be able
to resist jumping in if something comes of it.
Alan
On Wed, 18 Mar 2020 at 11:23, Chris Done
Hi all,
Is there any effort or designs ongoing to add CAS (content-addressable storage) to GHC, as in Unison? < https://www.unisonweb.org/docs/tour/>
== The idea ==
The summary of the idea is simply that top-level declarations can be addressed by a hash of their contents. Recursive definitions are transformed into the worker/wrapper to eliminate the self-referencing issue of hashing.
== Why I want this ==
There are lots of advantages to this, but the one that excites me the most is that we can move to running tests, especially property tests, at compile-time.
The main downside to running tests at compile-time, as seen done with template-haskell is that you will re-run tests every time the module is recompiled, making your dev cycle slower. However, if your tests are keyed upon CAS hashes, then those hashes are only invalidated when individual declarations actually change. This means the re-running of tests becomes granular at the declaration-level. When a single test completes, either successfully or not, you can cache the result and lookup the result next time, using e.g. the SHA512 of the expression evaluated.
Therefore you could change a single function in a library and it would only re-run the tests that are actually affected, rather than running all the tests in the whole module, and rather than the more typical approach which is running ALL tests in a test suite just because one thing changed.
If you can couple tests with code then you can avoid the decoupling of code from the tests.
== Implementation approaches ==
There are various ways to implement this with varying degrees of satisfaction:
1. Use TH: reify declarations, inspect the AST, and produce a SHA512. Use ambient values such as the GHC version, instances in scope, extensions, ghc options, etc. With TH, I'm confident that you can only achieve an imperfect hash because I doubt that all information is available to TH.
Names that come from external packages could be treated as CAS'd at the scope of the package's installed hash. Ideally, you could have granularity into other packages. But it's not a necessity if you just want caching for your current development package.
2. Use a source plugin. A source plugin is already capable of accessing all GHC context information, so this might lead to more of a perfect hash.
3. Add it to GHC directly. Exposing a `expressionSHA512 :: Exp -> ByteString` could be one imaginary way to access this information. With such a function you could implement caching of fine-grained tests.
A related discussion is the deterministic builds: https://gitlab.haskell.org/ghc/ghc/wikis/deterministic-builds
Anyone else exploring this?
Cheers,
Chris
_______________________________________________ Haskell-Cafe mailing list To (un)subscribe, modify options or view archives go to: http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe Only members subscribed via the mailman list are allowed to post.

I slightly feel like we should first upgrade the typeable hashing rep from
md5 to sha3-256 or something like that (though i've talked about how
finding a compilable pair of collisions would make for a fun april first
package ), but I do think theres certainly some really cool things if we
think about that sort of direction carefully. I'm not aware of any efforts
in this direction atm
On Wed, Mar 18, 2020 at 2:05 PM Alan & Kim Zimmerman
I am not exploring, but watching with great interest. And may not be able to resist jumping in if something comes of it.
Alan
On Wed, 18 Mar 2020 at 11:23, Chris Done
wrote: Hi all,
Is there any effort or designs ongoing to add CAS (content-addressable storage) to GHC, as in Unison? < https://www.unisonweb.org/docs/tour/>
== The idea ==
The summary of the idea is simply that top-level declarations can be addressed by a hash of their contents. Recursive definitions are transformed into the worker/wrapper to eliminate the self-referencing issue of hashing.
== Why I want this ==
There are lots of advantages to this, but the one that excites me the most is that we can move to running tests, especially property tests, at compile-time.
The main downside to running tests at compile-time, as seen done with template-haskell is that you will re-run tests every time the module is recompiled, making your dev cycle slower. However, if your tests are keyed upon CAS hashes, then those hashes are only invalidated when individual declarations actually change. This means the re-running of tests becomes granular at the declaration-level. When a single test completes, either successfully or not, you can cache the result and lookup the result next time, using e.g. the SHA512 of the expression evaluated.
Therefore you could change a single function in a library and it would only re-run the tests that are actually affected, rather than running all the tests in the whole module, and rather than the more typical approach which is running ALL tests in a test suite just because one thing changed.
If you can couple tests with code then you can avoid the decoupling of code from the tests.
== Implementation approaches ==
There are various ways to implement this with varying degrees of satisfaction:
1. Use TH: reify declarations, inspect the AST, and produce a SHA512. Use ambient values such as the GHC version, instances in scope, extensions, ghc options, etc. With TH, I'm confident that you can only achieve an imperfect hash because I doubt that all information is available to TH.
Names that come from external packages could be treated as CAS'd at the scope of the package's installed hash. Ideally, you could have granularity into other packages. But it's not a necessity if you just want caching for your current development package.
2. Use a source plugin. A source plugin is already capable of accessing all GHC context information, so this might lead to more of a perfect hash.
3. Add it to GHC directly. Exposing a `expressionSHA512 :: Exp -> ByteString` could be one imaginary way to access this information. With such a function you could implement caching of fine-grained tests.
A related discussion is the deterministic builds: https://gitlab.haskell.org/ghc/ghc/wikis/deterministic-builds
Anyone else exploring this?
Cheers,
Chris
_______________________________________________ Haskell-Cafe mailing list To (un)subscribe, modify options or view archives go to: http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe Only members subscribed via the mailman list are allowed to post.
_______________________________________________ Haskell-Cafe mailing list To (un)subscribe, modify options or view archives go to: http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe Only members subscribed via the mailman list are allowed to post.

On Wed, Mar 18, 2020 at 11:21:29AM +0000, Chris Done wrote:
Is there any effort or designs ongoing to add CAS (content-addressable storage) to GHC, as in Unison? < https://www.unisonweb.org/docs/tour/> == The idea == The summary of the idea is simply that top-level declarations can be addressed by a hash of their contents. Recursive definitions are transformed into the worker/wrapper to eliminate the self-referencing issue of hashing.
I started wanting this recently, although I can't remember why. As far as I can see there are two distinct things that one might want 1. To be able to come up with a hash for each expression. 2. To be able to look up that hash somewhere to recover the expression. I was only interested in 1. Please correct me if I'm wrong but it seems from your description that your use case only requires 1 too. Anyway, I think this is a great idea. Tom
participants (4)
-
Alan & Kim Zimmerman
-
Carter Schonwald
-
Chris Done
-
Tom Ellis