
This was inspired by the discussion in [Libraries] on how to validate the contents of a distributed package. John Meacham wrote:
The whole idea of tying a key to an existential identity is flawed, let the key _be_ the identity and all problems go away.
Let me make a suggestion: why can't the following simple approach be used as a basis?

Consider some "distribution unit"; let's call it a "Package". It is distributed independently, say as a tarball (or a zip file). Right before packing, a digest is computed (MD5 or whatever else; it doesn't really matter at this point) over the contents of the directory (with subdirectories) containing the distribution. Then the tarball is created in the traditional way and named after the digest: for example, if the computed digest is d41d8cd98f00b204e9800998ecf8427e, the file is named "package_d41d8cd98f00b204e9800998ecf8427e.tar.gz".

When a developer wishes to announce a new package, he/she supplies this digest-containing name (the digest being calculated just before the tarball is created) together with a "descriptive" name (say, Package Foo version 8 revision 75). Such an announcement, posted to one or several of the many web-enabled mailing lists (the haskell mailing list, for instance, or fm-announce for freshmeat), will soon be indexed by various search engines. It will then be easy to verify the identity (digest) of a package given its descriptive name, even in an automated way, provided that all announcement messages follow a uniform format. In case of ambiguity, priority must be given to the earliest post found, as it was most likely made by the original developer.

Someone who wants to use a package locates and downloads the tarball (by its name, which contains the digest), unpacks it, and computes the digest of the unpacked contents: if it matches the one in the file name, the file is valid; otherwise it is not. Additionally, the size and checksum of the tarball may be added to the announcement message.
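To make the steps above concrete, here is a minimal sketch in Python of the digest, naming, and verification steps. It rests on several assumptions that are not part of the proposal itself: the function names `directory_digest`, `pack`, and `verify` are hypothetical; the digest covers relative paths, file sizes, and file contents in sorted order (empty directories, permissions, and symlinks are ignored); and SHA-256 is used by default where the text mentions MD5, since the choice of algorithm is left open.

```python
import hashlib
import os
import re
import tarfile
import tempfile


def directory_digest(root, algorithm="sha256"):
    """Digest relative paths, sizes, and contents of all files under root."""
    h = hashlib.new(algorithm)
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # sorting in place makes the traversal order stable
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            rel = os.path.relpath(path, root).replace(os.sep, "/")
            size = os.path.getsize(path)
            # Path and size are hashed first so file boundaries are unambiguous.
            h.update(f"{rel}\0{size}\0".encode("utf-8"))
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(65536), b""):
                    h.update(chunk)
    return h.hexdigest()


def pack(root, out_dir, prefix="package", algorithm="sha256"):
    """Create <prefix>_<digest>.tar.gz in out_dir and return its path."""
    digest = directory_digest(root, algorithm)
    tarball = os.path.join(out_dir, f"{prefix}_{digest}.tar.gz")
    with tarfile.open(tarball, "w:gz") as tar:
        # Everything is stored under a single top-level directory.
        tar.add(root, arcname=prefix)
    return tarball


def verify(tarball, algorithm="sha256"):
    """Unpack the tarball and compare its digest with the one in its name."""
    match = re.search(r"_([0-9a-f]+)\.tar\.gz$", os.path.basename(tarball))
    if match is None:
        return False
    with tempfile.TemporaryDirectory() as tmp:
        with tarfile.open(tarball, "r:gz") as tar:
            # A real tool should extract untrusted archives more carefully,
            # e.g. with the filter="data" argument on recent Python versions.
            tar.extractall(tmp)
        entries = os.listdir(tmp)
        if len(entries) != 1:
            return False
        unpacked = os.path.join(tmp, entries[0])
        return directory_digest(unpacked, algorithm) == match.group(1)
```

With these definitions, a developer would call something like `pack("foo-8.75", "dist")` and announce the resulting file name; a user would call `verify` on the downloaded file. Note that the digest is taken over the unpacked contents rather than over the tarball bytes, so recompressing the archive does not change the name.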
So the only question I cannot answer (not being a crypto expert) is whether a digest algorithm exists that makes it nearly impossible to supply a distribution file with different contents but the same size and checksum that, after unpacking, still yields the same digest value.

With this approach there is no need for centralized package storage: the Internet itself becomes the storage, and the package database is replaced by existing search engines, provided we can issue search requests programmatically (see my earlier post in [Haskell-cafe] on locating Cabal files with Google). So, to locate a package, one searches (googles) for its descriptive name: the result is the URLs of all Cabal files indexed by the search engine that contain that name. The next step is to download the Cabal file and find the name of the distribution tarball in it.

So, instead of going into cryptography problems, I am suggesting that we use the contents of a package for self-validation (via the computed digest), and use the "public record", such as an Internet-archived mailing list message, to establish the truth about the relationship between the "descriptive" name (which is human-readable) and the "distribution" name (which contains a digest to verify).

Implementing this approach needs no extra software beyond a few shell scripts (and perhaps a wrapper around md5sum or the like to digest a whole directory with subdirectories). Perhaps a tag containing the digest needs to be added to the Cabal file syntax (if one does not already exist; at least I couldn't find it in the documentation at haskell.org/cabal). The distribution file name may then be derived from that digest.

--
Dimitry Golubovsky
Anywhere on the Web