Let the key (digest of contents) _be_ the identity (was: hackage, cabal-get, and security)

23 May 2005

This was inspired by the discussion in [Libraries] on how to validate
the contents of a distributed package.

John Meacham wrote:
...
The whole idea of tying a key to an
existential identity is flawed, let the key _be_ the identity and all
problems go away.

Let me make a suggestion: why can't the following simple approach be
used as a basis?

Consider some "distribution unit"; OK, let's call it a "Package". It
has the property of being distributed independently, say, as a tarball
(or a zip file).

Right before packing, a digest is computed (say MD5 or anything else;
the choice doesn't really matter at this point) over the contents of
the directory (including subdirectories) that holds the distribution.
Then a tarball is created in the usual way.
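
The digest step above could be sketched as follows. This is only an
illustration under assumptions of my own: the function name
tree_digest is hypothetical, files are visited in sorted order so that
the result is reproducible, and each file contributes an md5sum-style
"digest  relative/path" line to the overall digest.

```python
import hashlib
import os


def tree_digest(root, algorithm="md5"):
    """Digest a directory tree (hypothetical helper, not an existing tool).

    Every regular file contributes one line "<file digest>  <relative path>",
    and the overall digest is taken over those lines in sorted order.
    """
    overall = hashlib.new(algorithm)
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # fix the traversal order so the result is reproducible
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            rel = os.path.relpath(path, root).replace(os.sep, "/")
            file_hash = hashlib.new(algorithm)
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(65536), b""):
                    file_hash.update(chunk)
            line = "%s  %s\n" % (file_hash.hexdigest(), rel)
            overall.update(line.encode("utf-8"))
    return overall.hexdigest()
```

Note that this sketch ignores empty directories, permissions, and
timestamps; whether those should count as "contents" is a design
decision left open here.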

The tarball itself is named after the digest: for example, if the
computed digest is d41d8cd98f00b204e9800998ecf8427e, then the file is
named "package_d41d8cd98f00b204e9800998ecf8427e.tar.gz".

When a developer wishes to announce a new package, he/she supplies
this digest-containing name (the digest having been calculated just
before the tarball was created) together with a "descriptive" name
(say, Package Foo version 8 revision 75). Such an announcement, posted
to one (or several) of the many web-archived mailing lists (the
haskell mailing list, for instance, or fm-announce for freshmeat),
will soon be indexed by various search engines. It will then be easy
to verify the identity (digest) of a package given its descriptive
name, even automatically, provided that all announcement messages
follow a uniform format. In case of ambiguity, priority must be given
to the earliest post found (as it was most likely made by the original
developer).
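
To make the "uniform format" and "earliest post wins" ideas concrete,
here is a rough sketch. The message format (Package:, Version:,
Revision:, and Distribution: lines) and both function names are
assumptions made up for this illustration; nothing of the kind has
been agreed upon.

```python
import re

# Hypothetical uniform announcement format, one "Field: value" per line.
FIELD = re.compile(
    r"^(Package|Version|Revision|Distribution):[ \t]*(.+?)[ \t]*$",
    re.MULTILINE,
)


def parse_announcement(text):
    """Return the recognized fields of one announcement as a dict."""
    return {key.lower(): value.strip() for key, value in FIELD.findall(text)}


def earliest_distribution(posts, package, version):
    """Pick the distribution name from the earliest matching post.

    posts is an iterable of (posted_at, body) pairs, where posted_at is
    any orderable timestamp. Returns None if nothing matches.
    """
    matches = []
    for posted_at, body in posts:
        fields = parse_announcement(body)
        if (fields.get("package") == package
                and fields.get("version") == version
                and "distribution" in fields):
            matches.append((posted_at, fields["distribution"]))
    if not matches:
        return None
    return min(matches, key=lambda m: m[0])[1]
```

How trustworthy the posting dates in a list archive are is a separate
question that this sketch does not address.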

Someone wishing to use a package locates and downloads the tarball
(by its name, which contains the digest). The tarball is then
unpacked and the digest recomputed: if it matches the one in the file
name, the file is valid; otherwise it is not. Additionally, the size
and checksum of the tarball itself may be added to the announcement
message.
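
The two checks just described might look roughly like this. The
function names are hypothetical, the file name pattern simply follows
the "package_<digest>.tar.gz" example, and the recomputed digest is
assumed to come from whatever directory-digest routine the packager
used.

```python
import hashlib
import os
import re

# File name pattern taken from the example: package_<32 hex digits>.tar.gz
NAME = re.compile(r"^package_([0-9a-f]{32})\.tar\.gz$")


def digest_from_name(filename):
    """Extract the digest embedded in a distribution file name, or None."""
    m = NAME.match(os.path.basename(filename))
    return m.group(1) if m else None


def contents_match_name(filename, recomputed_digest):
    """True if the digest recomputed after unpacking equals the one in the name."""
    expected = digest_from_name(filename)
    return expected is not None and expected == recomputed_digest.lower()


def tarball_matches_announcement(path, announced_size, announced_md5):
    """Optional extra check: size and MD5 checksum of the tarball file itself."""
    if os.path.getsize(path) != announced_size:
        return False
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest() == announced_md5.lower()
```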

So the only question I cannot answer (not being a crypto expert) is
whether there exists a digest algorithm that makes it nearly
impossible to supply a distribution file with different contents but
the same size and checksum which, after unpacking, still yields the
same digest value.

With this approach, there is no need for centralized storage of
packages: the Internet itself becomes the storage, and the package
database is replaced by existing search engines, given that we are
able to make search requests programmatically (see my earlier post in
[Haskell-cafe] on locating Cabal files with Google). So, to locate a
package, one searches (googles) for its descriptive name: the result
is the URLs of all cabal files indexed by the search engine that
contain such a name. The next step is to download the cabal file and
find the distribution tarball file name in it.

So, instead of going into cryptography problems, I am suggesting to
use the contents of a package for self-validation (via a computed
digest), and to use a "public record", such as an internet-archived
mailing list message, to establish the relationship between the
"descriptive" name (which is human-readable) and the "distribution"
name (which contains a digest to verify against).

Implementing this approach does not need any extra software, except
maybe a wrapper around md5sum or the like to digest a whole directory
with subdirectories, and a few other shell scripts.

Perhaps a tag containing the digest needs to be added to the cabal
file syntax (if one does not exist already; at least I couldn't find
one in the documentation at haskell.org/cabal). The distribution file
name may then be derived from that digest.
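
As an illustration of deriving the file name, suppose the cabal file
carried a line such as "digest: d41d8cd9...". The "digest:" field is
purely hypothetical (as said above, I could not find such a tag), and
so is the function name; the sketch only shows how little is needed.

```python
import re

# Hypothetical "digest:" field; not part of the documented cabal syntax.
DIGEST_FIELD = re.compile(
    r"^digest:[ \t]*([0-9a-fA-F]{32})[ \t]*$",
    re.MULTILINE | re.IGNORECASE,
)


def distribution_name(cabal_text, prefix="package"):
    """Derive "<prefix>_<digest>.tar.gz" from the digest field, or None."""
    m = DIGEST_FIELD.search(cabal_text)
    if m is None:
        return None
    return "%s_%s.tar.gz" % (prefix, m.group(1).lower())
```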

-- 
Dimitry Golubovsky

Anywhere on the Web