Parsing cabal files to calculate average number of dependencies

Athas on #haskell wondered how many dependencies the average Haskell package had. I commented that it seemed like some fairly simple scripting to find out, and as these things tend to go, I wound up doing a complete solution myself. First, we get most/all of Hackage locally to examine, as tarballs: for package in `cabal list | grep '\*' | tr -d '\*'`; do cabal fetch $package; done Then we cd .cabal/packages/hackage.haskell.org Now we can run a command which extracts the .cabal file from each tarball to standard output: find . -name "*.tar.gz" -exec tar --wildcards "*.cabal" -Oxf {} \; We could grep for 'build-depends' or something, but that gives unreliable dirty results. (>80k items, resulting in a hard to believe 87k total deps and an average of 27 deps.) So instead, we use the Cabal library and write a program to parse Cabal files & spit out the dependencies, and we feed each .cabal into that: find . -name "*.tar.gz" -exec sh -c 'tar --wildcards "*.cabal" -Oxf {} | runhaskell ~/deps.hs' \; And what is deps.hs? Turns out to be surprisingly easy to parse a String, extract the Library and Executable AST, and grab the [Dependency] field, and then print it out (code is not particularly clean): import Distribution.Package import Distribution.PackageDescription import Distribution.PackageDescription.Parse main :: IO () main = do cbl <- getContents let desc = parsePackageDescription cbl case desc of ParseFailed _ -> return () ParseOk _ d -> putStr $ unlines $ map show $ map (\(Dependency x _) -> x) $ extractDeps d extractDeps :: GenericPackageDescription -> [Dependency] extractDeps d = ldeps ++ edeps where ldeps = case (condLibrary d) of Nothing -> [] Just c -> condTreeConstraints c edeps = concat $ map (condTreeConstraints . snd) $ condExecutables d So what are the results? (The output of one run is attached.) I get 18,134 dependencies, having run on 3,137 files, or 5.8 dependencies per package. -- gwern http://www.gwern.net

On Fri, Jul 1, 2011 at 1:43 PM, Gwern Branwen
Athas on #haskell wondered how many dependencies the average Haskell package had. I commented that it seemed like some fairly simple scripting to find out, and as these things tend to go, I wound up doing a complete solution myself.
First, we get most/all of Hackage locally to examine, as tarballs:
for package in `cabal list | grep '\*' | tr -d '\*'`; do cabal fetch $package; done
I think the index tarball has all the info you need, and would be faster to retrieve / process, if you or anyone else needs to get the .cabal files again: http://hackage.haskell.org/packages/archive/00-index.tar.gz (2.2mb) The set of the latest package sdists is also available: http://hackage.haskell.org/cgi-bin/hackage-scripts/archive.tar (~150mb) --Rogan
Then we cd .cabal/packages/hackage.haskell.org
Now we can run a command which extracts the .cabal file from each tarball to standard output:
find . -name "*.tar.gz" -exec tar --wildcards "*.cabal" -Oxf {} \;
We could grep for 'build-depends' or something, but that gives unreliable dirty results. (>80k items, resulting in a hard to believe 87k total deps and an average of 27 deps.) So instead, we use the Cabal library and write a program to parse Cabal files & spit out the dependencies, and we feed each .cabal into that:
find . -name "*.tar.gz" -exec sh -c 'tar --wildcards "*.cabal" -Oxf {} | runhaskell ~/deps.hs' \;
And what is deps.hs? Turns out to be surprisingly easy to parse a String, extract the Library and Executable AST, and grab the [Dependency] field, and then print it out (code is not particularly clean):
import Distribution.Package import Distribution.PackageDescription import Distribution.PackageDescription.Parse main :: IO () main = do cbl <- getContents let desc = parsePackageDescription cbl case desc of ParseFailed _ -> return () ParseOk _ d -> putStr $ unlines $ map show $ map (\(Dependency x _) -> x) $ extractDeps d extractDeps :: GenericPackageDescription -> [Dependency] extractDeps d = ldeps ++ edeps where ldeps = case (condLibrary d) of Nothing -> [] Just c -> condTreeConstraints c edeps = concat $ map (condTreeConstraints . snd) $ condExecutables d
So what are the results? (The output of one run is attached.) I get 18,134 dependencies, having run on 3,137 files, or 5.8 dependencies per package.
-- gwern http://www.gwern.net
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

On Fri, Jul 1, 2011 at 5:23 PM, Rogan Creswick
I think the index tarball has all the info you need, and would be faster to retrieve / process, if you or anyone else needs to get the .cabal files again:
http://hackage.haskell.org/packages/archive/00-index.tar.gz (2.2mb)
Looking at it, the index tarball contains the .cabal files for all versions known to Hackage, which isn't necessarily the interesting set of cabal files - I'm usually more interested in just the cabal files of the latest version of every package. No doubt there's a scripting solution (loop over the untarred directory of packages, and take the lexically last cabal file?), but it was easier to just exploit cabal fetch's behavior of fetching only the latest version and work with those tarballs. -- gwern http://www.gwern.net

On Fri, Jul 1, 2011 at 5:37 PM, Gwern Branwen
Looking at it, the index tarball contains the .cabal files for all versions known to Hackage, which isn't necessarily the interesting set of cabal files - I'm usually more interested in just the cabal files of the latest version of every package. No doubt there's a scripting solution (loop over the untarred directory of packages, and take the lexically last cabal file?), but it was easier to just exploit cabal fetch's behavior of fetching only the latest version and work with those tarballs.
The version using just the index tarball is kind of ugly; the filtering and extracting doesn't seem terribly easy, so the best script I could come up with was: cd ~/.cabal/packages/hackage.haskell.org && for DIR in */; do (for CABAL in `tar --wildcards "$DIR" -tf 00-index.tar|head -1`; do (tar -Oxf 00-index.tar $CABAL | runhaskell ~/deps.hs); done); done (Parentheses aren't necessary but make it more readable.) -- gwern http://www.gwern.net

Another thing you can do along the same lines is generate a script to download all the repos from packages which declare repos. Some ugly code: import Data.Maybe (fromJust) import Distribution.PackageDescription import Distribution.PackageDescription.Parse import Control.Monad (unless) main :: IO () main = do cbl <- getContents let desc = parsePackageDescription cbl case desc of ParseFailed _ -> return () ParseOk _ d -> do let repos = repoPair $ extractHead $ extractRepos d let cmd = concatMap shellify repos unless (null cmd) $ putStrLn cmd shellify :: (RepoType, String) -> String shellify (rt,url) = case rt of Darcs -> "darcs get " ++ url Git -> "git clone " ++ url SVN -> "svn clone " ++ url CVS -> "cvs co " ++ url Mercurial -> "hg clone " ++ url _ -> "" repoPair :: [SourceRepo] -> [(RepoType, String)] repoPair = map (\x -> (fromJust $ repoType x, fromJust $ repoLocation x)) extractHead :: [SourceRepo] -> [SourceRepo] extractHead rs = filter (\x -> isnothing x && ishead x) rs where ishead sr = case repoKind sr of RepoHead -> True _ -> False isnothing ss = case repoType ss of Nothing -> False Just _ -> case repoLocation ss of Nothing -> False Just _ -> True extractRepos :: GenericPackageDescription -> [SourceRepo] extractRepos = sourceRepos . packageDescription This generates results (with the same find command and setup as previously) like: ... git clone git://gitorious.org/maximus/mandulia.git darcs get http://darcs.cielonegro.org/HsOpenSSL/ darcs get http://darcs.cielonegro.org/HsOpenSSL/ hg clone https://bitbucket.org/bos/text-icugit clone https://github.com/bos/text-icu darcs get http://code.haskell.org/Graphalyze darcs get http://code.haskell.org/~roelvandijk/code/base-unicode-symbols git clone git://github.com/roelvandijk/base-unicode-symbols.git darcs get http://code.haskell.org/~basvandijk/code/regions git clone https://github.com/skogsbaer/xmlgen git clone git://github.com/tanakh/HongoDB.git darcs get http://repos.mornfall.net/shellish darcs get http://patch-tag.com/r/Saizan/syb-with-class/ git clone git://github.com/ekmett/eq.git git clone git://github.com/ekmett/data-lens-fd.git git clone git://github.com/ekmett/streams.git git clone git://github.com/alanz/hjsmin.git darcs get http://patch-tag.com/r/byorgey/diagrams-lib ... -- gwern http://www.gwern.net/haskell/Archiving%20GitHub

On 02.07.2011 01:26, Gwern Branwen wrote:
Another thing you can do along the same lines is generate a script to download all the repos from packages which declare repos. Some ugly code:
If 'script' also includes Haskell code, then the 'tar' package could be of help to walk throught the TAR archive.
participants (3)
-
Gwern Branwen
-
Henning Thielemann
-
Rogan Creswick