Downloading Haskell repos from GitHub

Along the lines of http://blog.patch-tag.com/2010/03/13/mirroring-patch-tag/ for downloading all patch-tag.com repositories, I've begun to wonder how to download all Github repositories since more and more people seem to be using it.
Nothing in http://develop.github.com/ seems especially useful for grabbing the git:// URLs of all repos by language - just by user.
The only real list of repos by language seems to be gotten at via http://github.com/languages/Haskell/updated or http://github.com/languages/Haskell/created . (You might think http://github.com/languages/Haskell would be good, but no, it's just a few random repos by interest and not a full listing.)
I looked at the HTML, and it looks possible to use tagsoup to get all 98 pages and then parse the entries to get the HTTP URLs of the repos, and then turn *that* into git:// URLs suitable for shelling out to 'git clone', but I can't help but wonder if maybe there's a better approach someone more familiar with Github would know.
-- gwern
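A minimal sketch of the last step described above, turning a repository's HTTP URL into a git:// URL suitable for 'git clone'. It is written in Python only to keep it short, and it assumes the URL layout http://github.com/<user>/<repo>; the function name to_git_url and the example user and repository names are hypothetical and are not taken from any script mentioned in this thread.

```python
from urllib.parse import urlsplit


def to_git_url(http_url: str) -> str:
    """Map http://github.com/<user>/<repo> to git://github.com/<user>/<repo>.git.

    Assumption: the first two path segments are the user and the repository.
    """
    parts = urlsplit(http_url)
    # Drop empty segments produced by leading or trailing slashes.
    segments = [s for s in parts.path.split("/") if s]
    if len(segments) < 2:
        raise ValueError(f"not a repository URL: {http_url!r}")
    user, repo = segments[0], segments[1]
    # Avoid producing "repo.git.git" if the suffix is already present.
    if repo.endswith(".git"):
        repo = repo[: -len(".git")]
    return f"git://{parts.netloc}/{user}/{repo}.git"


if __name__ == "__main__":
    # Hypothetical repository, used only to show the shape of the output.
    print(to_git_url("http://github.com/someuser/somerepo"))
```

Each resulting URL could then be handed to 'git clone', one repository at a time.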

On Fri, Apr 30, 2010 at 5:38 PM, Gwern Branwen wrote:
> Nothing in http://develop.github.com/ seems especially useful for grabbing the git:// URLs of all repos by language - just by user.
> The only real list of repos by language seems to be gotten at via http://github.com/languages/Haskell/updated or http://github.com/languages/Haskell/created . (You might think http://github.com/languages/Haskell would be good, but no, it's just a few random repos by interest and not a full listing.)
Github has a REST API for accessing data. Unfortunately it can't give you the wanted breakdown, but I would ask them for it. It is much simpler for you, and it does not put an extra strain on their servers due to the scraping. Usually, the github guys are helpful when you have a question.
-- J.

On Fri, Apr 30, 2010 at 11:51 AM, Jesper Louis Andersen wrote:
> On Fri, Apr 30, 2010 at 5:38 PM, Gwern Branwen wrote:
>> Nothing in http://develop.github.com/ seems especially useful for grabbing the git:// URLs of all repos by language - just by user.
>> The only real list of repos by language seems to be gotten at via http://github.com/languages/Haskell/updated or http://github.com/languages/Haskell/created . (You might think http://github.com/languages/Haskell would be good, but no, it's just a few random repos by interest and not a full listing.)
> Github has a REST API for accessing data. Unfortunately it can't give you the wanted breakdown, but I would ask them for it. It is much simpler for you,
You mean ask for a new feature? (Just a one-time list is no good since I intend to repeat it regularly to pick up new repos, just like with patch-tag.)
> and it does not put an extra strain on their servers due to the scraping.
Well, it'd only be about 2000 HTTP hits. (98 + (20 * 98)). The downloading of the repos would probably reduce that demand to insignificance, especially the first time around when most of the repos would need to be downloaded.
> Usually, the github guys are helpful when you have a question.
Any suggested method besides the obvious http://github.com/contact ?
-- gwern
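The "about 2000 HTTP hits" estimate above can be checked with a small Python sketch. The figure of 98 listing pages comes from the thread; reading the 20 as the number of repositories per listing page, with one extra request per repository page, is an assumption, and the names below are hypothetical.

```python
LISTING_PAGES = 98    # from the thread: 98 pages of Haskell repositories
REPOS_PER_PAGE = 20   # assumption: 20 repositories listed on each page


def estimated_requests(pages: int, per_page: int) -> int:
    """One request per listing page, plus one request per repository page."""
    return pages + pages * per_page


if __name__ == "__main__":
    print(estimated_requests(LISTING_PAGES, REPOS_PER_PAGE))  # 2058
```

This counts only the page fetches; the 'git clone' traffic is separate.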

On Fri, Apr 30, 2010 at 6:02 PM, Gwern Branwen wrote:
>> Github has a REST API for accessing data. Unfortunately it can't give you the wanted breakdown, but I would ask them for it. It is much simpler for you,
> You mean ask for a new feature? (Just a one-time list is no good since I intend to repeat it regularly to pick up new repos, just like with patch-tag.)
Yes.
> Any suggested method besides the obvious http://github.com/contact ?
No.
-- J.

On Fri, Apr 30, 2010 at 12:02 PM, Gwern Branwen wrote:
> On Fri, Apr 30, 2010 at 11:51 AM, Jesper Louis Andersen wrote:
>> On Fri, Apr 30, 2010 at 5:38 PM, Gwern Branwen wrote:
>>> Nothing in http://develop.github.com/ seems especially useful for grabbing the git:// URLs of all repos by language - just by user.
>>> The only real list of repos by language seems to be gotten at via http://github.com/languages/Haskell/updated or http://github.com/languages/Haskell/created . (You might think http://github.com/languages/Haskell would be good, but no, it's just a few random repos by interest and not a full listing.)
>> Github has a REST API for accessing data. Unfortunately it can't give you the wanted breakdown, but I would ask them for it. It is much simpler for you,
> You mean ask for a new feature? (Just a one-time list is no good since I intend to repeat it regularly to pick up new repos, just like with patch-tag.)
>> and it does not put an extra strain on their servers due to the scraping.
> Well, it'd only be about 2000 HTTP hits. (98 + (20 * 98)). The downloading of the repos would probably reduce that demand to insignificance, especially the first time around when most of the repos would need to be downloaded.
>> Usually, the github guys are helpful when you have a question.
Ultimately, they never did anything about it: http://support.github.com/discussions/email/6782-contact-extending-api-to-ea...
So I wrote a TagSoup scraper; then I wrote a long tutorial explaining how I wrote it, step by step.
1. my tutorial: http://www.gwern.net/haskell/Archiving%20GitHub.html
2. the script itself: http://www.gwern.net/haskell/Archiving%20GitHub.html#the-script
3. Reddit submission of #1 for those who prefer to comment there: http://www.reddit.com/r/haskell/comments/g7na5/writing_a_haskell_script_to_d...
(While writing the tutorial, I tweaked the script code, so I'm not 100% confident that it still works - it uses too much GitHub bandwidth (and local disk space) for me to re-run it just to see whether it still works. So if anyone does run it, I would appreciate knowing whether it still works.)
-- gwern http://www.gwern.net
participants (2): Gwern Branwen, Jesper Louis Andersen