
I'd like to announce wp-archivebot.

# What

wp-archivebot is a relatively simple little script which follows all the links in an RSS feed, combs the destination for http:// links, and submits them to WebCite.

WebCite (https://secure.wikimedia.org/wikipedia/en/wiki/WebCite) is an organization much like the more famous Internet Archive. Unlike the Wayback Machine, however, WebCite will archive pages on-demand.*

# Why

This is good, since link-rot and 404 errors are a fact of life on Wikipedia. Links go stale, fall dead, get banned, edited, censored, etc. If those links are being used as a reference for some important fact or detail, then there is a very big problem. Even the hit-or-miss Internet Archive has proven to be very useful for editors**, so a more reliable way of archiving links would be even better.

# Limitations

The WebCite FAQ (http://webcitation.org/faq) mentions that a good project would be to
develop a wikipedia bot which scans new wikipedia articles for cited URLs, submits an archiving request to WebCite®, and then adds a link to the archived URL behind the cited URL
Adding a link would be both quite difficult and require community approval; further, although I have thought about this for years, there's no obvious good way to add a link. Any method is visually awkward, possibly otiose (if [[Google]] links to google.com as the homepage in its infobox, there's no point in having an archived version of google.com!), and certain to bloat the markup, even if there is a way to insert links without bolloxing templates and other such constructs.

So I'm satisfied to just archive the link. WebCite is searchable, after all. If enough people run bots like this and achieve enough coverage, then perhaps editors can be educated to always check in WebCite as well.

# Download & Install

As ever, wp-archivebot is Free and is available from Hackage at:

http://hackage.haskell.org/cgi-bin/hackage-scripts/package/wp-archivebot

You can install with ease by a simple 'cabal install wp-archivebot', or download the tarball and compile it yourself with the usual 'runhaskell Setup configure && runhaskell Setup build && runhaskell Setup install' dance.

# Usage

wp-archivebot takes one mandatory argument, an email address; WebCite needs to have somewhere to send notices of archival success/failure.

wp-archivebot takes a second, optional, argument: an RSS feed to use. It defaults to Special:NewPages on the English Wikipedia, but one could just as well follow, say, RecentChanges. Here's an example:
wp-archivebot gwern0@gmail.com 'http://en.wikipedia.org/w/index.php?title=Special:RecentChanges&feed=rss'
(This sets my email address as the recipient, and follows RecentChanges. This may not be a good idea as RecentChanges is *much* busier than NewPages.)

## Example

Here's an example session's output:

[12:35 PM] 829Mb$ wp-archivebot gwern0@gmail.com
"http://www.webcitation.org/archive?url=http://en.wikisource.org/wiki/Berkeley,_George,_first_earl_of_Berkeley_(DNB00)&email=gwern0@gmail.com"
"http://www.webcitation.org/archive?url=http://www.baseball-reference.com/players/u/uhaltfr01.shtml&email=gwern0@gmail.com"
"http://www.webcitation.org/archive?url=http://www.baseball-reference.com/players/u/uhaltfr01.shtml&email=gwern0@gmail.com"
"http://www.webcitation.org/archive?url=http://www.baseball-reference.com/minors/player.cgi?id=uhalt-001ber&email=gwern0@gmail.com"
"http://www.webcitation.org/archive?url=http://www.baseball-reference.com/minors/player.cgi?id=uhalt-001ber&email=gwern0@gmail.com"
"http://www.webcitation.org/archive?url=http://www.timesargus.com/apps/pbcs.dll/article?AID=/20080509/FEATURES02/805090316/1011/FEATURES02&email=gwern0@gmail.com"
"http://www.webcitation.org/archive?url=http://www.timesargus.com/apps/pbcs.dll/article?AID=/20080509/FEATURES02/805090316/1011/FEATURES02&email=gwern0@gmail.com"
"http://www.webcitation.org/archive?url=http://www.timesargus.com/apps/pbcs.dll/article?AID=/20080509/FEATURES02/805090316/1011/FEATURES02&email=gwern0@gmail.com"
"http://www.webcitation.org/archive?url=http://www.timesargus.com/apps/pbcs.dll/article?AID=/20080509/FEATURES02/805090316/1011/FEATURES02&email=gwern0@gmail.com"
"http://www.webcitation.org/archive?url=http://www.erniestires.net/&email=gwern0@gmail.com"
"http://www.webcitation.org/archive?url=http://www.erniestires.net/&email=gwern0@gmail.com"
"http://www.webcitation.org/archive?url=http://thepeerage.com/p4893.htm#i48927&email=gwern0@gmail.com"
"http://www.webcitation.org/archive?url=http://thepeerage.com/p4893.htm#i48927&email=gwern0@gmail.com"
"http://www.webcitation.org/archive?url=http://www.leighrayment.com/commons/Acommons3.htm&email=gwern0@gmail.com"
"http://www.webcitation.org/archive?url=http://www.leighrayment.com/commons/Acommons3.htm&email=gwern0@gmail.com"
"http://www.webcitation.org/archive?url=http://thepeerage.com/p4893.htm#i48927&email=gwern0@gmail.com"
"http://www.webcitation.org/archive?url=http://thepeerage.com/p4893.htm#i48927&email=gwern0@gmail.com"
"http://www.webcitation.org/archive?url=http://www.esec.edu&email=gwern0@gmail.com"
"http://www.webcitation.org/archive?url=http://www.esec.edu&email=gwern0@gmail.com"
...

# Related

The development version (HEAD) of the Gitit wiki has plugin support; one of those plugins, WebArchiver.hs, will on every page-save comb through for off-wiki links and submit them to WebCite in the same way as this bot. It's nice to know that if those links ever disappear, you can retrieve them from WebCite and 'see' the revision with the same set of external links as when the revision was created.

* Technically, the Internet Archive will archive on demand as well - but you need to pay them.

** In many more ways than one might expect. For example, not infrequently someone will visit an article and claim it is plagiarizing some other webpage. With the IA, it's easy to go back to the first version of that webpage and crosscheck against the article's history - quite often it is the other website that plagiarized us!

-- gwern
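
The request URLs in the example session follow a visible pattern: the WebCite archive endpoint, then the target URL, then the submitter's email address. A minimal Haskell sketch of that step, together with a crude scan for http:// links, might look like the following. This is not wp-archivebot's actual source: the names archiveRequest and httpLinks, the delimiter set, and the use of nub to drop repeated links are all assumptions made for illustration, and user@example.com is a placeholder address.

```haskell
import Data.List (isPrefixOf, nub)

-- | Build an archive-request URL in the shape seen in the example session.
-- Assumption: the target URL is appended as-is, with no percent-encoding,
-- since the session output shows none.
archiveRequest :: String -> String -> String
archiveRequest email url =
  "http://www.webcitation.org/archive?url=" ++ url ++ "&email=" ++ email

-- | Crude link scan (hypothetical helper): collect every substring that
-- starts with "http://" and runs until a quote, angle bracket, or whitespace.
httpLinks :: String -> [String]
httpLinks [] = []
httpLinks s@(_ : rest)
  | "http://" `isPrefixOf` s =
      let (link, after) = break (`elem` "\"'<> \n\t") s
      in link : httpLinks after
  | otherwise = httpLinks rest

main :: IO ()
main = mapM_ (print . archiveRequest "user@example.com") (nub (httpLinks page))
  where
    -- Stand-in for a fetched page; the real bot reads pages linked from a feed.
    page = "<a href=\"http://www.esec.edu\">a</a> "
        ++ "<a href=\"http://www.erniestires.net/\">b</a> "
        ++ "<a href=\"http://www.esec.edu\">a again</a>"
```

Dropping repeats with nub is a design choice of this sketch only; the session output above shows the same URL submitted more than once, so the bot itself evidently does not deduplicate.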