ANN: archiver 0.1 and 0.2

I'd like to announce a small utility and library which builds on my WebArchive plugin for gitit: archiver, http://hackage.haskell.org/package/archiver. Source is available via `darcs get http://community.haskell.org/~gwern/archiver/`.

The library half is a simple wrapper around the appropriate HTTP requests; the executable half reads a text file and loops as it (slowly) fires off requests and deletes the appropriate URL. That is, 'archiver' is a daemon which will process a specified text file, each line of which is a URL, and will one by one request that the URLs be archived or spidered by http://www.webcitation.org * and http://www.archive.org ** for future reference. In other words, WebCite and the Internet Archive will store a copy of the HTML and, hopefully, all the non-dynamic resources the web pages need. (An example would be http://bits.blogs.nytimes.com/2010/12/07/palm-is-far-from-game-over-says-for... and http://webcitation.org/5ur7ifr12)

Usage of archiver might look like `while true; do archiver ~/.urls.txt gwern0@gmail.com; done`***.

There are a number of ways to populate the source text file. For example, I have a script `firefox-urls` which is called from my crontab every hour and which looks like this:

    #!/bin/sh
    set -e
    cp `find ~/.mozilla/ -name "places.sqlite"` ~/
    sqlite3 ~/places.sqlite "SELECT url FROM moz_places, moz_historyvisits \
                             WHERE moz_places.id = moz_historyvisits.place_id \
                             AND visit_date > strftime('%s','now','-1 day')*1000000 \
                             ORDER BY visit_date;" >> ~/.urls.txt
    rm ~/places.sqlite

This gets every URL visited in the last day and appends it to the file for archiver to process. Hence, everything I browse is backed up.

More useful, perhaps, is a script to extract external links from Markdown files and print them to stdout:

    import System.Environment (getArgs)
    import Text.Pandoc (defaultParserState, processWithM, readMarkdown, Inline(Link), Pandoc)

    main = getArgs >>= mapM readFile >>= mapM_ analyzePage

    analyzePage :: String -> IO Pandoc
    analyzePage x = processWithM printLinks (readMarkdown defaultParserState x)

    -- print the target of every Link node, returning the node unchanged
    printLinks l@(Link _ (x, _)) = putStrLn x >> return l
    printLinks x                 = return x

So now I can take `find . -name "*.page"`, pass the 100 or so Markdown files in my wiki as arguments, and add the thousand or so external links to the archiver queue (e.g. `find . -name "*.page" | xargs runhaskell link-extractor.hs >> ~/.urls.txt`). They will eventually be archived/backed up, and combined with a tool like link-checker**** this means there need never be any broken links, since one can either find a live link or use the archived version.

General comments: I've used archiver for a number of weeks now. It has never caught up with my Firefox-generated backlog, since WebCite appears to throttle by IP: in my experiments, you can't request archiving more often than about once per 20 seconds. Because of that I removed the hinotify 'watch file' functionality; it may be that I was too hasty in removing it.

* http://en.wikipedia.org/wiki/WebCite
** http://en.wikipedia.org/wiki/Internet_Archive
*** There are sporadic exceptions from somewhere in the network or HTTP libraries, I think.
**** http://linkchecker.sourceforge.net/

--
gwern
http://www.gwern.net
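
P.S. For anyone curious what the daemon half boils down to, here is a minimal sketch of the loop in Haskell. It is not the actual archiver source: the WebCite request form (`/archive?url=...&email=...`) and the exact 20-second delay are assumptions based on WebCite's public interface and the behaviour described above, and the real executable also submits URLs to the Internet Archive and is more careful about errors.

    import Control.Concurrent (threadDelay)
    import Network.HTTP (simpleHTTP, getRequest, urlEncode)
    import System.Environment (getArgs)

    main :: IO ()
    main = do
        [file, email] <- getArgs
        loop file email

    -- Process the queue one URL at a time, deleting each line once it has been submitted.
    loop :: FilePath -> String -> IO ()
    loop file email = do
        contents <- readFile file
        length contents `seq` return ()   -- force the lazy read so the file can be rewritten
        case lines contents of
          []       -> return ()           -- queue empty; the `while true` wrapper restarts us
          (u:rest) -> do
              submit email u
              writeFile file (unlines rest)
              threadDelay (20 * 1000 * 1000)   -- ~20 seconds, to respect WebCite's apparent throttle
              loop file email

    -- Fire off a single WebCite archiving request; the response body is ignored.
    submit :: String -> String -> IO ()
    submit email url = do
        _ <- simpleHTTP (getRequest ("http://www.webcitation.org/archive?url="
                                     ++ urlEncode url ++ "&email=" ++ urlEncode email))
        return ()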