Wikipedia archiving bot - code review

Hey everyone. So I've been learning Haskell for a while now, and I've found the best way to move from theory to practice is to just write something useful for yourself. Now, I'm keen on editing Wikipedia, and I've long wanted some way to stop links to external websites from breaking on me. So I wrote this little program using the TagSoup library, which downloads Wikipedia articles, parses out external links, and then asks WebCite to archive them.

But there's a problem: no matter how I look at it, it's just way too slow. Running on a measly 100 articles at a time, it'll eat up to half my processor time and RAM (according to top). I converted it over to ByteStrings, since those are supposed to be a lot better than regular Strings, but that didn't seem to help much.

So I'm curious: in what way could this code be better? How could it be more idiomatic or shorter? In particular, how could it be more efficient in either space or time? Any comments are appreciated.
{- Module      : Main.hs
   License     : public domain
   Maintainer  : Gwern Branwen
   Stability   : unstable
   Portability : portable
   Functionality: retrieve specified articles from Wikipedia and request
     WebCite to archive all URLs found.
   TODO: send an equivalent request to the Internet Archive.
     Not in any way rate-limited.
   BUGS: Issues redundant archive requests. Currently uses
     Data.ByteString.Lazy.Char8; if I'm understanding the documentation
     right, this barfs on the full UTF-8 character set, but Wikipedia
     definitely exercises the full UTF-8 set.
   USE: Pipe to stdin a succession of Wikipedia article names (whitespace
     in names should be escaped as '_'). A valid invocation might be, say:
     '$ echo Fujiwara_no_Teika Fujiwara_no_Shunzei | archive-bot'
     All URLs in [[Fujiwara no Teika]] and [[Fujiwara no Shunzei]] would
     then be backed up. If you wanted to run this on all of Wikipedia, you
     could take the current 'all-titles-in-ns0' gzipped file from
     [[WP:DUMP]], gunzip it, and then pipe it into archive-bot. -}
module Main where

import Text.HTML.TagSoup (parseTags, Tag(TagOpen))
import Text.HTML.Download (openURL)
import Data.List (isPrefixOf)
import Control.Monad (liftM)
import Data.Set (toList, fromList)
import qualified Data.ByteString.Lazy.Char8 as B
    (ByteString, getContents, lines, unlines, pack, unpack, words)

main :: IO ()
main = do articles <- liftM B.words B.getContents
          urls     <- liftM sortNub (mapM fetchArticleText articles)
          mapM_ archiveURL urls
  where -- merge, deduplicate, and sort the per-article URL lists
        sortNub :: [[B.ByteString]] -> [B.ByteString]
        sortNub = toList . fromList . concat

fetchArticleText :: B.ByteString -> IO [B.ByteString]
fetchArticleText article = liftM (B.lines . extractURLs)
                                 (openURL (wikipedia ++ B.unpack article))
  where wikipedia = "http://en.wikipedia.org/wiki/"

-- Keep every attribute value of every <a> tag that starts with "http://";
-- this over-approximates href but catches all the external links.
extractURLs :: String -> B.ByteString
extractURLs arg = B.unlines $ map B.pack
    [x | TagOpen "a" atts <- parseTags arg
       , (_, x) <- atts
       , "http://" `isPrefixOf` x]

archiveURL :: B.ByteString -> IO String
archiveURL url = openURL ("www.webcitation.org/archive?url="
                          ++ B.unpack url ++ emailAddress)
  where emailAddress = "&email=gwern0@gmail.com"
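The header's TODO notes that nothing here is rate-limited. The simplest remedy is a fixed pause around each archive request using Control.Concurrent.threadDelay (a sketch only; the one-second interval is an arbitrary assumption, not anything WebCite documents):

import Control.Concurrent (threadDelay)

-- Wrap an IO action in a fixed delay; threadDelay takes microseconds.
politely :: IO a -> IO a
politely act = threadDelay 1000000 >> act

-- In main, 'mapM_ archiveURL urls' would become:
--   mapM_ (politely . archiveURL) urls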
Hi
You may find that the slowdown is coming from your use of the TagSoup
library - I'm currently reworking the parser to make sure it's fully
lazy and doesn't space leak. I hope that the version in darcs tomorrow
will have all those issues fixed.
Thanks
Neil
On 6/26/07, Gwern Branwen wrote:
> [original message and code quoted in full; snipped]
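Whether a run leaks space also depends on how the tag list is consumed: a single streaming pass that never retains the whole list gives a lazy parser a chance to run in constant space (a sketch against the parseTags used above, not a claim about TagSoup's internals):

import Data.List (isPrefixOf)
import Text.HTML.TagSoup (parseTags, Tag(TagOpen))

-- One pass over the tags: count external links without holding on to
-- the list, so tags already inspected can be garbage-collected.
countExternalLinks :: String -> Int
countExternalLinks = length . filter external . parseTags
  where external (TagOpen "a" atts) =
            any (\(_, v) -> "http://" `isPrefixOf` v) atts
        external _ = False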

On Tue, 26 Jun 2007, Neil Mitchell wrote:
> You may find that the slowdown is coming from your use of the TagSoup
> library - I'm currently reworking the parser to make sure it's fully
> lazy and doesn't space leak. I hope that the version in darcs tomorrow
> will have all those issues fixed.
Are you talking about the Hackage version? The darcs version contains the module Text/HTML/TagSoup/Test.hs, which tests the lazy behaviour by feeding infinite input to the parser and checking whether results can still be computed. If you (still) find situations where the parser is not lazy enough, please add new test cases.
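The infinite-input test Henning describes can be reproduced in a couple of lines: parse an endless document and demand only a finite prefix. If parseTags is fully lazy, this terminates (a sketch against the same TagSoup API used above):

import Text.HTML.TagSoup (parseTags, Tag)

-- 'cycle' yields an infinite document; 'take' demands only a prefix.
-- If this value can be printed, the parser produced output without
-- consuming all of its input first.
lazinessProbe :: [Tag]
lazinessProbe = take 4 (parseTags (cycle "<a href=\"http://example.com\">x</a>"))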

gwern0:
> [original message and code quoted in full; snipped]
You don't seem to be using bytestrings for anything important here --
you just pass them in and immediately unpack them back to String anyway,
since TagSoup only downloads String and parses String. Probably, as Neil
says, TagSoup just isn't optimised much yet.

Perhaps try the bytestring-based urlcheck?
http://hackage.haskell.org/cgi-bin/hackage-scripts/package/urlcheck-0.1

Neil, perhaps TagSoup should provide a bytestring layer at the bottom --
so there's some hope of efficient downloading -- with a String layer on
top, not the other way around?

-- Don
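Concretely: since openURL returns String and parseTags consumes String, the whole pipeline can stay in String and still deduplicate through Data.Set, dropping the pack/unpack round trips entirely. A sketch of that refactoring, using only the modules the original already imports (untested):

module Main where

import Text.HTML.TagSoup (parseTags, Tag(TagOpen))
import Text.HTML.Download (openURL)
import Data.List (isPrefixOf)
import Data.Set (toList, fromList)

main :: IO ()
main = do articles <- fmap words getContents
          urls <- fmap concat (mapM fetchArticleURLs articles)
          mapM_ archiveURL (toList (fromList urls))  -- sorted and deduplicated

fetchArticleURLs :: String -> IO [String]
fetchArticleURLs article =
    fmap extractURLs (openURL ("http://en.wikipedia.org/wiki/" ++ article))

extractURLs :: String -> [String]
extractURLs arg = [x | TagOpen "a" atts <- parseTags arg
                     , (_, x) <- atts
                     , "http://" `isPrefixOf` x]

archiveURL :: String -> IO String
archiveURL url = openURL ("www.webcitation.org/archive?url=" ++ url
                          ++ "&email=gwern0@gmail.com")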
participants (4)
- dons@cse.unsw.edu.au
- Gwern Branwen
- Henning Thielemann
- Neil Mitchell