Wikipedia archiving bot - code review

Hey everyone. So I've been learning Haskell for a while now, and I've found the best way to move from theory to practice is to just write something useful for yourself. Now, I'm keen on editing Wikipedia, and I've long wanted some way to stop links to external websites from breaking on me. So I wrote this little program using the TagSoup library, which downloads Wikipedia articles, parses out external links, and then asks WebCite to archive them.

But there's a problem: no matter how I look at it, it's just way too slow. Running on a measly 100 articles at a time, it'll eat up to half my processor time and RAM (according to top). I converted it over to ByteStrings, since those are supposed to be a lot better than regular Strings, but that didn't seem to help much.

So I'm curious: in what way could this code be better? How could it be more idiomatic or shorter? In particular, how could it be more efficient in either space or time? Any comments are appreciated.
{- Module      : Main.hs
   License     : public domain
   Maintainer  : Gwern Branwen
   Stability   : unstable
   Portability : portable
   Functionality: retrieve specified articles from Wikipedia and request
     WebCite to archive all URLs found.
   TODO: send an equivalent request to the Internet Archive.
     Not in any way rate-limited.
   BUGS: Issues redundant archive requests. Currently uses
     Data.ByteString.Lazy.Char8; if I'm understanding the documentation
     right, this barfs on the full UTF-8 character set, but Wikipedia
     definitely exercises the full UTF-8 set.
   USE: Pipe to stdin a succession of Wikipedia article names (whitespace
     in names should be escaped as '_'). A valid invocation might be, say:
     '$ echo Fujiwara_no_Teika Fujiwara_no_Shunzei | archive-bot'
     All URLs in [[Fujiwara no Teika]] and [[Fujiwara no Shunzei]] would
     then be backed up. If you wanted to run this on all of Wikipedia, you
     could take the current 'all-titles-in-ns0' gzipped file from
     [[WP:DUMP]], gunzip it, and then pipe it into archive-bot. -}
module Main where

import Text.HTML.TagSoup (parseTags, Tag(TagOpen))
import Text.HTML.Download (openURL)
import Data.List (isPrefixOf)
import Control.Monad (liftM)
import Data.Set (toList, fromList)
import qualified Data.ByteString.Lazy.Char8 as B
    (ByteString, getContents, lines, unlines, pack, unpack, words)

main :: IO ()
main = do articles <- liftM B.words B.getContents
          urls     <- liftM sortNub (mapM fetchArticleText articles)
          mapM_ archiveURL urls
  where -- merge, deduplicate, and sort the per-article URL lists
        sortNub :: [[B.ByteString]] -> [B.ByteString]
        sortNub = toList . fromList . concat

fetchArticleText :: B.ByteString -> IO [B.ByteString]
fetchArticleText article = liftM (B.lines . extractURLs)
                                 (openURL (wikipedia ++ B.unpack article))
  where wikipedia = "http://en.wikipedia.org/wiki/"

-- Keep every attribute value of every <a> tag that starts with "http://";
-- this over-approximates href but catches all the external links.
extractURLs :: String -> B.ByteString
extractURLs arg = B.unlines $ map B.pack
    [x | TagOpen "a" atts <- parseTags arg
       , (_, x) <- atts
       , "http://" `isPrefixOf` x]

archiveURL :: B.ByteString -> IO String
archiveURL url = openURL ("www.webcitation.org/archive?url="
                          ++ B.unpack url ++ emailAddress)
  where emailAddress = "&email=gwern0@gmail.com"
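The header's TODO notes that nothing here is rate-limited. The simplest remedy is a fixed pause around each archive request using Control.Concurrent.threadDelay (a sketch only; the one-second interval is an arbitrary assumption, not anything WebCite documents):

import Control.Concurrent (threadDelay)

-- Wrap an IO action in a fixed delay; threadDelay takes microseconds.
politely :: IO a -> IO a
politely act = threadDelay 1000000 >> act

-- In main, 'mapM_ archiveURL urls' would become:
--   mapM_ (politely . archiveURL) urls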
Hi
You may find that the slowdown is coming from your use of the TagSoup
library - I'm currently reworking the parser to make sure it's fully
lazy and doesn't space leak. I hope that the version in darcs tomorrow
will have all those issues fixed.
Thanks
Neil
On 6/26/07, Gwern Branwen wrote:
> [original message and code quoted in full; snipped]
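Whether a run leaks space also depends on how the tag list is consumed: a single streaming pass that never retains the whole list gives a lazy parser a chance to run in constant space (a sketch against the parseTags used above, not a claim about TagSoup's internals):

import Data.List (isPrefixOf)
import Text.HTML.TagSoup (parseTags, Tag(TagOpen))

-- One pass over the tags: count external links without holding on to
-- the list, so tags already inspected can be garbage-collected.
countExternalLinks :: String -> Int
countExternalLinks = length . filter external . parseTags
  where external (TagOpen "a" atts) =
            any (\(_, v) -> "http://" `isPrefixOf` v) atts
        external _ = False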

On Tue, 26 Jun 2007, Neil Mitchell wrote:
> You may find that the slowdown is coming from your use of the TagSoup
> library - I'm currently reworking the parser to make sure it's fully
> lazy and doesn't space leak. I hope that the version in darcs tomorrow
> will have all those issues fixed.
Are you talking about the Hackage version? The darcs version contains the module Text/HTML/TagSoup/Test.hs, which tests the lazy behaviour by feeding infinite input to the parser and checking whether results can still be computed. If you (still) find situations where the parser is not lazy enough, please add new test cases.
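The infinite-input test Henning describes can be reproduced in a couple of lines: parse an endless document and demand only a finite prefix. If parseTags is fully lazy, this terminates (a sketch against the same TagSoup API used above):

import Text.HTML.TagSoup (parseTags, Tag)

-- 'cycle' yields an infinite document; 'take' demands only a prefix.
-- If this value can be printed, the parser produced output without
-- consuming all of its input first.
lazinessProbe :: [Tag]
lazinessProbe = take 4 (parseTags (cycle "<a href=\"http://example.com\">x</a>"))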

gwern0:
> [original message and code quoted in full; snipped]
You don't seem to be using bytestrings for anything important here --
you just pass them in and immediately unpack them back to String anyway,
since TagSoup only downloads String and parses String. Probably, as Neil
says, TagSoup just isn't optimised much yet.

Perhaps try the bytestring-based urlcheck?
http://hackage.haskell.org/cgi-bin/hackage-scripts/package/urlcheck-0.1

Neil, perhaps TagSoup should provide a bytestring layer at the bottom --
so there's some hope of efficient downloading -- with a String layer on
top, not the other way around?

-- Don
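Concretely: since openURL returns String and parseTags consumes String, the whole pipeline can stay in String and still deduplicate through Data.Set, dropping the pack/unpack round trips entirely. A sketch of that refactoring, using only the modules the original already imports (untested):

module Main where

import Text.HTML.TagSoup (parseTags, Tag(TagOpen))
import Text.HTML.Download (openURL)
import Data.List (isPrefixOf)
import Data.Set (toList, fromList)

main :: IO ()
main = do articles <- fmap words getContents
          urls <- fmap concat (mapM fetchArticleURLs articles)
          mapM_ archiveURL (toList (fromList urls))  -- sorted and deduplicated

fetchArticleURLs :: String -> IO [String]
fetchArticleURLs article =
    fmap extractURLs (openURL ("http://en.wikipedia.org/wiki/" ++ article))

extractURLs :: String -> [String]
extractURLs arg = [x | TagOpen "a" atts <- parseTags arg
                     , (_, x) <- atts
                     , "http://" `isPrefixOf` x]

archiveURL :: String -> IO String
archiveURL url = openURL ("www.webcitation.org/archive?url=" ++ url
                          ++ "&email=gwern0@gmail.com")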
participants (4)
- dons@cse.unsw.edu.au
- Gwern Branwen
- Henning Thielemann
- Neil Mitchell