ANN: islink 0.1.0.0: check if an HTML element is a link (useful for web scraping)

Hello everybody,

I'd like to announce the first public release of islink. It's a library that provides a list of combinations of HTML tag names and attributes that correspond to links to external resources. This includes things like ("a", "href"), ("img", "src"), ("script", "src"), etc. It also comes with a convenience function to check if a particular (tag, attribute) pair corresponds to a link. This can be useful for web scraping.

Here's an example of how to use it to extract all (external) links from an HTML document (with the help of hxt):

    {-# LANGUAGE Arrows #-}

    import Text.Html.IsLink
    import Text.XML.HXT.Core

    -- returns a list of tuples containing the tag name, attribute name,
    -- attribute value of all links
    getAllLinks :: FilePath -> IO [(String, String, String)]
    getAllLinks path = runX $ doc >>> multi getLink
      where
        doc = readDocument [withParseHTML yes, withWarnings no] path

    getLink :: ArrowXml a => a XmlTree (String, String, String)
    getLink = proc node -> do
        tag       <- getName            -< node
        attrbNode <- getAttrl           -< node
        attrb     <- getName            -< attrbNode
        val       <- xshow getChildren  -< attrbNode
        isLinkA -< (tag, attrb, val)
      where
        isLinkA = isLink `guardsP` this
        isLink (tag, attrb, _) = isLinkAttr tag attrb
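For readers who want to see the underlying idea without hxt or arrows, here is a minimal sketch of a (tag, attribute) table with a membership check. It uses only base and does not call islink. The names linkPairs, isLinkPair and keepLinks are hypothetical, and the three-entry table is just the examples mentioned above, not the library's actual list.

```haskell
-- Hypothetical stand-in for the table that islink provides; the real
-- list is longer and its exported names may differ.
linkPairs :: [(String, String)]
linkPairs =
  [ ("a", "href")
  , ("img", "src")
  , ("script", "src")
  ]

-- True when the (tag, attribute) pair appears in the table above.
-- The comparison is exact: no case folding or whitespace trimming.
isLinkPair :: String -> String -> Bool
isLinkPair tag attr = (tag, attr) `elem` linkPairs

-- Keep only the (tag, attribute, value) triples whose pair is a link.
keepLinks :: [(String, String, String)] -> [(String, String, String)]
keepLinks = filter (\(tag, attr, _) -> isLinkPair tag attr)
```

Loaded in GHCi, isLinkPair "a" "href" evaluates to True and isLinkPair "a" "class" to False. A plain list keeps the sketch dependency-free; a set would be the more natural choice for a larger table.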
Marios Titas