[Haskell-cafe] Shouldn't this be lazy???

22 Jun 2007

      Hi all,

I'm playing with the TagSoup library trying to extract links to
original pictures from my Flickr Sets page. This programs first loads
the Sets page, open links to each set, get links to pictures and then
search for original picture link (see steps in main function).

It does the job, but for the tests I just wanted to take 10 links to
reduce the time the program runs. Just hoping that haskell laziness
would magically take the minimum amount of data required to get the
first 10 links out of this set of pages.

I did this replacing:
   (putStrLn . unlines . concat) origLinks
with
   (putStrLn . unlines . take 10 . concat) origLinks
in the main function.

With the last version of that line, I effectively only get 10 links
but the runtime is exactly the same for both main functions.

As I'm a newbie haskell programmer I certainly missing something.

By the way I know Flickr has an api I could use, but the purpose was
playing with TagSoup.

Thanks for any advice.

Olivier.

Here's the code:

module Main where

import Data.Html.TagSoup
import Control.Monad (liftM)
import Data.List (isPrefixOf, groupBy)
import Data.Maybe (mapMaybe)
import System (getArgs)
import System.Time
import IO (hPutStrLn, stderr)

base    = "http://www.flickr.com"
setsUrl name = "/photos/" ++ name ++ "/sets/"

main :: IO ()
main = do
    args      <- getArgs
    tStart    <- getClockTime
    setLinks  <- getLinksByAttr ("class", "Seta") (base ++ setsUrl (args !! 0))
    picLinks  <- mapM (getLinksByAttr ("class", "image_link")) setLinks
    origLinks <- mapM (getLinksAfterImgByAttr ("src",
"http://l.yimg.com/www.flickr.com/images/icon_download.gif")) $
(mapMaybe linkToOrigSize . concat) picLinks
    (putStrLn . unlines . concat) origLinks
    tEnd      <- getClockTime
    hPutStrLn stderr ( timeDiffToString $ diffClockTimes tEnd tStart )

-- | extract all links from "a" tag types having given attribute
getLinksByAttr :: (String, String) -> String -> IO [String]
getLinksByAttr attr url = do
    sects <- getSectionsByTypeAndAttr "a" attr url
    return $ hrefs sects

-- | get "a" tags following a "img" having a specific attribute
getLinksAfterImgByAttr :: (String, String) -> String -> IO [String]
getLinksAfterImgByAttr attr url = do
    sects <- getSectionsByTypeAndAttr "img" attr url
    return $ hrefs $ map (dropWhile (not . isTagOpen) . drop 1) sects

-- | create sections from tag type and attribute
getSectionsByTypeAndAttr :: String -> (String, String) -> String -> IO [[Tag]]
getSectionsByTypeAndAttr tagType attr url = do
    tags <- liftM parseTags $ openURL $ url
    (return . filterByTypeAndAttr tagType attr) tags
  where
    filterByTypeAndAttr :: String -> (String, String) -> [Tag] -> [[Tag]]
    filterByTypeAndAttr t a = sections (~== TagOpen t [a])

-- | extract href values from sections of "a" tags
hrefs :: [[Tag]] -> [String]
hrefs = map (addBase . fromAttrib "href" . head)
  where
    addBase :: String -> String
    addBase s | "http://" `isPrefixOf` s = s
    addBase s | otherwise                = base ++ s

-- | transform a link to a picture into a link to the original size picture
linkToOrigSize :: String -> Maybe String
linkToOrigSize link =
    if parts !! 3 == "photos" then
        Just $ newUrl parts
        else
            Nothing
  where
    parts = map tail $ groupBy (const(/='/')) link
    newUrl p = "http://www.flickr.com/photo_zoom.gne?id=" ++ p !! 5 ++
"&size=o&context=" ++ p !! 7

[Haskell-cafe] Shouldn't this be lazy???

Olivier Boudry