
Hi all,

I'm playing with the TagSoup library, trying to extract links to the
original-size pictures from my Flickr Sets page. The program first loads
the Sets page, opens the links to each set, gets the links to the
pictures, and then searches for the original picture link (see the steps
in the main function). It does the job, but for testing I just wanted to
take 10 links to reduce the program's run time, hoping that Haskell's
laziness would magically read the minimum amount of data required to get
the first 10 links out of this set of pages. I did this by replacing:

  (putStrLn . unlines . concat) origLinks

with

  (putStrLn . unlines . take 10 . concat) origLinks

in the main function. With the second version of that line I effectively
only get 10 links, but the runtime is exactly the same for both. As a
newbie Haskell programmer, I'm certainly missing something.

By the way, I know Flickr has an API I could use, but the purpose was to
play with TagSoup.

Thanks for any advice,
Olivier

Here's the code:

module Main where

import Data.Html.TagSoup
import Control.Monad (liftM)
import Data.List (isPrefixOf, groupBy)
import Data.Maybe (mapMaybe)
import System (getArgs)
import System.Time
import IO (hPutStrLn, stderr)

base = "http://www.flickr.com"

setsUrl name = "/photos/" ++ name ++ "/sets/"

main :: IO ()
main = do
  args <- getArgs
  tStart <- getClockTime
  setLinks <- getLinksByAttr ("class", "Seta") (base ++ setsUrl (args !! 0))
  picLinks <- mapM (getLinksByAttr ("class", "image_link")) setLinks
  origLinks <- mapM (getLinksAfterImgByAttr
                      ("src", "http://l.yimg.com/www.flickr.com/images/icon_download.gif"))
             $ (mapMaybe linkToOrigSize . concat) picLinks
  (putStrLn . unlines . concat) origLinks
  tEnd <- getClockTime
  hPutStrLn stderr (timeDiffToString $ diffClockTimes tEnd tStart)

-- | extract all links from "a" tags having the given attribute
getLinksByAttr :: (String, String) -> String -> IO [String]
getLinksByAttr attr url = do
  sects <- getSectionsByTypeAndAttr "a" attr url
  return $ hrefs sects

-- | get "a" tags following an "img" having a specific attribute
getLinksAfterImgByAttr :: (String, String) -> String -> IO [String]
getLinksAfterImgByAttr attr url = do
  sects <- getSectionsByTypeAndAttr "img" attr url
  return $ hrefs $ map (dropWhile (not . isTagOpen) . drop 1) sects

-- | create sections from tag type and attribute
getSectionsByTypeAndAttr :: String -> (String, String) -> String -> IO [[Tag]]
getSectionsByTypeAndAttr tagType attr url = do
  tags <- liftM parseTags $ openURL url
  return $ filterByTypeAndAttr tagType attr tags
  where
    filterByTypeAndAttr :: String -> (String, String) -> [Tag] -> [[Tag]]
    filterByTypeAndAttr t a = sections (~== TagOpen t [a])

-- | extract href values from sections of "a" tags
hrefs :: [[Tag]] -> [String]
hrefs = map (addBase . fromAttrib "href" . head)
  where
    addBase :: String -> String
    addBase s
      | "http://" `isPrefixOf` s = s
      | otherwise                = base ++ s

-- | transform a link to a picture into a link to the original size picture
linkToOrigSize :: String -> Maybe String
linkToOrigSize link =
  if parts !! 3 == "photos" then Just $ newUrl parts else Nothing
  where
    parts = map tail $ groupBy (const (/= '/')) link
    newUrl p = "http://www.flickr.com/photo_zoom.gne?id=" ++ p !! 5
               ++ "&size=o&context=" ++ p !! 7

"Olivier Boudry"
I did this replacing: (putStrLn . unlines . concat) origLinks with (putStrLn . unlines . take 10 . concat) origLinks
Unfortunately, 'origLinks' has already been computed in full, before the 'take 10' applies to it. Why? Because 'origLinks' is the result of an I/O action, which forces it:
main = do ... origLinks <- mapM (getLinksAfterImgByAttr ...) picLinks
What you really want to do is to trim the picLinks before you download them. e.g.
main = do ... origLinks <- mapM (getLinksAfterImgByAttr ...) (take 10 picLinks)
Regards, Malcolm
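
To make the failure mode concrete: mapM in IO performs every action in
the list before returning any results, so a later take 10 cannot prevent
work already done. A minimal sketch, with a hypothetical fetch standing
in for the real HTTP request:

-- Hypothetical stand-in for a slow network fetch.
fetch :: String -> IO [String]
fetch url = do
  putStrLn ("fetching " ++ url)    -- visible side effect
  return [url ++ "/a", url ++ "/b"]

main :: IO ()
main = do
  let urls = map (\n -> "http://example.com/set" ++ show n) [1 .. 20 :: Int]
  -- all 20 fetches run, even though only 10 results are kept:
  allOf  <- fmap (take 10 . concat) (mapM fetch urls)
  -- only the first 5 urls are ever fetched:
  someOf <- fmap concat (mapM fetch (take 5 urls))
  print (length allOf, length someOf)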

On Fri, Jun 22, 2007 at 04:53:42PM +0100, Malcolm Wallace wrote:
"Olivier Boudry"
wrote: I did this replacing: (putStrLn . unlines . concat) origLinks with (putStrLn . unlines . take 10 . concat) origLinks
Unfortunately, 'origLinks' has already been computed in full, before the 'take 10' applies to it. Why? Because 'origLinks' is the result of an I/O action, which forces it:
main = do ... origLinks <- mapM (getLinksAfterImgByAttr ...) picLinks
What you really want to do is to trim the picLinks before you download them. e.g.
main = do ... origLinks <- mapM (getLinksAfterImgByAttr ...) (take 10 picLinks)
Or make this lazy with:
  main = do ... origLinks <- mapM (unsafeInterleaveIO . getLinksAfterImgByAttr ...) picLinks

--
David Roundy
Department of Physics
Oregon State University
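
A minimal sketch of the unsafeInterleaveIO trick, again with a
hypothetical fetch in place of the real download: each action is
suspended until its result is demanded, so take 10 triggers only the
fetches it needs. The usual caveat applies: the suspended effects now
fire whenever evaluation happens to demand them.

import System.IO.Unsafe (unsafeInterleaveIO)

-- Hypothetical stand-in for the real download.
fetch :: String -> IO String
fetch url = do
  putStrLn ("fetching " ++ url)
  return (url ++ " fetched")

main :: IO ()
main = do
  let urls = map (\n -> "url" ++ show n) [1 .. 100 :: Int]
  -- each action is suspended, so mapM returns immediately
  results <- mapM (unsafeInterleaveIO . fetch) urls
  -- demanding the first 10 elements triggers only those 10 fetches
  mapM_ putStrLn (take 10 results)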

Reading code like the following:

  main = do
    s <- getContents
    let r = map processIt (lines s)
    putStr (unlines r)

I was thinking all IO operations were lazy. But in fact it looks like
getContents is lazy by design, not the whole of IO.

Thank you all for your helpful answers,
Olivier
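
For contrast, a small sketch of the two behaviours (assuming line-based
input on stdin): getContents reads its input on demand, while an
ordinary IO computation runs to completion before any result is
available.

import Control.Monad (replicateM)

-- getContents is one of the few IO actions that is lazy by design:
-- only enough stdin to produce three lines is ever read.
lazyVersion :: IO ()
lazyVersion = do
  s <- getContents
  putStr (unlines (take 3 (lines s)))

-- an ordinary IO computation is strict: all 1000 getLine calls run
-- to completion before 'take 3' ever sees the list.
strictVersion :: IO ()
strictVersion = do
  ls <- replicateM 1000 getLine
  putStr (unlines (take 3 ls))

main :: IO ()
main = lazyVersion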

On 6/22/07, David Roundy wrote:
> Or make this lazy with:
>
>   main = do ... origLinks <- mapM (unsafeInterleaveIO . getLinksAfterImgByAttr ...) picLinks
Just for info, I used your tip to bring laziness into the function that
fetches the URLs. It works great and is lazy now!

-- | create sections from tag type and attribute
getSectionsByTypeAndAttr :: String -> (String, String) -> String -> IO [[Tag]]
getSectionsByTypeAndAttr tagType attr url = do
  tags <- unsafeInterleaveIO $ liftM parseTags $ openURL url
  return $ filterByTypeAndAttr tagType attr tags
  where
    filterByTypeAndAttr :: String -> (String, String) -> [Tag] -> [[Tag]]
    filterByTypeAndAttr t a = sections (~== TagOpen t [a])

Thanks,
Olivier

On Fri, 22 Jun 2007, Olivier Boudry wrote:
> -- | create sections from tag type and attribute
> getSectionsByTypeAndAttr :: String -> (String, String) -> String -> IO [[Tag]]
> getSectionsByTypeAndAttr tagType attr url = do
>   tags <- unsafeInterleaveIO $ liftM parseTags $ openURL url
>   return $ filterByTypeAndAttr tagType attr tags
>   where
>     filterByTypeAndAttr :: String -> (String, String) -> [Tag] -> [[Tag]]
>     filterByTypeAndAttr t a = sections (~== TagOpen t [a])
I think if openURL is not lazy, then the unsafeInterleaveIO will not help much because it only defers the whole computation until the first part of the result is requested. One call to unsafeInterleaveIO cannot divide a big IO action into small pieces of lazily triggered actions. I think unsafeInterleaveIO is unnecessary here.
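
A sketch of that point, with putStrLn standing in for the download: the
single unsafeInterleaveIO delays when the action starts, but once any
part of its result is demanded the whole action runs.

import System.IO.Unsafe (unsafeInterleaveIO)

-- stands in for a download that is strict once it starts
bigAction :: IO [Int]
bigAction = do
  putStrLn "the whole action runs now"
  return [1 .. 1000000]

main :: IO ()
main = do
  xs <- unsafeInterleaveIO bigAction   -- nothing happens yet
  putStrLn "action not yet performed"
  -- demanding even one element performs the complete IO action:
  print (head xs)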

> It does the job, but for testing I just wanted to take 10 links to
> reduce the program's run time, hoping that Haskell's laziness would
> magically read the minimum amount of data required to get the first
> 10 links out of this set of pages.
I haven't read the details of the post, but I think it's due to IO
operations not being lazy by default. Have a look at this thread, it
might help:

http://groups.google.com/group/fa.haskell/browse_thread/thread/5deaee07a8398d07/d5b3c85aa8c2860c?lnk=st&q=Marc+Weber+lazyIO&rnum=1&hl=en#d5b3c85aa8c2860c

All that is done there is throwing in an unsafeInterleaveIO at some
locations. Because I didn't want to implement all the list functions
again, I had the idea of inventing a LazyIO monad (which calls
unsafeInterleaveIO automatically). But doing this too often resulted in
no list processing at all ;)

I hope this gives you a hint to look more stuff up on the wiki using
the search, etc. If this didn't help, post again and I'll have a closer
look.

Marc Weber
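
For reference, a minimal sketch of what such a LazyIO wrapper might look
like (a reconstruction of the idea, not the code from the linked
thread): bind suspends each step with unsafeInterleaveIO, so actions
fire only as their results are demanded.

import System.IO.Unsafe (unsafeInterleaveIO)

-- A newtype over IO whose bind defers each step until its result is
-- demanded, so list functions over the results stay lazy.
newtype LazyIO a = LazyIO { runLazyIO :: IO a }

instance Functor LazyIO where
  fmap f (LazyIO m) = LazyIO (fmap f m)

instance Applicative LazyIO where
  pure = LazyIO . pure
  LazyIO mf <*> LazyIO mx = LazyIO (mf <*> mx)

instance Monad LazyIO where
  LazyIO m >>= f = LazyIO $ do
    x <- unsafeInterleaveIO m
    runLazyIO (f x)

-- lift an IO action into LazyIO, suspending it
lazily :: IO a -> LazyIO a
lazily = LazyIO . unsafeInterleaveIO

With this, runLazyIO (mapM (lazily . fetch) urls) would behave much like
the explicit unsafeInterleaveIO version shown earlier, without rewriting
the list functions.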

Marc,

Thanks for the link. Your LazyIO monad is really interesting. Do you
know if this construct exists in GHC? (This question was left open in
that thread.)

Olivier

On Fri, Jun 22, 2007 at 01:51:12PM -0400, Olivier Boudry wrote:
> Marc,
>
> Thanks for the link. Your LazyIO monad is really interesting. Do you
> know if this construct exists in GHC? (This question was left open in
> that thread.)
I couldn't find it; that's why I've written it. I think there is not
much interest, given that the question was left open.

Marc
participants (5): David Roundy, Henning Thielemann, Malcolm Wallace, Marc Weber, Olivier Boudry