Calling impure functions from pure code

Hello,

I'm trying to write my first real program in Haskell: a web page scraper. At the top level, which is impure anyway, I fetch the top-level pages (which is IO), then call the pure functions that parse their structure. So far so good; that was the first part of the program, and up to this point I'm fairly sure I did it right (for the structure, anyway). Most demo programs and Haskell books cover this: you do the IO at the top level, then you process, then you come back to the top level to print your results, for instance.

But here comes the problem: these pure functions that parse the structure sometimes find links and must open another page on the site. Opening that new page is IO and can't be pure. If I call IO functions from those pure functions, they're not pure anymore, and since those are basically leaf calls, my entire program becomes impure.

I had this idea that I would make some sort of input data structure which would behave like a lazy String read from a file: doing IO behind the scenes while the caller doesn't even realize it, so the caller can stay pure. Some sort of fake web server, web site or HTML page data structure, which is lazy. I would then give this data structure to my pure functions that parse the data; they call functions on that structure and can stay pure. But it sounds really contrived and is probably completely the wrong solution.

I can see that this is a very basic question which has probably been answered hundreds of times, but I could not find the answer so far.

Thank you!

Emmanuel

Hi Emmanuel,

When parsing the string representing a page, you could save all the links you encounter. After the parsing you would load the linked pages and start parsing again. You would repeat this until no more links are returned or a maximum depth is reached.

Greetings,
Daniel
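A minimal sketch of that loop, under the assumption of two hypothetical helpers (fetchPage and parsePage); only fetchPage does IO, the parsing itself stays pure:

import Data.List (nub)

type URL = String

-- Hypothetical: the only IO action, fetching one page.
fetchPage :: URL -> IO String
fetchPage = undefined

-- Hypothetical pure parser: returns the extracted data plus the links it found.
parsePage :: String -> ([String], [URL])
parsePage = undefined

-- Repeat until no more links are returned or the maximum depth is reached.
crawl :: Int -> [URL] -> IO [String]
crawl 0     _    = return []
crawl _     []   = return []
crawl depth urls = do
  pages <- mapM fetchPage urls
  let (found, links) = unzip (map parsePage pages)
  deeper <- crawl (depth - 1) (nub (concat links))
  return (concat found ++ deeper)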

Thanks for the tip. That sounds much more reasonable than what I mentioned. It seems a bit "spaghetti" to me in a way, though (but maybe I just have to get used to the Haskell way).

To be more specific about what I want to do: I want to parse TV programs. On the first page I have the daily listing for a channel: start/end hour, title, category, and possibly a link. To fully parse one TV program I can follow the link, if it's present, and get the extra info which is there (summary, pictures...).

So the first scheme that comes to mind is a function which takes the DOM tree of the daily page and returns the list of programs for that day. Instead, what I must do is return incomplete programs: the data object would have the link filled in, if it's available, but the summary, pictures and so on would be empty. Then I have a "second pass" in the caller function where, for programs which have a link, I fetch the extra page and call a second function which fills in the extra data (thankfully, if pictures are present I only store their URLs, so it stops there; no need for a third pass for the pictures).

It annoys me that the first function returns "incomplete" objects... It somehow feels wrong. Now that I've described my problem in more detail, maybe you can think of a better way of doing it?

And otherwise I guess this is the policy when writing Haskell code: absolutely avoid spreading impure/IO-tainted code, even if it negatively affects the general structure of the program?

Thanks again for the tip! That's definitely what I'll do if nothing better is suggested. It is probably the best way to do it if you want to separate IO from "pure" code.

Emmanuel
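A rough sketch of the two-pass shape described above, with made-up types and helpers (the Maybe fields are what makes the first pass's results "incomplete"):

type URL = String

-- Hypothetical record; progLink is filled by the first pass, progSummary by the second.
data Program = Program
  { progTitle   :: String
  , progLink    :: Maybe URL
  , progSummary :: Maybe String
  } deriving Show

-- First pass: pure, works only on the daily page (a String stands in for the DOM tree here).
parseDailyPage :: String -> [Program]
parseDailyPage = undefined

-- Hypothetical helpers for the second pass.
fetchPage :: URL -> IO String
fetchPage = undefined

parseDetailPage :: String -> String
parseDetailPage = undefined

-- Second pass, driven by the caller: only programs with a link cause any IO.
fillDetails :: Program -> IO Program
fillDetails p = case progLink p of
  Nothing  -> return p
  Just url -> do
    page <- fetchPage url
    return p { progSummary = Just (parseDetailPage page) }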

There's a better option, in my opinion: use the monad transformer capability of the parser you are using (I'm assuming you are using parsec for parsing).

If you check the Hackage docs for parsec, you'll see that ParsecT is an instance of MonadIO. That means at any point during the parsing you can write liftIO $ <any IO action> and use the result in your parsing. Here's an example of what that might look like:
import Control.Monad.IO.Class
import Control.Monad (when)
import Text.Parsec
import Text.Parsec.Char

parseTvStuff :: (MonadIO m) => ParsecT String u m (Char, Maybe ())
parseTvStuff = do
  string "tvshow:"
  c <- anyChar
  morestuff <- if c == 'x'
    then fmap Just $ liftIO $ putStrLn "run an http request, parse the result, and store the result in morestuff as a maybe"
    else return Nothing
  return (c, morestuff)
So you would run an HTTP request if you get back something that looks like it could be worth further parsing. Then you just parse that content with a separate parser, store it in your data structure, and continue parsing the rest of the first page with the original parser if you wish.
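For completeness, a hedged sketch of how such a parser could be run from IO, appended to the module above; the main function and the input string are made up for illustration:

main :: IO ()
main = do
  -- runParserT returns its result in the underlying monad, here IO,
  -- so the liftIO calls inside parseTvStuff actually run.
  result <- runParserT parseTvStuff () "(example)" "tvshow:x"
  case result of
    Left err         -> print err
    Right (c, extra) -> print (c, extra)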

Hello,

Thanks for the tip! I'm in fact using dom-selector (http://hackage.haskell.org/package/dom-selector), which is based on xml-conduit and html-conduit. The reason is that it offers CSS selectors and is generally much higher-level than what I would do with parsec. So I'm not sure whether what you wrote applies.

Actually, your function doing the parsing here is not pure as such: it's a do block and it's ordered. What I have done so far is that dom-selector gives me the DOM structure of the page (so that parsing part is done for me), and then I give that DOM structure to my function; the examination of the DOM structure is completely without a do block, it's not ordered, it's pure. In that way my "parsing" (really an examination of the DOM tree) is completely split off from IO or any other monad.

I think when you are within parsec as you mentioned, you are within the parsec monad (bear in mind I don't really understand all of this for now), and to do IO you need to get to the IO monad, and for that you use liftIO. In that case, that's a different problem from the one I'm having.

Emmanuel
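A minimal sketch of that split, assuming xml-conduit's cursor API (which dom-selector builds on); extractLinks is a made-up example of a pure examination function:

{-# LANGUAGE OverloadedStrings #-}
import Control.Monad ((>=>))
import qualified Data.Text as T
import Text.XML (Document)
import Text.XML.Cursor (fromDocument, ($//), element, attribute)

-- Pure: walking an already-parsed DOM needs no IO and no do-block ordering.
extractLinks :: Document -> [T.Text]
extractLinks doc = fromDocument doc $// element "a" >=> attribute "href"

The Document would be produced once, in IO, by html-conduit's parser at the top level; everything below that can keep the pure shape above.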

Hi Emmanuel,
> Now that I've described my problem in more detail, maybe you can think of a better way of doing it?
In this case I don't think it's worth separating it.
> And otherwise I guess this is the policy when writing Haskell code: absolutely avoid spreading impure/IO-tainted code, even if it negatively affects the general structure of the program?
There should be a reason for separating pure and impure code. If your code doesn't get easier to reason about or more reusable, then there's little reason for the separation. In the end, the separation should result in better programs.

Greetings,
Daniel

Hi,

On Fri, Oct 12, 2012 at 03:28:39PM +0200, Emmanuel Touzery wrote:
> It annoys me that the first function returns "incomplete" objects... It somehow feels wrong.
Maybe you would feel better about it if you put both functions under one "umbrella" function like this:
parseProgramme = getDetails . getProgramme
  where getProgramme = ...
        getDetails   = ...
That way, your "incomplete" objects would never be exposed to the "end user" (even if it's just you). It also gives you an abstraction that may pay off in the future when, say, you want to fetch the pictures as well — it would just be a matter of adding one more function under the "umbrella".

Overall, splitting your algorithm into simple steps — steps that each do just a part of the work and return incomplete objects — is the way to go.

Regards,
Alexander Batischev

Hello,
You have a point about splitting the code into smaller functions. I would just rather have getDetails called from getProgramme than have a parent call both separately; and the parent must make the connection by doing the IO if I want both pieces to stay pure. That is what bothers me, mostly.

Emmanuel
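To make that concrete, a hedged sketch of that parent with types spelled out (all names and types are placeholders): the two passes keep their signatures, and the "umbrella" in between is the only place where the IO connection happens, so callers never see the incomplete values.

-- Placeholder types, purely for illustration.
data Document
data BasicProgramme
data Programme

getProgramme :: Document -> [BasicProgramme]        -- pure examination of the DOM
getProgramme = undefined

getDetails :: [BasicProgramme] -> IO [Programme]    -- fetches the extra pages
getDetails = undefined

-- The umbrella: the only place that connects the two passes through IO.
parseProgramme :: Document -> IO [Programme]
parseProgramme = getDetails . getProgramme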

On Oct 12, 2012, at 7:19 AM, Emmanuel Touzery wrote:
> You have a point about splitting the code into smaller functions. I would just rather have getDetails called from getProgramme than have a parent call both separately; and the parent must make the connection by doing the IO if I want both pieces to stay pure. That is what bothers me, mostly.
Think about this from a testing perspective: how do you verify that your code which identifies links is working? If the link finding is mixed in with the link retrieving, you end up having to dummy out the IO. Now suppose the code becomes more complicated and, as Alexander suggests, you later want to retrieve images too: you need to mock out the image retrieval as well.

Perhaps you should think of this as creating a matching DOM-like structure. First your tree starts out empty. Then you parse the top level and return a new tree with data and dangling nodes that are links needing to be followed. You check "have I gone as deep as I would like?". If not, pass the new partial tree to the retrieval routine and start filling it in. Now you are back to the depth check. When the retrieval has reached its goal, the tree is returned and it is as populated as it can be. Now the rest of your code can use the tree for whatever it needs.

Remember to always ask "how do I test this?". One of the key reasons to keep purity is that it makes testing so much easier. Every small piece can be verified.
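A hedged sketch of that "matching DOM-like structure" with dangling link nodes (all types and helpers are made up):

type URL = String

-- A tree whose leaves are either extracted data or a link not yet followed.
data PageTree
  = Node String [PageTree]
  | Dangling URL

-- Hypothetical helpers: fetching is IO, parsing is pure and may itself
-- produce new Dangling nodes.
fetchPage :: URL -> IO String
fetchPage = undefined

parsePage :: String -> PageTree
parsePage = undefined

-- Follow dangling links until the depth limit is reached; the result is a
-- tree as populated as it can be within that limit.
fill :: Int -> PageTree -> IO PageTree
fill depth (Node d kids)     = Node d <$> mapM (fill depth) kids
fill 0     leaf@(Dangling _) = return leaf
fill depth (Dangling url)    = do
  page <- fetchPage url
  fill (depth - 1) (parsePage page)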

On Fri, Oct 12, 2012 at 8:58 PM, Sean Perry wrote:
> Remember to always ask "how do I test this?". One of the key reasons to keep purity is that it makes testing so much easier. Every small piece can be verified.
Thank you for your opinion; it does bring in another set of concerns. What you suggest is the approach suggested by Daniel Trstenjak in the very first answer, and it definitely has value, but the question is code readability. It's a fine balance. That's what I was asking at the beginning: how hard should we strive for pure code? Here the balance seems to depend on the person (while I thought it was dogma in the Haskell community: as much pure code as possible).

In this case I think the purity means more code to be written (re-reading and re-writing data structures instead of writing them just once) and I'm not sure it's worth the cost; I'd say Daniel Trstenjak's second answer convinced me. But I'm just starting with Haskell and I guess I'll get a clearer sense of this with time.

But it's also good to see that there is consensus on how to code this, if we want to maximize pure code.

Emmanuel

On Oct 12, 2012, at 12:44 PM, Emmanuel Touzery wrote:
> But it's also good to see that there is consensus on how to code this, if we want to maximize pure code.
I am still working my way through Haskell as well; I still code more in Python or C++. In my experience, mixing code and I/O is faster to develop early on. Then features start coming in, the code size grows, and you start to think about maintaining it. In Python or C++ this is when you would start thinking about refactoring the I/O out of the code. I like that Haskell gives me strong nudges in this direction from the beginning.

Typical first draft of a Python function:

def foo(filename):
    # open file
    # read data
    # work on data
    # return result

Typical second draft:

def foo2(handle):
    # read data from handle; now it works with all kinds of I/O sources
    # work on data
    # return result

Common ending point:

def foo3(data):
    # work on data; now we do not care where the data came from
    # return result

In programming we often trade one efficiency for another. If the parsing takes a few more seconds but I can hand the code off to someone else to add the next feature, then that is a worthwhile trade-off for me. Or, more often, I only touch the code a few times a year; the smaller and better contained the pieces, the quicker I can get back to work.

A frequent problem I have as a paid developer is dealing with code bases where management fears change because testing was not a consideration. So now, a few years in, there is a pile of code that was well thought out originally but is now a jumbled mess, and no one knows what happens when you twiddle just one part of it. In my opinion every developer needs to internalize this and plan for the future from the beginning.

Now, we know that some things are going to be 30 lines long and not change much later. In those cases over-engineering is not worth it, obviously. But avoiding better design out of concern for performance is a path to failure.
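For comparison, a hedged Haskell sketch of the same progression (the names and the Result type are invented for illustration):

import System.IO (Handle, hGetContents)

type Result = Int  -- hypothetical result type

-- First draft: opens the file itself, so it is tied to the filesystem.
fooFile :: FilePath -> IO Result
fooFile path = fooData <$> readFile path

-- Second draft: works with any handle (file, pipe, socket, ...).
fooHandle :: Handle -> IO Result
fooHandle h = fooData <$> hGetContents h

-- Common ending point: pure, does not care where the data came from.
fooData :: String -> Result
fooData = length . lines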

Hi,
> Then I have a "second pass" in the caller function where, for programs which have a link, I fetch the extra page and call a second function which fills in the extra data (thankfully, if pictures are present I only store their URLs, so it stops there; no need for a third pass for the pictures).
> It annoys me that the first function returns "incomplete" objects... It somehow feels wrong.
I just realized I have been thinking about it the wrong way: in Haskell data is immutable, so the first function wouldn't return incomplete "objects" that get completed later; the second function will completely re-create the data anyway. So I would have duplicate data structures, one without the extra data and one with. Or something like that.
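One hedged way to make those "duplicate" structures explicit (types invented for illustration): keep a basic record and a full record that embeds it, so neither is ever half-filled, and the unchanged part is shared rather than rebuilt:

data BasicProgram = BasicProgram
  { title :: String
  , link  :: Maybe String
  }

data FullProgram = FullProgram
  { basic   :: BasicProgram   -- the first-pass value is simply reused
  , summary :: Maybe String
  , picture :: Maybe String
  }

-- The "second function" builds a FullProgram; thanks to immutability the
-- embedded BasicProgram is shared, not copied.
complete :: BasicProgram -> Maybe String -> Maybe String -> FullProgram
complete = FullProgram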
>> And otherwise I guess this is the policy when writing Haskell code: absolutely avoid spreading impure/IO-tainted code, even if it negatively affects the general structure of the program?
> There should be a reason for separating pure and impure code. If your code doesn't get easier to reason about or more reusable, then there's little reason for the separation.
Yes... I thought the goal was to strive for as much pure code as possible (which is easier to test and so on), but in this case (and obviously it's a small program) it doesn't seem tractable. I wonder what the pure/impure ratio is in bigger programs.

Thank you...

Emmanuel

On Fri, Oct 12, 2012 at 8:28 AM, Emmanuel Touzery wrote:
> To be more specific about what I want to do: I want to parse TV programs. On the first page I have the daily listing for a channel: start/end hour, title, category, and possibly a link. To fully parse one TV program I can follow the link, if it's present, and get the extra info which is there (summary, pictures...).
If this were me, I would write the following:

data ChannelListing = ChannelListing [BasicProgramInfo]

-- | Summary of a program
data BasicProgramInfo = BasicProgramInfo
  { basicStartTime :: ...
  , basicEndTime   :: ...
  , basicTitle     :: ...
  , basicUrl       :: URL
  }

-- | Full details of a program
data ProgramInfo = ...

fetchChannelListing :: ChannelId -> IO ChannelListing
fetchProgramInfo :: BasicProgramInfo -> IO ProgramInfo

And then I would string my program together from these primitives. That way large portions of the code can be built up from the pure data types, but the top level can load them up as needed with impure functions.

This is just my first impression, though.

Antoine
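Stringing those primitives together at the top level might then look roughly like this (fetchFullListing is a made-up name, and it assumes Antoine's elided type definitions are filled in):

-- Hypothetical driver: all the IO stays at this level.
fetchFullListing :: ChannelId -> IO [ProgramInfo]
fetchFullListing channel = do
  ChannelListing basics <- fetchChannelListing channel
  mapM fetchProgramInfo basics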
participants (6)
- Alexander Batischev
- Antoine Latter
- Daniel Trstenjak
- David McBride
- Emmanuel Touzery
- Sean Perry