

Mateusz Kowalczyk
Hi,
Whenever I write a program which has to interface with the web (scraping, POSTing, whatever), I never know how to properly test it. What I have been doing to date is fetching some pages ahead of time, saving them locally, and running my parsers (or whatever it is I'm coding at the moment) against those.
The problem with this approach is that we can't test a whole lot: if we have a crawler, how do we test that it goes to the next page properly? Testing things like logging in seems close to impossible; we can only test whether we are making a well-formed POST.
Let's stick to the crawler example. How would you test that it follows links? Do people set up local webservers with a few dummy pages they've downloaded? Do you just inspect that the GET and POST requests ‘look’ correct?
Assume that we don't own the sites, so we can't let the program run tests in the wild: page content might change (parser tests fail), APIs might change (unexpected stuff comes back), our account might get locked (I guess they didn't like us logging in 20 times in the last hour during tests), etc.
Of course there is nothing that can prevent upstream changes, but I'm wondering how we can test the more-or-less static stuff without calling out into the world.
Chris Warburton
I'd use a "Self-Initialising Fake": http://martinfowler.com/bliki/SelfInitializingFake.html

1) Make sure you're not hard-coding the IO functions you're using, i.e. use dependency injection, either via explicit parameters or via a typeclass.

2) If you wrote a typeclass or some other fancy abstraction for (1), implement it using the real HTTP procedures you want to use.

3) Write another implementation which reads canned responses from files, e.g. comparing filenames to hashed requests. If no file is found, it should use the real HTTP implementation to get the data, store it in an appropriate file, then return it.

Use the implementation from (2) in production (or just the raw procedures if you didn't abstract them) and use the implementation from (3) in tests. This lets you test against real responses without having to rely on the network, without having to worry about hammering other people's machines, etc. Since all responses are static, it won't model dynamic server-side processing, but it sounds like you're OK with that.

Note that caching won't work for randomised requests, e.g. if your data is coming from QuickCheck. You can either limit your ranges to ensure more overlap, e.g. using randomInt % 20 instead of randomInt, or write a custom pattern-matching/response-rewriting layer on top of the cache.

For extra confidence that your tests are safe, e.g. if you have some highly-randomised tests which will often miss the cache, you could also write a pure implementation based around a Data.Map, returning a canned 404 response for anything else. A simple driver function can populate the Map from any existing cache files, ensuring that the tests themselves are always pure.

Cheers,
Chris
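[Editor's note: below is a minimal Haskell sketch of the approach Chris describes, assuming the HTTP, hashable, containers, directory, and filepath packages. All names here (Client, realClient, cachingClient, pureClient, loadCanned) are illustrative, not from any real library.]

import Data.Hashable (hash)
import qualified Data.Map as Map
import Network.HTTP (getRequest, getResponseBody, simpleHTTP)
import System.Directory (createDirectoryIfMissing, doesFileExist,
                         getDirectoryContents)
import System.FilePath ((</>))

-- (1) Don't hard-code the IO: the crawler only ever calls 'fetch',
-- which we inject. A record of functions is a lightweight
-- alternative to a typeclass here.
newtype Client = Client { fetch :: String -> IO String }

-- (2) The real implementation, via the HTTP package.
realClient :: Client
realClient = Client $ \url ->
  simpleHTTP (getRequest url) >>= getResponseBody

-- Cache files store the request alongside the body (via show/read)
-- so that 'loadCanned' can rebuild a pure Map from them later.
readPair :: FilePath -> IO (String, String)
readPair f = fmap read (readFile f)

-- (3) The self-initialising fake: look for a canned response keyed
-- by the hashed request; on a miss, fall back to the real client
-- and record what came back for next time.
cachingClient :: FilePath -> Client -> Client
cachingClient dir real = Client $ \url -> do
  let file = dir </> show (hash url)
  hit <- doesFileExist file
  if hit
    then fmap snd (readPair file)
    else do
      body <- fetch real url
      createDirectoryIfMissing True dir
      writeFile file (show (url, body))
      return body

-- The fully pure variant: serve canned responses from a Data.Map,
-- answering with a canned 404 body for anything it doesn't know.
pureClient :: Map.Map String String -> Client
pureClient canned = Client $ \url ->
  return (Map.findWithDefault "404 Not Found" url canned)

-- Driver: rebuild the Map from whatever the caching client recorded.
loadCanned :: FilePath -> IO (Map.Map String String)
loadCanned dir = do
  names <- getDirectoryContents dir
  let files = [dir </> n | n <- names, n `notElem` [".", ".."]]
  pairs <- mapM readPair files
  return (Map.fromList pairs)

Production code would take realClient; tests would take cachingClient "test/cache" realClient, and the paranoid variant is pureClient <$> loadCanned "test/cache", which can never touch the network.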
participants (2)
- Chris Warburton
- Mateusz Kowalczyk