
as was pointed out on the programming reddit [1], crawling of the haskell wiki is forbidden, since http://www.haskell.org/robots.txt contains

User-agent: *
Disallow: /haskellwiki/

and indeed, a google search http://www.google.ch/search?q=haskell+wiki gives the old wiki, i.e. http://haskell.org/hawiki/FrontPage, rather than http://haskell.org/haskellwiki/Haskell. in other words, it seems like the most relevant search engines do not index the haskell wiki.

i reported this on #haskell about a week ago [2], and this setting was acknowledged to be a measure deliberately taken to curb excessive load from spiders crawling the wiki. i still believe this is a highly harmful stance, and just to make sure that whoever may feel concerned is at least aware of the problem, i am now cross-posting to this list as well.

kind regards,
v.

[1] http://programming.reddit.com/info/1qzjx/comments/c1r3ok
[2] http://tuukka.iki.fi/tmp/haskell-2007-05-16.html

On Tue, 22 May 2007, Vincent Kraeutler wrote:
as was pointed out on the programming reddit [1], crawling of the haskell wiki is forbidden, since http://www.haskell.org/robots.txt contains
User-agent: *
Disallow: /haskellwiki/
and indeed, a google search http://www.google.ch/search?q=haskell+wiki gives the old wiki, i.e. http://haskell.org/hawiki/FrontPage, rather than http://haskell.org/haskellwiki/Haskell
This also applies to Haskell mailing lists as I mentioned recently: http://www.haskell.org/pipermail/haskell-cafe/2007-April/025006.html

as was pointed out on the programming reddit [1], crawling of the haskell wiki is forbidden, since http://www.haskell.org/robots.txt contains
User-agent: *
Disallow: /haskellwiki/
i agree that having the wiki searchable would be preferred, but was told that there were performance issues. even giving Googlebot a wider range than other spiders won't help if, as the irc page suggests, some of those faulty bots pretend to be Googlebot..
This also applies to Haskell mailing lists as I mentioned recently: http://www.haskell.org/pipermail/haskell-cafe/2007-April/025006.html
ah, yes, sorry. there was an ongoing offlist discussion at the time, following an earlier thread on ghc-users. Simon M has since changed robots.txt to the above, which *does* permit indexing of the pipermail archives, as long as google can find them. that still doesn't mean that they'll show up first in google's ranking system. for instance, if you google for 'ghc manuals online' (that's the subject of that earlier thread i mentioned), you'll get mail-archive and nabble first, but the haskell.org archives are there as well now, as you can see by googling for 'ghc manuals online inurl:pipermail'.

also, the standard test of googling for 'site:haskell.org' looks a lot healthier these days. and googling for 'inurl:ghc/docs/latest LANGUAGE pragma' gives me two relevant answers (not the most specific sub-page).

so the situation for mailing lists and online docs seems to have improved, but there is still the wiki indexing/rogue bot issue, and lots of fine tuning still to be done (together with watching the logs to spot any issues arising out of relaxing those restrictions). perhaps someone on this list would be willing to volunteer to look into those robots/indexing issues on haskell.org?-)

claus

On Tue, 2007-05-22 at 14:40 +0100, Claus Reinke wrote:
so the situation for mailing lists and online docs seems to have improved, but there is still the wiki indexing/rogue bot issue, and lots of fine tuning (together with watching the logs to spot any issues arising out of relaxing those restrictions). perhaps someone on this list would be willing to volunteer to look into those robots/indexing issues on haskell.org?-)
The main problem, and the reason for the original (temporary!) measure, was bots indexing all possible diffs between old versions of wiki pages, i.e. URLs like:

http://haskell.org/haskellwiki/?title=Quicksort&diff=9608&oldid=9607

For pages with long histories this O(n^2) number of requests starts to get quite large, and the wiki engine does not seem well optimised for generating arbitrary diffs. So we ended up with bots holding open many HTTP server connections. They were not actually causing much server CPU load or generating much traffic, but once the number of nearly-hung connections reached the HTTP child process limit we were effectively in a DoS situation.

So if we can ban bots from the page histories, or turn those views off for the bot user agents, or something like that, then we might have a cure. Perhaps we just need to upgrade our MediaWiki software, or find out how other sites using this software deal with the same issue of bots reading page histories.

Duncan
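(To put rough, purely illustrative numbers on the O(n^2) point: a page with n revisions exposes on the order of n*(n-1)/2 distinct oldid/diff pairs, so a single page with 200 revisions already has close to 20,000 possible diff URLs, each of them an expensive dynamic request. These figures are back-of-the-envelope, not measurements from the haskell.org logs.)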

On Tue, 22 May 2007, Duncan Coutts wrote:
So if we can ban bots from the page histories or turn them off for the bot user agents or something then we might have a cure. Perhaps we just need to upgrade our media wiki software or find out how other sites using this software deal with the same issue of bots reading page histories.
What about adding the "nofollow" flag in the meta tags of the history pages?
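(For reference, the tag in question is the standard robots meta tag that MediaWiki can emit in the page head; on history and diff views it would look something like the line below. This is only an illustration of the mechanism, not markup copied from the haskell.org installation.)

<meta name="robots" content="noindex,nofollow" />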

On Tue, 2007-05-22 at 16:26 +0200, Henning Thielemann wrote:
On Tue, 22 May 2007, Duncan Coutts wrote:
So if we can ban bots from the page histories or turn them off for the bot user agents or something then we might have a cure. Perhaps we just need to upgrade our media wiki software or find out how other sites using this software deal with the same issue of bots reading page histories.
What about adding the "nofollow" flag in the meta tags of the history pages?
Sounds like a good idea. If someone can do that then great.

Duncan

So if we can ban bots from the page histories or turn them off for the bot user agents or something then we might have a cure. Perhaps we just need to upgrade our media wiki software or find out how other sites using this software deal with the same issue of bots reading page histories.
What about adding the "nofollow" flag in the meta tags of the history pages?
Sounds like a good idea. If someone can do that then great.
Apparently other people have used URL rewriting to keep robots off subsets of the wiki pages:

http://www.gustavus.edu/gts/webservices/2006/08/14/robots-in-the-wiki/
http://codex.gallery2.org/Gallery2:How_to_keep_robots_off_CPU_intensive_pages

Is that an option for our installation?

Alistair
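(The pages linked above take roughly this shape: Apache mod_rewrite rules that refuse the expensive views to anything identifying itself as a crawler. A sketch of the idea follows; it is untested, and the user-agent pattern and query-string checks are placeholders rather than rules copied from those pages or from haskell.org.)

RewriteEngine On
# Anything that identifies itself as a bot/crawler/spider...
RewriteCond %{HTTP_USER_AGENT} (bot|crawler|spider|slurp) [NC]
# ...and asks for a history, diff or old-revision view...
RewriteCond %{QUERY_STRING} (action=history|diff=|oldid=)
# ...gets a 403 instead of an expensive dynamically generated page.
RewriteRule .* - [F]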

On Tue, 22 May 2007 15:05:48 +0100, Duncan Coutts wrote:
On Tue, 2007-05-22 at 14:40 +0100, Claus Reinke wrote:
so the situation for mailing lists and online docs seems to have improved, but there is still the wiki indexing/rogue bot issue, and lots of fine tuning (together with watching the logs to spot any issues arising out of relaxing those restrictions). perhaps someone on this list would be willing to volunteer to look into those robots/indexing issues on haskell.org?-)
The main problem, and the reason for the original (temporary!) measure was bots indexing all possible diffs between old versions of wiki pages. URLs like:
http://haskell.org/haskellwiki/?title=Quicksort&diff=9608&oldid=9607
For pages with long histories this O(n^2) number of requests starts to get quite large and the wiki engine does not seem well optimised for getting arbitrary diffs. So we ended up with bots holding open many http server connections. They were not actually causing much server cpu load or generating much traffic but once the number of nearly hung connections got up to the http child process limit then we are effectively in a DOS situation.
So if we can ban bots from the page histories or turn them off for the bot user agents or something then we might have a cure. Perhaps we just need to upgrade our media wiki software or find out how other sites using this software deal with the same issue of bots reading page histories.
http://en.wikipedia.org/robots.txt

Wikipedia uses URLs starting with /w/ for "dynamic" pages (well, all pages are dynamic in a sense, but you know what I mean, I hope) and then puts /w/ in robots.txt.

--
Robin
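(For comparison, the relevant part of Wikipedia's robots.txt is roughly the following; article pages live under /wiki/ and stay crawlable, while every index.php-style URL for edits, diffs and histories lives under /w/ and is off limits to robots.)

User-agent: *
Disallow: /w/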

On 5/22/07, Robin Green wrote:
On Tue, 22 May 2007 15:05:48 +0100, Duncan Coutts wrote:
On Tue, 2007-05-22 at 14:40 +0100, Claus Reinke wrote:
so the situation for mailing lists and online docs seems to have improved, but there is still the wiki indexing/rogue bot issue, and lots of fine tuning (together with watching the logs to spot any issues arising out of relaxing those restrictions). perhaps someone on this list would be willing to volunteer to look into those robots/indexing issues on haskell.org?-)
The main problem, and the reason for the original (temporary!) measure was bots indexing all possible diffs between old versions of wiki pages. URLs like:
http://haskell.org/haskellwiki/?title=Quicksort&diff=9608&oldid=9607
For pages with long histories this O(n^2) number of requests starts to get quite large and the wiki engine does not seem well optimised for getting arbitrary diffs. So we ended up with bots holding open many http server connections. They were not actually causing much server cpu load or generating much traffic but once the number of nearly hung connections got up to the http child process limit then we are effectively in a DOS situation.
So if we can ban bots from the page histories or turn them off for the bot user agents or something then we might have a cure. Perhaps we just need to upgrade our media wiki software or find out how other sites using this software deal with the same issue of bots reading page histories.
http://en.wikipedia.org/robots.txt
Wikipedia uses URLs starting with /w/ for "dynamic" pages (well, all pages are dynamic in a sense, but you know what I mean I hope.) And then puts /w/ in robots.txt.
Does anyone know the status of applying a workaround such as this? I really miss being able to find things on the haskell wiki via google search. I don't like the MediaWiki search at all. I did a google search earlier tonight but I didn't get wiki pages, so I assume nothing has been done yet. Please make the wiki indexed again as soon as possible (if at all possible). Otherwise, I feel like it's a waste of time to keep contributing to wiki pages.

Thanks,
Jason

Jason Dagit wrote:
On 5/22/07, Robin Green wrote:
On Tue, 22 May 2007 15:05:48 +0100, Duncan Coutts wrote:
On Tue, 2007-05-22 at 14:40 +0100, Claus Reinke wrote:
so the situation for mailing lists and online docs seems to have improved, but there is still the wiki indexing/rogue bot issue, and lots of fine tuning (together with watching the logs to spot any issues arising out of relaxing those restrictions). perhaps someone on this list would be willing to volunteer to look into those robots/indexing issues on haskell.org?-)
The main problem, and the reason for the original (temporary!) measure was bots indexing all possible diffs between old versions of wiki pages. URLs like:
http://haskell.org/haskellwiki/?title=Quicksort&diff=9608&oldid=9607
For pages with long histories this O(n^2) number of requests starts to get quite large and the wiki engine does not seem well optimised for getting arbitrary diffs. So we ended up with bots holding open many http server connections. They were not actually causing much server cpu load or generating much traffic but once the number of nearly hung connections got up to the http child process limit then we are effectively in a DOS situation.
So if we can ban bots from the page histories or turn them off for the bot user agents or something then we might have a cure. Perhaps we just need to upgrade our media wiki software or find out how other sites using this software deal with the same issue of bots reading page histories.
http://en.wikipedia.org/robots.txt
Wikipedia uses URLs starting with /w/ for "dynamic" pages (well, all pages are dynamic in a sense, but you know what I mean I hope.) And then puts /w/ in robots.txt.
Does anyone know the status of applying a workaround such as this? I really miss being able to find things on the haskell wiki via google search. I don't like the mediawiki search at all.
The status is that nobody has stepped up and volunteered to look after haskell.org's robots.txt file. It needs someone with the time and experience to look into what needs doing, make the changes, fix problems as they arise, and update it as necessary in the future. Anyone?

Cheers,
Simon

So if we can ban bots from the page histories or turn them off for the bot user agents or something then we might have a cure. Perhaps we just need to upgrade our media wiki software or find out how other sites using this software deal with the same issue of bots reading page histories.
The wiki could be configured to use /haskellwiki/index.php?... URLs for diffs (I believe this can be done by changing $wgScript). Then robots.txt could be changed to

Disallow: /haskellwiki/index.php

which bans robots from everything except normal pages.
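A rough sketch of that combination (untested; it assumes the wiki is served under /haskellwiki/ and that plain article URLs are produced via $wgArticlePath, which is what the current pretty URLs suggest):

In LocalSettings.php:

$wgScript      = '/haskellwiki/index.php';
$wgArticlePath = '/haskellwiki/$1';

In robots.txt:

User-agent: *
Disallow: /haskellwiki/index.php

With that, article URLs such as /haskellwiki/Haskell remain indexable, while every edit, diff and history view goes through index.php and is excluded.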
Twan

The wiki could be configured to use /haskellwiki/index.php?... URLs for diffs (I believe this can be done by changing $wgScript). Then robots.txt could be changed to

Disallow: /haskellwiki/index.php

which bans robots from everything except normal pages.
that sounds like the most promising approach to me (meta tags for history pages already have noindex,nofollow; so that didn't help, it seems? also, fewer robots look at meta tags than at robots.txt).

claus
participants (9)
- Bayley, Alistair
- Claus Reinke
- Duncan Coutts
- Henning Thielemann
- Jason Dagit
- Robin Green
- Simon Marlow
- Twan van Laarhoven
- Vincent Kraeutler