Yahoo Slurp is confused - question-mark rewrite help needed

Using mod_rewrite to handle various content issues

Post by cybe » Tue May 15, 2007 6:53 am

Yahoo Slurp is confused and is indexing a seemingly unlimited number of nonexistent URLs on my Drupal site. I need help using mod_rewrite to make some nonexistent pages return "410 Gone".


I've been trying to fix it for two days and finally ended up back here.

Code: Select all
http://site/cars?page=1
http://site/cars?page=2
http://site/cars?page=10&from=100

These do exist, and I've recently blocked them in robots.txt, since Yahoo Slurp will not find anything new there, only lots of duplicate content as the numbers change.

However, Slurp seems to re-read robots.txt very rarely.

Now I see Slurp is indexing nonexistent pages.

Code: Select all
http://site/node/321


is an article, just one page... but Slurp has started thinking there are many pages of it:

Code: Select all
http://site/node?page=2
http://site/node?page=312&from=100


Appending ?something to almost any URL on a PHP website is possible. [Link removed by richardk: /posting.php?hello_whats_happening_?] I don't know why Slurp started doing it on my site.


I've been trying to mod_rewrite it away but have not been successful. Here are a couple of attempts that do NOT work. Can anyone please help?
Code: Select all
RewriteRule ^node/(.*)?page - [G,L]


Code: Select all
  RewriteCond %{QUERY_STRING} node/([0-9]+)\?page
  RewriteRule - [G,L]



Code: Select all
  RewriteCond %{QUERY_STRING} \?node/([0-9]+)
  RewriteCond %{QUERY_STRING} page=([0-9]+)
  RewriteRule - [G,L]

cybe
 
Posts: 6
Joined: Tue Oct 31, 2006 9:32 am
Location: Finland

Post by richardk » Tue May 15, 2007 9:16 am

cybe wrote: Appending ?something to almost any URL on a PHP website is possible. [Link removed by richardk: /posting.php?hello_whats_happening_?] I don't know why Slurp started doing it on my site.

Maybe someone linked to it like that. The bot will then look for other pages by changing the values.

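The query string is everything after the ?. A RewriteRule pattern only matches the URL path, so the query string has to be tested with a RewriteCond on %{QUERY_STRING}.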
Code: Select all
Options +FollowSymLinks

RewriteEngine On

RewriteCond %{QUERY_STRING} ^(.*&)?page=[0-9]+(&.*)?$ [NC]
RewriteRule ^node(/.*)?$ - [G,L]


If you wanted to return 410 for every /node/* request that has any query string at all, replace ^(.*&)?page=[0-9]+(&.*)?$ with !^$.
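
As a sketch of that variant, assuming the same .htaccess context as the rule above, the condition simply tests for a non-empty query string:

```apache
RewriteEngine On

# True for any request whose query string is not empty
RewriteCond %{QUERY_STRING} !^$
# Answer 410 Gone for /node and everything under /node/
RewriteRule ^node(/.*)?$ - [G,L]
```

Note that this is broader than the page= condition: any /node URL carrying a query string, whatever its parameters, would be answered with 410.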
richardk
 
Posts: 8800
Joined: Wed Dec 21, 2005 7:50 am

Post by cybe » Tue May 15, 2007 2:21 pm

Thank you very much richardk. It works.

Post by cybe » Tue May 15, 2007 3:08 pm

But it also filters out the legitimate ones

Code: Select all
http://100777.com/?page=8
http://100777.com/node?page=8


which are Drupal category listings, for instance,

versus the nonexistent
Code: Select all
http://100777.com/node/321?page=8


=(

Post by cybe » Tue May 15, 2007 11:08 pm

I'll just add a
RewriteCond %{HTTP_USER_AGENT} Slurp [OR]


so it applies only to Slurp. That should be OK: Slurp doesn't need to visit any page=num URL, since it should still find all the articles.

Post by richardk » Thu May 17, 2007 11:25 am

Remove the [OR] (replace it with [NC]), or the rule will match every Slurp request for /node*, whether or not it has a page= query string.
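
Put together, the rule set discussed in this thread might look like the following sketch. It assumes the rules live in the site's root .htaccess, placed before Drupal's own rewrite rules, and that the crawler's User-Agent header contains the string Slurp.

```apache
Options +FollowSymLinks

RewriteEngine On

# Only apply to Yahoo Slurp (case-insensitive match on the User-Agent)
RewriteCond %{HTTP_USER_AGENT} Slurp [NC]
# ...and only when the query string contains page=<digits>
RewriteCond %{QUERY_STRING} ^(.*&)?page=[0-9]+(&.*)?$ [NC]
# Answer 410 Gone for /node and anything below it
RewriteRule ^node(/.*)?$ - [G,L]
```

Consecutive RewriteCond lines are ANDed by default, so both conditions must hold before the rule fires. If listings such as /node?page=8 should stay reachable for Slurp too, one option (not tested in this thread) would be to narrow the pattern to ^node/[0-9]+$, so that only single-article URLs such as /node/321?page=8 are affected.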

