Using TIME_HOUR to prevent search engine crawling


Post by mwe » Tue Sep 15, 2009 6:46 am

One of our Drupal sites is getting hammered in the daytime by search engine bots due to a taxonomy feed we have available.

Code:
192.168.xxx.xxx - - [15/Sep/2009:10:34:58 -0400] "GET /taxonomy/term/324/feed/ HTTP/1.0" 301 325 "-" "Yahoo-Newscrawler/3.9 (news-search-crawler at yahoo-inc dot com)"


Would something like the following prevent search engine bots from crawling a certain Drupal directory structure within a set time frame (e.g. 6am until 11pm), while still allowing everyone else to access the feeds?

Code:
  RewriteCond %{HTTP_USER_AGENT} ^.*(msnbot|googlebot|yahoo|newsbrain|rome) [NC]
  RewriteCond %{QUERY_STRING} ^taxonomy/term/[^/]+/feed$
  RewriteCond %{TIME_HOUR}%{TIME_MIN} >0600
  RewriteCond %{TIME_HOUR}%{TIME_MIN} <2300
  RewriteRule .* - [F]


-- M

Post by richardk » Tue Sep 15, 2009 5:58 pm

List the hours:
Code:
RewriteCond %{TIME_HOUR} ^(06|07|08|09|10|11|12|13|14|15|16|17|18|19|20|21|22)$

Or you can use a shorter regular expression:
Code:
RewriteCond %{TIME_HOUR} ^(0[6-9]|1[0-9]|2[0-2])$


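Also, you probably want REQUEST_URI rather than QUERY_STRING, because the path /taxonomy/term/.../feed is part of the URL path, not the query string. But you can match the path in the RewriteRule pattern anyway (that's what it's for).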
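
For illustration only, a rough sketch of what the REQUEST_URI variant might look like, assuming the rules live in the site's .htaccess. Note that REQUEST_URI starts with a leading slash, unlike the path a RewriteRule sees in per-directory context:
Code:
  # Match the feed path via REQUEST_URI (note the leading slash)
  RewriteCond %{REQUEST_URI} ^/taxonomy/term/[^/]+/feed/?$
  # Only apply to the listed crawlers, case-insensitively
  RewriteCond %{HTTP_USER_AGENT} (msnbot|googlebot|yahoo|newsbrain|rome) [NC]
  # Only between 06:00 and 22:59 server time
  RewriteCond %{TIME_HOUR} ^(0[6-9]|1[0-9]|2[0-2])$
  # Return 403 Forbidden without rewriting the URL
  RewriteRule ^ - [F]

This should behave the same as matching the path in the RewriteRule pattern; it just moves the path test into a condition.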

Try:
Code:
  RewriteCond %{HTTP_USER_AGENT} (msnbot|googlebot|yahoo|newsbrain|rome) [NC]
  RewriteCond %{TIME_HOUR} ^(0[6-9]|1[0-9]|2[0-2])$
  RewriteRule ^taxonomy/term/[^/]+/feed/?$ - [F,L]
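
If boundaries finer than whole hours are ever needed, the comparison approach from the original question should also work, because TIME_HOUR and TIME_MIN are zero-padded and the < and > tests compare strings lexicographically. A sketch, assuming a 06:30 to 22:45 window picked purely as an example:
Code:
  RewriteCond %{HTTP_USER_AGENT} (msnbot|googlebot|yahoo|newsbrain|rome) [NC]
  # HHMM strictly after 0629, i.e. 06:30 onwards
  RewriteCond %{TIME_HOUR}%{TIME_MIN} >0629
  # HHMM strictly before 2246, i.e. up to and including 22:45
  RewriteCond %{TIME_HOUR}%{TIME_MIN} <2246
  RewriteRule ^taxonomy/term/[^/]+/feed/?$ - [F,L]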

