- Code:
#!/usr/bin/perl
$| = 1;                        # unbuffered output

# REMOTE_HOST is only filled in when Apache has HostnameLookups On.
$host  = $ENV{'REMOTE_HOST'};
$addr  = $ENV{'REMOTE_ADDR'};
$agent = $ENV{'HTTP_USER_AGENT'};

print "Content-type: text/plain\n\n";

# Googlebot gets the permissive rules; everyone else is disallowed outright.
if ($host =~ /\.googlebot\.com$/i && $agent =~ /^Googlebot/) {
    print <<'EOF';
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
EOF
} else {
    print <<'EOF';
User-agent: *
Disallow: /
EOF
}
Virtual Host file layout...
- Code:
<VirtualHost *>
    ServerName mysite.com
    ServerAlias www.mysite.com
    DocumentRoot /web/mysite.com/htdocs
    ErrorLog /var/web/mysite.com/error_log
    CustomLog /var/web/mysite.com/access_log common
    ScriptAlias /cgi-bin/ "/web/mysite.com/cgi-bin/"

    RewriteEngine On
    RewriteRule /robots\.txt$ /var/www/cgi-bin/robots.pl [L,T=application/x-httpd-cgi]
</VirtualHost>
I know from the rule above that any request for /robots.txt gets handed to robots.pl, which serves the appropriate robots.txt to the bot (or to an end user, if they request it directly).
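If anyone wants to poke at the script without going through Apache, something like this exercises it from the command line (the hostname, address, and agent string below are just made-up example values):
- Code:
#!/usr/bin/perl
# Fake the CGI environment that Apache would normally set up, then run robots.pl.
# All three values are examples only.
$ENV{'REMOTE_HOST'}     = 'crawl-66-249-66-1.googlebot.com';
$ENV{'REMOTE_ADDR'}     = '66.249.66.1';
$ENV{'HTTP_USER_AGENT'} = 'Googlebot/2.1 (+http://www.google.com/bot.html)';
do './robots.pl';    # should print the permissive robots.txt
Swap the values for something non-Google and it should fall through to the "Disallow: /" version.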
---
However, I am now finding hundreds of new web-related bots hammering my sites, and as it turns out, most of these 'bots' never even look at robots.txt and scrape the website regardless. I thought of using a mod_rewrite setup that checks HTTP_USER_AGENT against a list of these bots' tags (e.g. FlickBot, SurveyBot), but that list would run to several lines and would have to be repeated in every Virtual Host entry, which would be a pain to keep up to date as new bots keep appearing.
- Code:
RewriteEngine on
RewriteCond %{HTTP_USER_AGENT} FlickBot [OR]
RewriteCond %{HTTP_USER_AGENT} SurveyBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^(autoemailspider|ExtractorPro) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^E?Mail.?(Collect|Harvest|Magnet|Reaper|Siphon|Sweeper|Wolf) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (DTS.?Agent|Email.?Extrac) [NC]
RewriteRule .* - [F,L]
As for my question -- is it possible to read in the RewriteCond entries (from a Perl script, flat file, etc.) before the RewriteRule is applied, the same way I do for generating robots.txt?
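Roughly what I'm picturing is a small Perl script that reads a flat file of bad-bot patterns and writes out a config snippet that every Virtual Host could pull in with Include -- just a sketch, and badbots.txt / badbots.conf are made-up names:
- Code:
#!/usr/bin/perl
use strict;
use warnings;

# Example file names only: a flat list of bad-bot patterns in, an Apache
# snippet out that each vhost can Include.
my $list = '/etc/httpd/badbots.txt';     # one User-Agent pattern per line
my $out  = '/etc/httpd/badbots.conf';

open my $in, '<', $list or die "Cannot read $list: $!";
my @bots;
while (my $line = <$in>) {
    chomp $line;
    $line =~ s/^\s+|\s+$//g;                 # trim whitespace
    next if $line eq '' || $line =~ /^#/;    # skip blanks and comments
    push @bots, $line;
}
close $in;
die "No patterns found in $list\n" unless @bots;

open my $conf, '>', $out or die "Cannot write $out: $!";
print $conf "RewriteEngine on\n";
for my $i (0 .. $#bots) {
    # [OR] on every condition except the last, so any single match blocks.
    my $flags = $i < $#bots ? '[NC,OR]' : '[NC]';
    print $conf "RewriteCond %{HTTP_USER_AGENT} $bots[$i] $flags\n";
}
print $conf "RewriteRule .* - [F,L]\n";
close $conf;
Each VirtualHost would then only need a single Include line pointing at the generated file, and I'd re-run the script (and reload Apache) whenever a new bot shows up. Is that a sane way to do it, or is there a better mechanism?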
-- Michael