Using mod_rewrite to call external RewriteCond perl script

Post by mwe » Tue Sep 08, 2009 6:11 am

For a number of sites I host, I use a dynamic robots.txt file generated by a Perl script, as described here (http://www.leekillough.com/robots.html). This works great for bots that actually read the robots.txt file. Because the script is dynamic, I can edit it and the new rules are picked up immediately, which makes changes fairly easy to keep up with.

Code: Select all
#!/usr/bin/perl

$| = 1;

$host  = $ENV{'REMOTE_HOST'};
$addr  = $ENV{'REMOTE_ADDR'};
$agent = $ENV{'HTTP_USER_AGENT'};

print "Content-type: text/plain\n\n";

if ($host =~ /\.googlebot\.com$/i && $agent =~ /^Googlebot/) {
    print <<'EOF';
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
EOF
} else {
    print <<'EOF';
User-agent: *
Disallow: /
EOF
}


Virtual Host file layout...
Code: Select all
<VirtualHost *>
   ServerName mysite.com
   ServerAlias www.mysite.com
   DocumentRoot /web/mysite.com/htdocs
   ErrorLog /var/web/mysite.com/error_log
   CustomLog /var/web/mysite.com/access_log common
   ScriptAlias /cgi-bin/ "/web/mysite.com/cgi-bin/"
   RewriteEngine On
   RewriteRule /robots\.txt$ /var/www/cgi-bin/robots.pl [L,T=application/x-httpd-cgi]
</VirtualHost>


With the rule above, a request for robots.txt runs robots.pl, which returns the robots.txt content to the bot (or to an end user who requests it).

---

However, I am now finding hundreds of new web-related bots hammering my sites, and most of these 'bots' don't even look at robots.txt; they scrape the website regardless. I thought of using a mod_rewrite setup that checks HTTP_USER_AGENT against a list of these bots' tags (e.g. FlickBot, SurveyBot). But that list is going to be several lines long and would have to be repeated in each Virtual Host entry, which would be a pain to keep track of as new bots keep appearing.

Code: Select all
RewriteEngine on
RewriteCond %{HTTP_USER_AGENT} FlickBot [OR]
RewriteCond %{HTTP_USER_AGENT} SurveyBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^(autoemailspider|ExtractorPro) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^E?Mail.?(Collect|Harvest|Magnet|Reaper|Siphon|Sweeper|Wolf) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (DTS.?Agent|Email.?Extrac) [NC]
RewriteRule .* - [F,L]


As for my question -- is it possible to read in the RewriteCond entries (via a Perl script, flat file, etc.) before the RewriteRule is evaluated, the way I do when generating the robots.txt file?

-- Michael
Last edited by mwe on Tue Sep 08, 2009 8:14 am, edited 1 time in total.

Post by richardk » Tue Sep 08, 2009 8:13 am

You could use a RewriteMap. For example:
Code: Select all
Options +FollowSymLinks

RewriteEngine On

# Can probably be set globally. May require the inherit from below.
RewriteMap lowercase int:tolower
RewriteMap badbots txt:/path/to/bad/bots.txt

# Can probably be set globally. Will require the inherit from below.
RewriteCond ${badbots:${lowercase:%{HTTP_USER_AGENT}}|allow} !^allow$
RewriteRule .* - [F,L]


/path/to/bad/bots.txt
Code: Select all
useragent1 allow
useragent2 disallow

You can also use programs for RewriteMaps.
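As a rough illustration of the program option, here is a minimal sketch of a prg: map helper. It is written in Python only for the sake of example (a Perl script would follow the same pattern). The file name badbots_map.py, the tag list, and the allow/disallow values are assumptions chosen to line up with the RewriteCond above; they are not something mod_rewrite requires. The part that does come from mod_rewrite is the protocol: the program is started once, reads one lookup key per line on stdin, and must answer with exactly one line on stdout without buffering.

```python
#!/usr/bin/env python3
"""Sketch of a prg: RewriteMap helper (hypothetical name: badbots_map.py).

Assumed wiring in the server or virtual host configuration:
    RewriteMap  badbots prg:/path/to/badbots_map.py
    RewriteCond ${badbots:%{HTTP_USER_AGENT}|allow} !^allow$
    RewriteRule .* - [F,L]
"""
import sys

# Illustrative tags only (taken from the examples in this thread), lowercase.
BAD_AGENT_TAGS = ("flickbot", "surveybot")


def lookup(user_agent):
    """Return the map value for one lookup key (one User-Agent string)."""
    ua = user_agent.strip().lower()
    return "disallow" if any(tag in ua for tag in BAD_AGENT_TAGS) else "allow"


def serve(stdin=sys.stdin, stdout=sys.stdout):
    """Answer one line per lookup, flushing after each reply."""
    for line in stdin:
        stdout.write(lookup(line) + "\n")
        stdout.flush()  # unflushed output would leave the server waiting


if __name__ == "__main__":
    serve()
```

Because the program answers every lookup itself, the lowercase map is not needed in this variant. Keep the program simple: it runs for the lifetime of the server, and if it hangs, requests that consult the map hang with it.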

Or put the user-agent mod_rewrite rules in the server configuration and use
Code: Select all
RewriteOptions Inherit

in the <VirtualHost> to inherit it.

Post by mwe » Tue Sep 08, 2009 8:20 am

Thanks for the feedback/help.

One question though -- how would I set up the user agents for something like this:

Code: Select all
RewriteCond %{HTTP_USER_AGENT} ^(eCatch|(Get|Super)Bot|Kapere|HTTrack|JOC|Offline|UtilMind|Xaldon) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Web.?(Auto|Cop|dup|Fetch|Filter|Gather|Go|Leach|Mine|Mirror|Pix|QL|RACE|Sauger) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Web.?(site.?(eXtractor|Quester)|Snake|ster|Strip|Suck|vac|walk|Whacker|ZIP) [NC,OR]


-- M

Post by richardk » Tue Sep 08, 2009 4:09 pm

Do you mean in a txt RewriteMap? You would probably need to use a program instead.
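A txt map only does exact key lookups, so regular expressions like the ones above have to be evaluated by the program itself. The following is a sketch of one way that could look, not a tested production setup. The pattern file format (one regular expression per line, '#' for comment lines), the class name PatternFile, and the allow/disallow values are all assumptions made for this example. Matching is case-insensitive to mimic the [NC] flag.

```python
#!/usr/bin/env python3
"""Sketch: regex-based User-Agent check for a prg: RewriteMap.

Assumed pattern file layout (hypothetical), one regex per line:
    # offline browsers
    ^(eCatch|(Get|Super)Bot|HTTrack)
    ^Web.?(Auto|Cop|Fetch)
"""
import os
import re
import sys


class PatternFile:
    """Compile patterns from a flat file; reload when its mtime changes."""

    def __init__(self, path):
        self.path = path
        self.mtime = None
        self.patterns = []

    def _reload_if_changed(self):
        mtime = os.stat(self.path).st_mtime_ns
        if mtime != self.mtime:
            with open(self.path, encoding="utf-8") as fh:
                lines = [ln.strip() for ln in fh]
            # Skip blank lines and comment lines; compile the rest like [NC].
            self.patterns = [
                re.compile(ln, re.IGNORECASE)
                for ln in lines
                if ln and not ln.startswith("#")
            ]
            self.mtime = mtime

    def verdict(self, user_agent):
        """Return 'disallow' if any pattern matches, else 'allow'."""
        self._reload_if_changed()
        if any(p.search(user_agent) for p in self.patterns):
            return "disallow"
        return "allow"


def main(path):
    bots = PatternFile(path)
    # One lookup key per line in, one answer per line out, flushed each time.
    for line in sys.stdin:
        sys.stdout.write(bots.verdict(line.rstrip("\n")) + "\n")
        sys.stdout.flush()


if __name__ == "__main__":
    main(sys.argv[1])
```

Since the file is checked on each lookup, edits to the pattern list take effect without touching the virtual hosts or restarting the server. Note that an invalid regular expression in the file would raise re.error and stop the program as written, so a real deployment would want to catch that and keep the previous pattern list.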

