Remove complete filename with specific extension from URL

Postby Marcelo » Tue Sep 30, 2008 10:12 am

Hello,

I have a problem I can't figure out, and I was not able to find a similar one here on the forum.

I want to redirect direct requests for a PDF file that come from a search engine (e.g. Google) to an HTML file with a link to the PDF. This way I can save some bandwidth on people directly downloading something they do not want to read. As my PDF files are 10 MB on average, this can add up.

I want to do this by rewriting e.g.

http://www.mysite.com/magazine/articles/articles.pdf

to

http://www.mysite.com/magazine/articles/

In this way, the index.html gets served instead of the large PDF.

Please notice that the directory structure is not fixed, the files can be in different folders at different levels. That is why I only want to look at the filename at the end.

I would also like to test whether the extension is .pdf before a rewrite, as this avoids the server carrying out unnecessary rewrites.

I also want to enable Google and other search engines to index the PDF directly, so I do not want the html page to be served to the search engine.

I think I got the search engine part correct (Google in the example below, more to be added), but I am not able to remove the complete filename, only the .pdf extension, see below. Practically anything I try gives a server error.

I don't really know how to test for the file extension.

Code: Select all
Options +FollowSymLinks
RewriteBase /
RewriteEngine On

RewriteCond %{HTTP_USER_AGENT} !googlebot [NC]
RewriteRule  ^/?([a-z/]+)\.pdf$ $1/ [NC]


Your comments and suggestions are much appreciated. :)
Marcelo
 
Posts: 9
Joined: Tue Nov 06, 2007 6:03 am

Postby Marcelo » Tue Sep 30, 2008 2:00 pm

I've been working on the problem and I think I came up with a reasonable solution, see below. The only thing I can't figure out yet is how to test whether the file is a PDF before the rewrite.

I am also open to suggestions on how to do it in another way, especially if it can improve server performance. Also, I am interested in learning a little bit more by seeing other examples. There is probably a more elegant way than this...

Enough for now, it is getting late and I'm getting tired...

Code: Select all
Options +FollowSymLinks
RewriteBase /
RewriteEngine On

RewriteCond %{HTTP_USER_AGENT} !(googlebot|slurp|twiceler|msnbot) [NC]
RewriteCond %{HTTP_REFERER} !^http://(www\.)?puxa\.nl.*$ [NC]
RewriteRule  ^/?([a-z0-9/]+)/([a-z0-9]+)\.pdf$ $1/ [NC]


Below are some comments per line:

# Check which user agent is requesting the file. If it is in this list, it is allowed to index the PDF. Google, Yahoo, Live Search and Cuil have been included.
Code: Select all
RewriteCond %{HTTP_USER_AGENT} !(googlebot|slurp|twiceler|msnbot) [NC]


# If the request does not come from mydomain.com, serve index.html instead of the PDF. If you do not add this line, nobody except the search engines above will be able to access the files.
Code: Select all
RewriteCond %{HTTP_REFERER} !^http://(www\.)?mydomain\.com.*$ [NC]


# Split the path into two captured groups: one is everything before the filename, the other is the filename without extension. Only the first group is used in the substitution on the right.
Code: Select all
RewriteRule  ^/?([a-z0-9/]+)/([a-z0-9]+)\.pdf$ $1/ [NC]

Postby richardk » Wed Oct 01, 2008 10:42 am

The only thing I can't figure out yet is how to test whether the file is a PDF before the rewrite.

Do you mean to test if the file exists, or if it is PDF format?

To test if the file exists
Code: Select all
Options +FollowSymLinks

RewriteEngine On

# Make sure the requested (.pdf) file exists.
RewriteCond %{SCRIPT_FILENAME} -f
RewriteCond %{HTTP_USER_AGENT} !(googlebot|slurp|twiceler|msnbot) [NC]
RewriteCond %{HTTP_REFERER} !^(http://(www\.)?puxa\.nl(/.*)?)?$ [NC]
RewriteRule  ^(.+)/([^/]+)\.pdf$ /$1/ [NC,L]
richardk
 
Posts: 8800
Joined: Wed Dec 21, 2005 7:50 am

Postby Marcelo » Thu Oct 02, 2008 12:45 am

Thanks for the reply Richard.

No, in this case I know the file exists. I just need to know whether the request is for a PDF file or e.g. an .html file. If it is an .html file, I do not need to rewrite; if it is a PDF, I need to check who requested it and where from.

Will adding the following work?

Code: Select all
RewriteCond %{SCRIPT_FILENAME} (pdf) [NC]


I have not tried it yet as I do not have access to a test server right now and do not want to mess with the production server.

Thanks for rewriting the code. I assume you should use "*" as little as possible, and prefer e.g. "?", to limit the amount of work the server has to do?

However, one line does not seem to work:

Code: Select all
RewriteCond %{HTTP_REFERER} !^(http://(www\.)?puxa\.nl(/.*)?)?$ [NC]


If I rewrite it to:

Code: Select all
RewriteCond %{HTTP_REFERER} !^http://(www\.)?puxa\.nl(/.*)?$ [NC]


It works again. Any idea why?

Postby richardk » Thu Oct 02, 2008 3:33 am

No, in this case I know the file exists

You probably don't have a richardk.pdf file, though, so this would not match and a 404 header would be sent.

I just need to know whether the request is for a PDF file or e.g. an .html file. If it is an .html file, I do not need to rewrite; if it is a PDF, I need to check who requested it and where from.

Will adding the following work?
Code: Select all
RewriteCond %{SCRIPT_FILENAME} (pdf) [NC]

This check (that it is a .pdf file) is done in the RewriteRule
Code: Select all
^(.+)/([^/]+)\.pdf$


However, one line does not seem to work:
Code: Select all
RewriteCond %{HTTP_REFERER} !^(http://(www\.)?puxa\.nl(/.*)?)?$ [NC]


If I rewrite it to:
Code: Select all
RewriteCond %{HTTP_REFERER} !^http://(www\.)?puxa\.nl(/.*)?$ [NC]


It works again. Any idea why?

Having
Code: Select all
^(pattern)?$

means that it will also match an empty string (equivalent to ^$). In this case it allows for an empty Referer header, because some browsers and firewalls do not send one and you would otherwise be denying access to those users. The problem is that this also allows direct requests (you can type the URL into your browser).
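
To make the difference concrete, here is a minimal sketch of the two Referer conditions side by side, using the rest of the ruleset from the posts above. It is only an illustration, not a tested configuration: keep one variant active at a time, depending on whether visitors without a Referer header should get the PDF or the directory index.

```apache
RewriteEngine On

# Variant A: the whole Referer pattern is optional, so it also matches an
# empty Referer. The negated condition is therefore false for an empty
# Referer and for a Referer from puxa.nl, and those requests get the PDF.
RewriteCond %{HTTP_USER_AGENT} !(googlebot|slurp|twiceler|msnbot) [NC]
RewriteCond %{HTTP_REFERER} !^(http://(www\.)?puxa\.nl(/.*)?)?$ [NC]
RewriteRule ^(.+)/([^/]+)\.pdf$ /$1/ [NC,L]

# Variant B (commented out): the pattern only matches a Referer from puxa.nl.
# The negated condition is therefore true for an empty Referer as well, so
# direct requests are rewritten to the directory.
#RewriteCond %{HTTP_USER_AGENT} !(googlebot|slurp|twiceler|msnbot) [NC]
#RewriteCond %{HTTP_REFERER} !^http://(www\.)?puxa\.nl(/.*)?$ [NC]
#RewriteRule ^(.+)/([^/]+)\.pdf$ /$1/ [NC,L]
```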

Postby Marcelo » Thu Oct 02, 2008 4:18 am

This check (that it is a .pdf file) is done in the RewriteRule
Code: Select all
^(.+)/([^/]+)\.pdf$


I wanted to check the file before starting the rewrite, but looking at it now, I no longer see the need for this.

Having
Code: Select all
^(pattern)?$

means that it will also match an empty string (equivalent to ^$). In this case it allows for an empty Referer header, because some browsers and firewalls do not send one and you would otherwise be denying access to those users. The problem is that this also allows direct requests (you can type the URL into your browser).


I had not thought about that. However, I also want to redirect direct requests without a Referer header to the .html file, so I need to add this condition too:

Code: Select all
Options +FollowSymLinks
RewriteBase /
RewriteEngine On

RewriteCond %{HTTP_USER_AGENT} !(googlebot|slurp|twiceler|msnbot) [NC]
RewriteCond %{HTTP_REFERER} !^http://(www\.)?puxa\.nl(/.*)?$ [OR]
RewriteCond %{HTTP_REFERER} ^$ [NC]
RewriteRule  ^(.+)/([^/]+)\.pdf$ /$1/ [NC,L]


Thanks again Richard, I think I'm there. (Unless you spot something wrong in my last version :) )

Postby richardk » Thu Oct 02, 2008 7:22 am

I wanted to check the file before starting the rewrite, but looking at it now, I no longer see the need for this.

The RewriteRule pattern is tested first; see "Ruleset Processing" in the Apache mod_rewrite documentation.

I had not thought about that. However, I also want to redirect direct requests without a Referer header to the .html file, so I need to add this condition too:

You do not need to add that. The first HTTP_REFERER line will match empty Referer headers.
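
Putting both points together, the ruleset could look like the sketch below. It is the version from earlier in the thread without the extra ^$ condition. The numbered comments show the order in which mod_rewrite evaluates the lines, and they assume the rules live in a .htaccess file at the site root, as in the posts above.

```apache
Options +FollowSymLinks
RewriteEngine On

# 2. Checked only after the RewriteRule pattern below has matched:
#    skip the rewrite for the listed search engine crawlers.
RewriteCond %{HTTP_USER_AGENT} !(googlebot|slurp|twiceler|msnbot) [NC]
# 3. True for any Referer that is not a puxa.nl URL, and also for an empty
#    Referer, because an empty string does not match the pattern either.
RewriteCond %{HTTP_REFERER} !^http://(www\.)?puxa\.nl(/.*)?$ [NC]
# 1. Tested first: requests that do not end in .pdf never reach the
#    conditions above.
RewriteRule ^(.+)/([^/]+)\.pdf$ /$1/ [NC,L]
```

With this in place, a request for /magazine/articles/articles.pdf from a visitor without a Referer header should be rewritten internally to /magazine/articles/, while the listed crawlers and links on puxa.nl still get the PDF.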

Postby Marcelo » Thu Oct 02, 2008 10:28 pm

Thanks for all the help! :) I learned a lot from this post.

