Apache's RewriteEngine

While developing Dutchie I try to keep SEO in mind. Now that I’m done with the raw work of defining most of the objects used (e.g. videos, photos, discussions etc.) and allowing people to edit and delete those objects, it’s about time I start working on the sharing part, and that’s where SEO comes in.

It’s hard to tell what search engines like but a few things I keep reading are:

  • HTML files get indexed better than PHP files with tons of arguments in the URL

While this makes some sort of sense, after all a URL like http://dutchie.org/artikelen/show_my-apache-rewrite-engine-article.html says more about the contents than http://dutchie.org/showartikel.php?id=5439, I’m curious to see if this will indeed be the case. In previous sites I’ve used pages with lots of links in a listing and I’ve found that most major search engines that come crawling by will actually visit all the links as well.

So I’m going to try to make the URLs more SEO friendly, like a lot of articles suggest, but how to go about it? I obviously don’t want to make an HTML page for every article somebody submits, and I would like to have a URL that lists all the objects in a certain category as well.

Enter Apache’s RewriteEngine. When you Google for rewriteengine you’ll find lots of articles about it telling you it’s some sort of dark voodoo that should only be touched when everything else fails.

That’s not really so; I’ve found it to be rather flexible so far!

The first thing we need to do is to turn on the RewriteEngine. This is done simply by a line in the website’s config that looks like:

  • RewriteEngine On

Next, a listing page that shows all the objects in a certain category. Taking the article listing as an example again, I’d like the URL http://dutchie.org/artikelen/ to map to a script showarticles.php. This is only slightly harder than turning on the rewrite engine:

  • RewriteRule /artikelen/ /showarticles.php

That’s really all there is to it! This simple rule makes sure that every URL containing /artikelen/ gets rewritten to /showarticles.php.
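Because the pattern is not anchored, it matches any URL path that merely contains /artikelen/. A slightly stricter variant could look like the sketch below. It assumes the rules live in the server or virtual host config, where the path being matched starts with a slash.

```apache
RewriteEngine On

# ^ and $ anchor the pattern to the whole URL path, so only the
# listing URL itself matches. The ? makes the trailing slash optional.
RewriteRule ^/artikelen/?$ /showarticles.php
```

With this variant /artikelen/ and /artikelen still end up at /showarticles.php, while longer paths under /artikelen/ are left for other rules.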

Somewhat more tricky is the following rule:

  • RewriteRule /artikelen/show_([\w]+)\.html /showobject.php?subject=$1

Here we see a regular expression in action. In regular expressions, \w is any word character (a letter, a digit or an underscore), [...] defines a set of characters (so in this case just \w) and + means ‘one or more’. So [\w]+ simply means ‘a word’.

The (…) is used to turn the result of the expression into a variable called $1, which we use again in the second part of the RewriteRule.

Note that \w does not match a hyphen, so for a title like my-apache-rewrite-engine-article the set has to include it: [\w-]+. With that in place, when Apache sees the URL http://dutchie.org/artikelen/show_my-apache-rewrite-engine-article.html it gets a regular expression match on ‘my-apache-rewrite-engine-article’ and puts that in the variable $1. This makes the second part of the RewriteRule expand into /showobject.php?subject=my-apache-rewrite-engine-article

showobject.php can simply use the subject argument to pull the object with the same title from the database, output it to the webserver, and a new ‘static-but-oh-so-dynamic’ HTML page is born spontaneously!
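For hyphenated titles like the one in the example URL, the rule could be written as in the sketch below. The anchors and the hyphen in the character class are my additions, and the leading slash again assumes server or virtual host context.

```apache
# [\w-]+  : one or more word characters or hyphens, captured as $1.
# \.html$ : the URL path has to end in a literal ".html".
RewriteRule ^/artikelen/show_([\w-]+)\.html$ /showobject.php?subject=$1
```

A request for /artikelen/show_my-apache-rewrite-engine-article.html would then be handled by /showobject.php?subject=my-apache-rewrite-engine-article.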

Order Matters

The RewriteEngine gets even a bit more complex when we realize that:

  • One RewriteRule can mask another.
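The two rules shown earlier illustrate this. A matching RewriteRule replaces the whole URL path, and the unanchored /artikelen/ pattern also matches article URLs. The sketch below assumes both rules sit in the same config and carry no flags.

```apache
# Problematic order (commented out on purpose): the listing pattern also
# matches /artikelen/show_some_title.html, so that URL becomes
# /showarticles.php and the second pattern no longer matches anything.
#
#   RewriteRule /artikelen/ /showarticles.php
#   RewriteRule /artikelen/show_([\w]+)\.html /showobject.php?subject=$1

# One way around it: the more specific rule goes first, and the listing
# pattern is anchored so it only matches the listing URL itself.
RewriteRule ^/artikelen/show_([\w]+)\.html$ /showobject.php?subject=$1
RewriteRule ^/artikelen/?$ /showarticles.php
```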

Lastly, real people tend to put more than just word characters in the titles of their articles. Typically there will also be whitespace (\s in regular expression speak) and perhaps even some other characters like the apostrophe ('). For now I’ve expanded the second RewriteRule to ([\w\s']+) but I doubt whether this will be sufficient.
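If the character set keeps growing, one alternative is to accept everything except a slash and leave the validation to the script. This is only a sketch of that idea and not what Dutchie actually uses. The [B] flag is mod_rewrite's option for escaping non-alphanumeric characters in backreferences before they are substituted.

```apache
# [^/]+ : one or more characters of any kind except a slash, captured as $1.
# [B]   : escape non-alphanumeric characters in $1 before it is placed
#         in the query string.
RewriteRule ^/artikelen/show_([^/]+)\.html$ /showobject.php?subject=$1 [B]
```

The looser the pattern, the more the receiving script has to check for itself.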

Generalisation

Since I have a fair number of object types I want to map to a show_*.php file, I wondered whether I could do this in a more generalized way. The first thing we see in an example URL /families/ -> /show_families.php… is that the keyword ‘families’ ends up in the PHP filename. This is the case with all the object listing pages. Furthermore, I’ve found that when search engines see a link like /families/, they will actually try to access /families instead, which is why the trailing slash is made optional with [/]*. If we make our generalized object listing RewriteRule look like:

  • RewriteRule ^/([a-z]+)[/]*$ /toon$1.php

we would have /families/ rewritten to /toonfamilies.php, /fotos/ to /toonfotos.php etc. (‘toon’ is Dutch for ‘show’).
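A rule this general also rewrites URLs that have no matching script. One possible safeguard, sketched here as an assumption rather than as part of the Dutchie config, is a RewriteCond that checks whether the target file exists under the document root:

```apache
# Only rewrite when the target script actually exists on disk.
# $1 in the condition refers to the group captured by the
# RewriteRule pattern on the next line.
RewriteCond %{DOCUMENT_ROOT}/toon$1.php -f
RewriteRule ^/([a-z]+)[/]*$ /toon$1.php
```

A request for /families/ is rewritten only if toonfamilies.php is present. Any other lowercase word falls through to the remaining rules.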

Excessive Matching

It is very interesting to turn on some logging for the RewriteEngine by using:

  • RewriteLogLevel 9
  • RewriteLog /tmp/rewritelog
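These two directives belong to Apache 2.2 and earlier. From Apache 2.4 on they are gone, and mod_rewrite logs through the regular error log at a per-module log level. A rough equivalent, with the level chosen arbitrarily here, would be:

```apache
# Apache 2.4 and later: rewrite tracing goes to the ErrorLog.
# trace1 is the least verbose level, trace8 the most.
LogLevel warn rewrite:trace3
```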

When we go to /families/, using the above RewriteRule would take us to /toonfamilies.php. This script in turn would generate a header which might include some JavaScript, some CSS and a body that might include all kinds of other stuff. All of these things need to be retrieved by the browser to render the whole page. Everything that needs to be retrieved from the server will be inspected by the RewriteEngine. If we have a lot of RewriteRules, this could result in quite some action inside Apache! So the first thing to do to reduce this load is to add an [L] to our above RewriteRule:

  • RewriteRule ^/([a-z]+)[/]*$ /toon$1.php [L]

This will cause this RewriteRule to be the last processed RewriteRule if it matches. So when we go to /families/, the RewriteEngine rewrites it to /toonfamilies.php and then stops doing further matching.

Another trick to use is the NOT character ‘!’. Dutchie also hosts The Witches Three, a site that also uses a lot of RewriteEngine rules. All of these rules start with /Tracy/. If we precede these rules with:

  • RewriteRule !^/Tracy/ - [L]

We’ll make sure that all the TW3 Rewrites will not be attempted if the URL doesn’t start with /Tracy/ to begin with!
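Placement matters here: the guard ends rewriting for every URL outside /Tracy/, so the rules for the rest of the site have to come before it. A sketch of the ordering follows. The last rule and its script name are hypothetical and serve only as illustration.

```apache
# Rules for the rest of the site come first.
RewriteRule ^/([a-z]+)[/]*$ /toon$1.php [L]

# Guard: any URL that does not start with /Tracy/ stops here.
# The "-" substitution means "leave the URL untouched".
RewriteRule !^/Tracy/ - [L]

# Only URLs starting with /Tracy/ reach this point.
# (Hypothetical rule, not taken from the real TW3 config.)
RewriteRule ^/Tracy/gallery/?$ /Tracy/show_gallery.php [L]
```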

Security

What we’re basically doing with the second rewrite rule is passing whatever part of the URL matches ([\w\s']+) as an HTTP argument to a PHP script. The argument is an article title that will be looked up in a database with a query like:

  • SELECT * FROM articles WHERE subject='argument'

Since the URL can be anything, it could also include stuff like ‘; SELECT * FROM USERS’ (SQL injection). We’re not quite there, since a ‘;’ would not be matched by the regular expression, but the apostrophe is matched, and that alone is enough to break out of the quoted string in the query. So we need to stay cautious and filter the argument that gets passed by Apache!

About Fred Leeflang

Hi! I am the website administrator of the Forza website.