I wrote an entry earlier on the HTTP referral check on dutchie.org and promised at the end of that article to write up some more technical details. The referral links on Dutchie are generated dynamically from the HTTP Referer header sent by the browser. After a referring link has been detected, mangled and checked, one of two things happens: either the counter for an existing link is updated, or a new link is created with the title ‘Automatically Added’.
That title is obviously not very descriptive. On the Witches Three site, where I first used this technique, I set it up so that new links were disabled at first; I’d go in periodically, inspect them by hand and edit the titles before enabling them. For Dutchie I wanted to do this a bit differently, so I read up on the Perl HTML::Parser module. I still do most of my command line scripting in Perl because of the enormous number of libraries available for it, and HTML::Parser looks like it will be a very valuable one. Currently I use it for two things:
- Scraping the title from the referring page
- Checking all the links on the referring page to see if they still link to dutchie.org
I’ll also briefly document another module I’ve found very useful, WWW::RobotRules.
The Linkcheck robot
The Dutchie linkcheck robot is in essence a web crawler. There are some politeness rules that web crawlers should obey, even though nobody can enforce them. A website owner (or in many cases also community users) can specify which pages they don’t mind having crawled by a script and which pages they would rather keep off limits. They do so by adding a simple file called ‘robots.txt’ to the root of their web server.
The robots.txt allows website owners to specify:
User-Agent: [name of user agent]
Allow: [URL]
Disallow: [URL]
Using this simple format it’s possible to make all search engines ignore, for example, your login page:
User-Agent: *
Disallow: /login.php
This is somewhat useful, I believe, as in most situations nobody really benefits from knowing about a login page. The Dutchie linkchecker identifies itself with the User-Agent ‘Dutchie/0.2’, so if you wanted to block the Dutchie linkchecker from accessing your site completely you could write something like:
User-Agent: Dutchie/0.2
Disallow: /
While this would of course make me very unhappy, it is a possibility! The check that the linkchecker performs is done by the WWW::RobotRules Perl module:
1. use WWW::RobotRules;
2. my $rules = WWW::RobotRules->new('Dutchie/0.2');
3. if ($url =~ /(http:\/\/.*?\/).*/) {
4.     my $robotsurl = $1 . "robots.txt";   # e.g. http://dutchie.org/robots.txt
5.     my $robots_txt = get $robotsurl;     # undef if the fetch fails
6.     $rules->parse($robotsurl, $robots_txt) if defined $robots_txt;
7. }
8. next if (!$rules->allowed($url));
It’s that simple! Obviously there’s some more code around this (for example, $url is the URL that the linkchecker fetches from the database, and the ‘get’ function comes from the LWP::Simple library), but in its bare essence this is all that’s required. In line 2 we create a new WWW::RobotRules object. In line 3 we take the URL we want to check and strip the tail off it (so http://dutchie.org/login.php becomes http://dutchie.org/), and in line 4 we append robots.txt to get the robots.txt URL, http://dutchie.org/robots.txt. In line 5 we fetch this robots.txt, in line 6 we parse its rules, and finally in line 8 we decide whether or not to proceed.
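To see the allow/deny decision without touching the network, here is a minimal sketch that feeds WWW::RobotRules an inline robots.txt. The example.com URLs and the robots.txt content are made up for illustration; only the new, parse and allowed calls come from the module itself.

```perl
use strict;
use warnings;
use WWW::RobotRules;

# Same agent name the linkchecker uses.
my $rules = WWW::RobotRules->new('Dutchie/0.2');

# Hypothetical robots.txt content, normally fetched over HTTP.
my $robots_txt = <<'ROBOTS';
User-agent: *
Disallow: /login.php
ROBOTS

# The first argument is the URL the robots.txt was (notionally) fetched from.
$rules->parse('http://example.com/robots.txt', $robots_txt);

for my $url ('http://example.com/index.html', 'http://example.com/login.php') {
    printf "%-32s %s\n", $url, $rules->allowed($url) ? 'allowed' : 'blocked';
}
```

One thing to keep in mind: allowed() only knows about hosts whose robots.txt has been parsed, so the rules have to be loaded per host before the check means anything.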
HTML::Parser
The HTML parser is quite a lot more difficult to use. It’s a library that aims to do a LOT of things, and does. The functionality is really quite impressive, and while reading through the documentation I wondered whether it wouldn’t be easier to simply grep through the web page. However, since I plan to use more of its functionality in the future, I decided to persist and kept on reading.
The HTML::Parser needs to be initialized first:
1. $contents = $res->content;
2. my $p = HTML::Parser->new(api_version => 3);
3. $p->report_tags('title','a');
4. $p->handler('start', \&tagstart_handler, 'tagname,attr');
5. $p->handler('text', \&text_handler, 'text');
6. $p->handler('end', \&tagend_handler, 'tagname');
7. $in_title = 0;
8. $dutchielink = 0;
9. $p->parse($contents);
10. $p->eof;   # signal end of document so any buffered text is flushed
The $contents variable has just been filled with an HTML page in line 1. I first create a parser object in line 2, and then in line 3 tell the parser to report only on the tags ‘title’ and ‘a’. (It’s also possible to add this to the parser constructor.) The report_tags function basically tells the object not to do anything for any other tags it finds in the HTML. This would imply that it does do something for ‘title’ and ‘a’. Well, that’s not quite true yet: I first need to tell the object what to do when it sees the start (or end, or text) of one of the reported tags. I do this in lines 4, 5 and 6. Line 4, for example, tells the object to:
- Call the function ‘tagstart_handler’ with ‘tagname’ and ‘attr’ as arguments whenever it sees a start tag.
When we take a look at the tagstart_handler function:
sub tagstart_handler {
    my ($tagname, $attr) = @_;
    if ($tagname eq "a") {
        # Not every <a> tag has an href, so guard against undef
        my $href = $attr->{href};
        if (defined $href &&
            $href =~ m{^http://(www\.)?dutchie\.org}) {
            $dutchielink = 1;
        }
    } elsif ($tagname eq "title") {
        $in_title = 1;
    }
}
We can see that within this function we only check whether the tagname argument is ‘a’ or ‘title’; after all, those are the only start tags we report on. When we encounter an ‘a’ start tag, we check whether it has an ‘href’ attribute and whether that attribute starts with either http://dutchie.org or http://www.dutchie.org. If so, we set the $dutchielink variable (which was set to 0 before the page was parsed). The $dutchielink variable never gets reset during parsing, so we could basically stop parsing when we see the first valid link; after all, one link on the referring page is enough. That will probably be in version 0.3 though.
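The early exit mentioned above could be sketched roughly as follows. This is not the author's version 0.3, just one way to do it: HTML::Parser documents that calling eof from inside a handler terminates parsing, and asking for ‘self’ in the argspec gives the handler access to the parser object. The sample HTML and the $anchors_seen counter are made up for illustration.

```perl
use strict;
use warnings;
use HTML::Parser;

my $dutchielink  = 0;
my $anchors_seen = 0;   # hypothetical counter, only here to show the early stop

my $p = HTML::Parser->new(api_version => 3);
$p->report_tags('a');
$p->handler(start => sub {
    my ($self, $tagname, $attr) = @_;
    $anchors_seen++;
    my $href = $attr->{href};
    if (defined $href && $href =~ m{^http://(?:www\.)?dutchie\.org}) {
        $dutchielink = 1;
        $self->eof;     # one link back is enough, so stop parsing here
    }
}, 'self,tagname,attr');

my $html = <<'HTML';
<a href="http://example.com/">somewhere else</a>
<a href="http://dutchie.org/">Dutchie</a>
<a href="http://example.org/">later link</a>
HTML

# parse() returns false when a handler aborted it; only finish normally otherwise.
$p->eof if $p->parse($html);

print "link found: $dutchielink, anchors inspected: $anchors_seen\n";
```

Stopping at the first link only makes sense if the title has already been seen by then, which is normally the case since the title lives in the page head.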
The title parsing is a bit more tricky. This is what we need the ‘text’ and ‘end’ handlers for. A title typically looks like <title>Dutchie family site</title> and we want to scrape (in this case) the ‘Dutchie family site’ part. This is not an attribute of the <title> tag; it is something that sits between the start and the end tag of the ‘title’ element. So the only thing we do when we see a ‘title’ start is to remember that we’ve seen it by setting the $in_title variable to 1. You can probably guess that the ‘end’ handler for the title tag (well, for any reported tag actually) will reset the $in_title variable? Right…
So this is where the ‘text’ handler comes in. This handler gets called on the text between any start and end tag, so it will also be called inside big divs, tables etc. (this is something to keep in mind: with LARGE chunks of HTML the HTML parser needs to do a LOT of work). Since the text handler gets called on any and all chunks of text between tags, it will also be called between the <title> and the </title> tags. With the text handler looking like:
sub text_handler {
    my ($text) = @_;
    # Only keep text seen while we are inside <title>...</title>
    if ($in_title) {
        $title = $text;
    }
}
it now becomes easy to see just why I’ve set the $in_title flag in the start handler for the ‘title’ tag: only when this flag is set will the script put the $text variable passed by the parser into the $title global.
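To tie the three handlers together, here is a minimal self-contained sketch that runs them against an inline page. The end handler is never shown in the article, so tagend_handler below is only an assumption about its shape, and the sample HTML is made up. The sketch also appends to $title with .= instead of assigning, in case the parser delivers the title text in more than one chunk.

```perl
use strict;
use warnings;
use HTML::Parser;

my ($in_title, $title, $dutchielink) = (0, '', 0);

sub tagstart_handler {
    my ($tagname, $attr) = @_;
    if ($tagname eq 'a') {
        my $href = $attr->{href};
        $dutchielink = 1
            if defined $href && $href =~ m{^http://(?:www\.)?dutchie\.org};
    } elsif ($tagname eq 'title') {
        $in_title = 1;
    }
}

sub text_handler {
    my ($text) = @_;
    $title .= $text if $in_title;   # collect text only inside <title>
}

# Assumed shape of the end handler: leave "title mode" on </title>.
sub tagend_handler {
    my ($tagname) = @_;
    $in_title = 0 if $tagname eq 'title';
}

my $html = <<'HTML';
<html><head><title>Dutchie family site</title></head>
<body><p>Visit <a href="http://www.dutchie.org/">Dutchie</a>.</p></body></html>
HTML

my $p = HTML::Parser->new(api_version => 3);
$p->report_tags('title', 'a');
$p->handler(start => \&tagstart_handler, 'tagname,attr');
$p->handler(text  => \&text_handler,     'text');
$p->handler(end   => \&tagend_handler,   'tagname');
$p->parse($html);
$p->eof;

print "title: $title\n";
print "links back: ", ($dutchielink ? 'yes' : 'no'), "\n";
```

When checking many pages in a loop, the three state variables would need to be reset before each parse, just as $in_title and $dutchielink are reset in the initialization code earlier.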
So now the rest of the linkchecker becomes rather trivial:
- If a referral link is found (check whether $dutchielink is set), update the database with the $title