Fun with filters
Posted: November 6th, 2002 | 5 Comments
So there I was, desperately trying to get some work finished before going out, when Matt IMed me out of the blue and derailed me in the surest way possible: namely, by asking me to do a quick fun bit of Perl hacking for him. (Click here to see why)
What he wanted (and was sure existed out there somewhere but couldn’t find) was a running CGI script (sorry, RESTian Web Service) that takes a URL to a page and replaces all occurrences of a given string with another string before spitting the page out. Since I had the code on hand already (in my form linker service) it only took a couple of minutes. You can try it here:
As you may see if you run the default example above, it’s good but it’s not perfect. This is a shame because, apart from the base URI problem, writing these kinds of filters is an utter doddle. (That’s British for “really easy”) You too can write your own Pornolizer or ValleyURL! See below for a brief bit of exploration and advice, as well as a plea for help to anyone who can help fix the bug.
Let’s get the basics out of the way first: even a novice server-side coder can write this stuff in a few lines. All your script is going to do is:
1. Read in some form variables to use
2. Go fetch a page from a given URL
3. Alter the page in the desired fashion
4. Change the base URI of the page so as to keep relative links working
5. Spit it out
If you can’t do point 1, you need to go off and learn about CGI programming.
Point 2 is usually achievable with one line of code if you have a decent URI/web toolkit. I use the fantastic LWP::Simple for Perl, like so:
use LWP::Simple;
my $html = get($url);
And that’s all it takes to grab a page from the web and stick it in a string.
It is also all it takes to expose a freaking huge security hole.
Most generic URI-fetching libraries will fetch many different types of URIs, not just the ones that start with http://. And this URI that you’re fetching was given to you from an untrusted source. So if some joker comes along and types in:
file:///etc/passwd
… I hope you see the problem. Fortunately, all you need to do is check that the URI starts with http:// and return an error if it doesn’t, and you’re sorted. (You may want to allow https:// URIs too)
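That check can be sketched in a few lines of plain Perl. This is only a minimal illustration under assumptions: the helper name is_safe_url is made up here, and it looks at nothing but the scheme prefix, so it says nothing about where the URL actually points.

```perl
use strict;
use warnings;

# Hypothetical helper: true only for URLs that start with http:// or https://.
sub is_safe_url {
    my ($url) = @_;
    return 0 unless defined $url;
    # \A anchors at the very start, so "file:///etc/passwd" is rejected;
    # /i accepts "HTTP://" too, since URI schemes are case-insensitive.
    return $url =~ m{\Ahttps?://}i ? 1 : 0;
}

for my $candidate ('http://www.domain.com/dir/page.html', 'file:///etc/passwd') {
    printf "%-40s %s\n", $candidate, is_safe_url($candidate) ? 'ok' : 'rejected';
}
```

In a real script you would run this before calling get() and return an error page when it fails.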
Next, altering the content. Hey, this kind of text munging is what Perl is for. Note, however, that passing data straight from CGI input into a regular expression can expose more security problems, so clean those variables up first with the quotemeta() function. (This exists for PHP too)
my $target = quotemeta($cgi->param("t"));
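To show why the escaping matters, here is a small self-contained sketch of the replacement step. The function name replace_literal and the sample strings are assumptions for illustration, not the code behind the service above.

```perl
use strict;
use warnings;

# Hypothetical helper: replace every literal occurrence of $target in $text.
sub replace_literal {
    my ($text, $target, $replacement) = @_;
    # An empty pattern has special meaning in Perl, so bail out early
    return $text if !defined $target || $target eq '';
    my $safe = quotemeta($target);   # "1.0" becomes "1\.0"
    # $replacement is inserted as a plain string; its contents are not
    # re-parsed as a pattern or as code.
    $text =~ s/$safe/$replacement/g;
    return $text;
}

print replace_literal('Version 1.0 and 1x0', '1.0', '2.0'), "\n";
# prints "Version 2.0 and 1x0"
```

Without quotemeta() the dot would match any character, and "1x0" would be rewritten as well.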
Point 4 is where it gets a bit hairy. The base URI is the web server folder that the browser will look in for relatively-addressed files referenced by the page. As an example, suppose the page we’re dealing with lives at http://www.domain.com/dir/page.html. The base URI of this page is http://www.domain.com/dir/. The page includes an image specified like so: <img src="frog.png"> . To fetch this image, the browser will append the filename to the base URI to get the complete URI.
The trouble is that the browser works out the base URI from the URI it used to fetch the page, which in this case isn’t going to work because the URI points to our filter script and not the original page. So we need to manually force the browser to use a different base URI.
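The difference is easy to see with a toy resolver. This sketch is deliberately naive: it handles only a bare filename like frog.png, the name resolve_simple is invented for the example, and the filter script's address (filter.example) is an assumption.

```perl
use strict;
use warnings;

# Hypothetical, deliberately naive resolver: handles only a bare filename such
# as "frog.png", not "../", "/absolute" or full URLs.
sub resolve_simple {
    my ($page_url, $relative) = @_;
    (my $base = $page_url) =~ s{\?.*\z}{}s;   # the query string plays no part
    $base =~ s{[^/]*\z}{};                    # keep everything up to the last slash
    return $base . $relative;
}

my $original = 'http://www.domain.com/dir/page.html';
# Assumed address for the filter script, purely for illustration
my $filtered = 'http://filter.example/cgi-bin/replace.cgi?url=http://www.domain.com/dir/page.html';

print resolve_simple($original, 'frog.png'), "\n";  # http://www.domain.com/dir/frog.png
print resolve_simple($filtered, 'frog.png'), "\n";  # http://filter.example/cgi-bin/frog.png
```

The second result is where the browser goes looking for the image when the page is served through the filter, which is why the base has to be overridden.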
First, we need to derive the base URI from the URI we used to fetch the page:
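my $base = "$url/"; # default: a bare host such as http://www.domain.com
if ($url =~ m{^[a-z]+://[^/]+.*/}i) # there is a slash after the host part
{
$base = $&; # keep everything up to and including the last slash
}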
There are two ways to force a base URI change, and I use both of them. The first is to add a Content-Location: header to the HTTP response you send to the browser, specifying the new base. The second is to add a BASE element to the document’s HEAD. There are clean and proper ways of doing this, and I’m going to ignore them and just do a dirty regexp:
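For the header route, a CGI script simply prints the extra line before the blank line that ends the headers. The sketch below assumes a helper called response_headers (a made-up name) and a fixed text/html content type. Bear in mind that later HTTP specifications (RFC 7231) removed the base-URI role of Content-Location, so treat this as an illustration of the approach described here rather than something current browsers are sure to honour.

```perl
use strict;
use warnings;

# Hypothetical helper: build the CGI response header block for a filtered page.
sub response_headers {
    my ($base) = @_;
    # Drop anything from the first line break onwards to avoid header injection
    $base =~ s/[\r\n].*//s;
    return "Content-Type: text/html\r\n"
         . "Content-Location: $base\r\n"
         . "\r\n";                       # blank line ends the headers
}

print response_headers('http://www.domain.com/dir/');
```

The page body would be printed straight after this.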
$html =~ s/(<HEAD([^>"']*|'[^']*'|"[^"]*")*>)/$1\n<BASE HREF="$base">\n/si
unless $html =~ /<BASE HREF=/i;
What that bizarre mess does is look for the document’s HEAD tag and stick a BASE tag immediately after it – but only if the page doesn’t have a BASE tag already. (Let’s hope it’s not in a comment.)
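If the commented-out case bothers you, one possible workaround is to check for an existing BASE tag in a copy of the page with comments stripped. This is a sketch only: add_base_tag is a hypothetical name, the HEAD pattern is simplified to [^>]* rather than the quote-aware version above, and it is still regexp-on-HTML with all the fragility that implies.

```perl
use strict;
use warnings;

# Hypothetical helper: add <BASE HREF="..."> after the opening HEAD tag unless
# the page already declares one outside of HTML comments.
sub add_base_tag {
    my ($html, $base) = @_;

    # Look for an existing BASE tag in a copy with comments stripped, so a
    # commented-out <BASE> does not stop us from adding a real one.
    (my $uncommented = $html) =~ s/<!--.*?-->//gs;
    return $html if $uncommented =~ /<BASE\b[^>]*\bHREF\s*=/i;

    # \b keeps <HEAD from matching <HEADER>; only the first HEAD tag is touched.
    $html =~ s/(<HEAD\b[^>]*>)/$1\n<BASE HREF="$base">\n/i;
    return $html;
}

print add_base_tag('<html><head><title>t</title></head></html>',
                   'http://www.domain.com/dir/'), "\n";
```

A HEAD tag that itself sits inside a comment would still fool the substitution, so this narrows the problem rather than solving it.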
And this is where my bug comes in. Modifying the base seems to work fine for all relatively-addressed items apart from stylesheets, both in MSIE and Mozilla. I don’t know why, and it’s rather irritating. I decided to have a look at some of the better known filters on the web, and found that Pornolize does a bizarre trick that achieves partial (but not entire) success: they modify any LINK and META tags in the page like so:
From:
<link href="http://cheerleader.yoz.com/styles-site.css" type="text/css" rel="stylesheet">
To:
<link /="/" href="http://cheerleader.yoz.com/styles-site.css" type="text/css" rel="stylesheet">
Now, what the hell does /="/" do?
Am confused. The deep practical interaction between HTML and HTTP is bizarre and impenetrable, and I am tired, so this entry ends here with me shaking my head in despair. Do let me know if you can ease my plight.
Nice! But as you’ve pointed out, not without some issues. (Try, for example, replacing all instances of “the” on, say, http://downlode.org/blog.pl with “these”.) Still, good one.
You also get problems if html tags are fiddled with (eg replace ‘<’ with ‘ ‘, or ‘div’ with ‘img’ on any page).
In other news, the HTML::Munger module does a lot of the hard work for you. It avoids your problem by doing it the hard way, by rewriting all relevant urls in the page rather than using a base href.
Earle: Cheers for the bug report; I’ve fixed that.
Paul: HTML::Munger’s nice but doesn’t actually do that much. Plus, I think that ignoring the fact that HTML and HTTP already have a (mostly) good method of changing the base URI is a bit silly. I may use it for a future filter rewrite.
In other news, that example bug I talk about for much of this entry seems to be gone; the filtered page now calls the correct stylesheet, without my having changed anything. Odd.
Unfortunately it also finds and replaces in links, image names etc. So if you fetch http://www.oracle.com and replace “ora” with “xxx” you end up with links to xxxcle.com and all the pictures disappear, because it is looking at xxxcle.com for them!
Hello, you shitty suckers.