Throw the bad guys off your web site

A few weeks ago, I started receiving automated emails from my web host (aplus.net) informing me that my site was using too many resources for the hosting package I’d bought. I had to upgrade to a bigger hosting package, or else…! I knew that I hadn’t changed anything substantial, and I wasn’t aware that my blog was getting more readers. Turns out it wasn’t, but somebody was stealing my bandwidth, effectively making me pay for their traffic: A few of my pictures had been embedded into some popular web sites, Myspace profiles, discussion boards, a third-party Facebook application, etc.

There are many articles and tutorials about how to prevent this type of hotlinking using a number of different strategies. Unfortunately some of the methods appear to be incompatible with each other (or at least require more structured .htaccess files than were specified in any of the articles describing the individual methods), so it took me a while to find a working, balanced solution.

But now that I’ve got a working configuration I thought I’d share it here. My solution consists of two separate .htaccess files:

  • A .htaccess file in the web root directory limits traffic by address (known blog SPAM’ers) and agent (known bad guy tools)
  • A .htaccess file in the /blog directory limits remote use (abuse) of my image files

In the excerpts below the notation “[…]” means that I’ve cut out a bunch of similar lines for the purpose of brevity.

Refusing access to bad guys

The main directory .htaccess file denies access to known SPAMmers and known bad guy programs:

Options +FollowSymlinks
RewriteEngine On

<IfModule mod_access.c>
    <FilesMatch ".*">
        Order Allow,Deny
        # Block by domain:
        Deny from .tfl.com.fj
        Deny from .ttnet.net.tr
        Deny from .vtr.net
        […]
        # Block by IP. Ranges can be denied by specifying a partial IP address, e.g. 200.83.4 = 200.83.4.*
        Deny from 122.252.226.40
        Deny from 140.128.20.205
        Deny from 200.83.4
        […]
        Allow from all
    </FilesMatch>
</IfModule>

<Limit GET PUT POST>
    Order Allow,Deny
    RewriteCond %{HTTP_USER_AGENT} ^WebMirror [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^WebReaper [NC,OR]
    […]
    RewriteCond %{HTTP_USER_AGENT} ^Zeus [NC]
    RewriteRule ^.* - [F,L]
</Limit>

The green text (the first two lines, Options and RewriteEngine) is some standard mumbo-jumbo that apparently has to be in the .htaccess file for the tricks to work. Following that, the red block of code (the <IfModule> section) denies all access to people or machines from specific locations on the Internet. These are domains or IPs from which SPAM bots have sent me a lot of unwanted traffic. I’m sure somebody somewhere has built comprehensive blacklists of tremendous size, but I’ve just added a few addresses manually after analyzing server logs and Akismet reports and identifying some repeat offenders.
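
A caveat for newer servers: mod_access was renamed mod_authz_host in Apache 2.2, so an <IfModule mod_access.c> test is silently skipped there, and Apache 2.4 replaces Order/Allow/Deny with Require. Below is a minimal sketch of the same address blocking in the 2.4 syntax. It assumes mod_authz_core and mod_authz_host are loaded, reuses only the domains and addresses visible in the excerpt above, and has not been tried on the host described in this post.

```apache
# Assumption: Apache 2.4 with mod_authz_core and mod_authz_host loaded.
<IfModule mod_authz_core.c>
    <RequireAll>
        # Let everyone in by default ...
        Require all granted
        # ... except these domains (matched against the client's host name)
        Require not host tfl.com.fj
        Require not host ttnet.net.tr
        Require not host vtr.net
        # ... and these addresses; a partial IP such as 200.83.4 covers 200.83.4.*
        Require not ip 122.252.226.40
        Require not ip 140.128.20.205
        Require not ip 200.83.4
    </RequireAll>
</IfModule>
```

Inside <RequireAll> every line has to pass, so a visitor matching any of the "not" lines is refused even though "Require all granted" lets everyone else through.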

The blue block of code (the <Limit> section) denies access to a number of programs that are known to harvest web content, harvest email addresses for later SPAM’ing, or submit blog comment SPAM (I forget where I found the list of HTTP_USER_AGENT codes I use). A request from any of these programs gets only an HTTP error 403: Forbidden.
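
To make the logic of the user-agent test easier to follow, here is a minimal sketch of the same conditions written directly at the top level of a .htaccess file, where they apply to every request method. It uses only the three agent names visible in the excerpt; the elided ones stay elided.

```apache
RewriteEngine On

# Match the start of the User-Agent header, ignoring case (NC).
# OR chains the conditions; without it they would be ANDed together.
RewriteCond %{HTTP_USER_AGENT} ^WebMirror [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WebReaper [NC,OR]
# The last condition in the chain carries no OR flag.
RewriteCond %{HTTP_USER_AGENT} ^Zeus [NC]

# "-" means "no substitution"; F answers with 403 Forbidden.
RewriteRule .* - [F]
```

When adding more agents, the thing to watch is that every condition except the final one keeps its OR flag; an OR left on the last line, or missing from a middle line, changes which requests get blocked.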

Now for the hotlinking

The /blog directory .htaccess file handles hotlinking:

Options +FollowSymlinks
RewriteEngine on

RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^.*jankarlsbjerg\.com [NC]
RewriteCond %{HTTP_REFERER} !^.*google\. [NC]
RewriteCond %{HTTP_REFERER} !^.*yahoo\. [NC]
RewriteCond %{HTTP_REFERER} !^.*search\.live\. [NC]
RewriteCond %{HTTP_REFERER} !^.*ask\.com [NC]
RewriteCond %{HTTP_REFERER} !^.*planet\.northernvoice\.ca [NC]
RewriteCond %{HTTP_REFERER} !^.*urbanvancouver\.com [NC]
RewriteCond %{HTTP_REFERER} !^.*twitter\.com [NC]
RewriteCond %{HTTP_REFERER} !^.*search\?q=cache [NC]
RewriteCond %{HTTP_REFERER} !^.*feedburner\.com/JanKarlsbjerg [NC]
RewriteCond %{HTTP_REFERER} !^.*bloglines\.com [NC]
RewriteRule \.(gif|jpg|jpeg|png|js|css)$ - [F]

The green text is the same mumbo-jumbo as above. The blue block of code (the RewriteCond and RewriteRule lines) serves certain files only when the requesting page is hosted on my own domain (jankarlsbjerg.com) or on a small number of explicitly allowed sites. The affected file types are the extensions listed in the last line of code; if you host video files, you might want to extend that list. If the referring page is located on, for example, Myspace.com, the file will not be served; instead the visitor gets the same HTTP error 403: Forbidden as above.
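
As an illustration of extending the extension list, here is a minimal sketch that also protects video files. The extensions mp4, flv, avi and mov are my own examples and not part of the original list, and the NC flag on the rule is an addition so that upper-case names such as PHOTO.JPG are covered too. Only two of the referrer conditions are shown to keep it short.

```apache
RewriteEngine On

# Skip the rule for empty referrers and for pages on my own domain.
RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^.*jankarlsbjerg\.com [NC]

# Assumption: mp4, flv, avi and mov are example video extensions.
# NC makes the extension match case-insensitive.
RewriteRule \.(gif|jpg|jpeg|png|js|css|mp4|flv|avi|mov)$ - [F,NC]
```

The conditions carry no OR flag, so they are ANDed: the rule fires only when the referrer is non-empty and matches none of the allowed sites.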

The external sites I have allowed to hotlink my picture files are

  • Some search engines
  • A couple of local blog aggregator sites that republish my RSS feed (this way I’m letting readers there see the full blog posts including pictures)
  • A couple of feed reader sites
  • Twitter.com (I might want to link to a picture file from my own Twitter stream)

I have also allowed access with an empty referrer (see the first line of the blue code, the condition ending in !^$), which means that the files can be accessed by someone who types in the file’s URL or uses a bookmark pointing directly to the file.

A final tweak to the error reporting

Using the above tricks it still used to “cost” about 470 bytes of traffic from my server to send out the 403 Forbidden error. That’s because the server actually sends a nice little page of text describing the error. But that can be optimized as well. I’ve added the line below to the .htaccess file in the main directory (on a line just after the green code block, i.e. after RewriteEngine On):

ErrorDocument 403 "Forbidden

Instead of a short page of text, a 403 Forbidden error now sends out only the text “Forbidden”. Depending on the version of the HTTP protocol used by the user’s browser (or other program), the 403 error now “costs” either 9 bytes (HTTP/1.0) or 21 bytes (HTTP/1.1) from my server.

By the way, there are no typos in the red code line above (the ErrorDocument line). “ErrorDocument” is one word, and there only has to be a quotation mark at the beginning of the error message. If you add one to the end of the message, that quotation mark will be sent to the recipient, who will then see the error: Forbidden”
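
The single leading quotation mark is the Apache 1.3 syntax. From Apache 2.0 on, the documented form encloses the message in a matching pair of quotes, so on a newer server the line would presumably look like the sketch below. Which version a given host runs is an assumption to check before copying either form.

```apache
# Apache 2.0 and later: the message is enclosed in a pair of double quotes.
ErrorDocument 403 "Forbidden"
```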

Done

As I mentioned in yesterday’s blog post, it took me a lot of time to work out this solution for my site. In this post I haven’t included anything about the syntax of the individual commands, etc. because there are plenty of articles available that describe .htaccess directives at that low level.

Instead I hope this little tutorial provides an overview of the issues and what can be done to combat server bandwidth abuse. Any comments you may have are very welcome!

The tricks described above threw the bad guys off my web site. Now I suggest you go forth and throw the bad guys off your web site too! 🙂

3 Comments to “Throw the bad guys off your web site”

  1. Candace 15 November 2007 at 12:20 #

    Sorry to hear about the knuckle wrap from your host but glad to see you turned it around and got rid of the bad guys. 🙂

    I’m curious what you’re using to display code in your blog. Is it a WordPress plugin or your own creation? Rob and I have been talking about a good way to show code with syntax highlighting.

    cheers,

    Candace

  2. Candace 15 November 2007 at 12:22 #

    hmmm what happened to the href? That was supposed to be Rob… my goodness…I’m talking to myself on someone else’s blog…

  3. Jan Karlsbjerg 15 November 2007 at 17:05 #

    Hi Candace, good to hear from you. 🙂

    I can’t reproduce the problem with the missing href. But I went in and edited your first comment so it links to Rob’s blog.

    In this post I just used blockquote to indent the code — and then font color to highlight the different sections. Very lazy and un-standard-compliance-geek of me.

    The result is not optimal because (with my style sheet) it puts the quoted text in italics. A code sample font should be monospaced (convention) and small (in order to avoid extra line breaks, which if actually entered into an .htaccess file would certainly break the syntax). But at least you can safely copy-paste the code lines.

    I’ve seen mentions of Windows Live Writer plugins for displaying code (with fancy syntax highlighting for a particular language), and I think I’ve also seen general WordPress plugins of that type. But I haven’t experimented with any of them.

    Btw, the dates for Northern Voice 2008 are out: February 22-23, 2008.

