White listing User Agents to combat Spam Bots and Scrapers

Posted 2008-03-21 in Spam by Johann.

IncrediBill mentioned white listing user agents to block spam bots and other abusers, so I tried white listing in place of the black listing I had used before.

Here is a short explanation of the difference between black listing and white listing.

Black listing

Black listing means rejecting all requests that match a certain pattern.

Black listing looks like this in my web server configuration:

$HTTP["useragent"] =~ "(bad_bot|comment spammer|spambot 1)" {
 url.access-deny = ( "" )
}

This means that every request whose user agent string contains bad_bot, comment spammer or spambot 1 is served a 403 Forbidden error message.

White listing

White listing means rejecting all requests that do not match a certain pattern.

White listing looks as follows in my configuration:

$HTTP["useragent"] !~ "^(Mozilla|Opera)" {
 url.access-deny = ( "" )
}

This means that only user agent strings starting with Mozilla or Opera are allowed; every other request is served a 403 Forbidden error message.

The downside of white listing is that the number of allowed user agents can be very large. As an example, user agents of Motorola cell phones start with Motorola-, but some also start with MOT-, MOTOROKR or even motorazr.

Right now, I have more than 100 rules in my white list regular expression.
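
As a rough sketch of how such a white list grows, the Motorola prefixes mentioned above could be added to the pattern as shown here. Only the four Motorola prefixes come from the text; this is an illustration, not the actual rule set. The match should be case-sensitive by default, which is why motorazr is listed in lower case.

$HTTP["useragent"] !~ "^(Mozilla|Opera|Motorola-|MOT-|MOTOROKR|motorazr)" {
 # Deny every request whose user agent does not start with a listed prefix
 url.access-deny = ( "" )
}

Every new device family needs another alternative inside the parentheses, which is how the expression ends up with so many rules.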

Which one is right for me?

I recommend using black lists if you cannot spend much time reading log files and changing your web server’s configuration. Black listing, combined with IP blocking of known abusers, can still be effective in limiting bandwidth theft.

Black listing, however, will not protect you against future bad bots and reincarnations of Russian email harvester outfits. This is where white listing is better.
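
A minimal sketch of combining a black list with IP blocking, assuming mod_access is loaded. The range 192.0.2.0/24 is a reserved documentation range used here as a placeholder, not a real abuser; replace it with addresses taken from your own log files.

# Block a known abusive address range (placeholder range, an assumption)
$HTTP["remoteip"] == "192.0.2.0/24" {
 url.access-deny = ( "" )
}
# Block user agents that match the black list
$HTTP["useragent"] =~ "(bad_bot|comment spammer|spambot 1)" {
 url.access-deny = ( "" )
}

The two conditions are independent, so a request is rejected if either one matches.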

4 comments

#1 2008-03-27 by Johann

Bill,

I know the UA_PROF stuff. Don't want to make my config even more complicated than it already is.

I didn't claim my white list is bigger than my black list was, I remember having four to five rows in my black list and my white list is about the same size.

#2 2008-03-27 by IncrediBILL

You don't need to whitelist each phone name as there is a piece of data in the http header that identifies a mobile device.

I'd tell you what it was but wouldn't that spoil all the fun of researching it yourself?

I'm actually surprised you claim your white list was bigger than your black list because I used to have almost 1K items in my black list and now I have less than 30 in my white list, I'll never revert back to the old method.

#3 2009-11-16 by Krayon

A quick search on SourceForge shows there are apparently 1510 browsers available on it, not to mention ones that DON'T use SourceForge. Whitelisting is not feasible (I had to change my UA about 5 times before I could get one that worked for this site).

#4 2009-12-02 by Try both....

Blacklist known baddies
Whitelist known goodies
Everyone else - return a not-allowed-comment page and include them on a to-be-looked-at list
