The top 10 spam bot user agents you MUST block. NOW.

Posted 2007-06-25 in Spam by Johann.

Spambots and badly behaved bots seem to be all the rage this year. While it will be very hard to block all of them, you can do a lot to keep most of the comment spammers away from your blog and scrapers from harvesting your site.

In this entry, I assume you know how to read regular expressions. Note that I mix comment spambots, scrapers and email harvesters somewhat indiscriminately; a sketch of how to wire the whole list up follows below.

User agent strings to block

  • "". That’s right. An empty user agent. If someone can’t be arsed to set a user-agent, why should you serve him anything?
  • ^Java. Not necessarily anything containing Java but user agents starting with Java.
  • ^Jakarta. Don’t ask. Just block.
  • User-Agent. User agents containing User-Agent are most likely spambots operating from Layered Technologies' network – which you should block as well.
  • compatible ;. Note the extra space. Email harvester.
  • "Mozilla". Only the string Mozilla of course.
  • libwww, lwp-trivial, curl, PHP/, urllib, GT::WWW, Snoopy, MFC_Tear_Sample, HTTP::Lite, PHPCrawl, URI::Fetch, Zend_Http_Client, http client, PECL::HTTP. These are all HTTP libraries I DO NOT WANT.
  • panscient.com. Who would forget panscient.com in a list of bad bots?
  • IBM EVV, Bork-edition, Fetch API Request.
  • [A-Z][a-z]{3,} [a-z]{4,} [a-z]{4,}. This matches nonsense user-agents such as Zobv zkjgws pzjngq. Most of these originate from layeredtech.com. Did I mention you should block Layered Technologies?
  • WEP Search, Wells Search II, Missigua Locator, ISC Systems iRc Search 2.1, Microsoft URL Control, Indy Library. Oldies but goldies. Sort of.
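
Here is a rough sketch of how the list above could be wired up, assuming a bit of Python somewhere in your request or comment handling; BLOCKED_PATTERNS and is_blocked are just names I made up for the example:

    import re

    # Patterns from the list above. Unanchored patterns match anywhere in the
    # user agent string; ^ and $ anchor to its start and end.
    BLOCKED_PATTERNS = [
        r"^$",                      # empty user agent
        r"^Java",                   # starts with Java
        r"^Jakarta",                # starts with Jakarta
        r"User-Agent",              # contains "User-Agent"
        r"compatible ;",            # note the extra space
        r"^Mozilla$",               # only the bare string "Mozilla"
        r"libwww|lwp-trivial|curl|PHP/|urllib|GT::WWW|Snoopy|MFC_Tear_Sample",
        r"HTTP::Lite|PHPCrawl|URI::Fetch|Zend_Http_Client|http client|PECL::HTTP",
        r"panscient\.com",
        r"IBM EVV|Bork-edition|Fetch API Request",
        r"[A-Z][a-z]{3,} [a-z]{4,} [a-z]{4,}",  # nonsense like "Zobv zkjgws pzjngq"
        r"WEP Search|Wells Search II|Missigua Locator|ISC Systems iRc Search 2\.1",
        r"Microsoft URL Control|Indy Library",
    ]

    BLOCKED_RE = re.compile("|".join("(?:%s)" % p for p in BLOCKED_PATTERNS))

    def is_blocked(user_agent):
        """Return True if the user agent matches any of the patterns above."""
        return bool(BLOCKED_RE.search(user_agent or ""))

In practice you would put the same expressions wherever your server lets you filter on the User-Agent header (mod_rewrite rules, SetEnvIf, a proxy, ...); the sketch only shows the matching side.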

More stuff to potentially block

You might also block some or all of the following user agents.

  • Nutch
  • larbin
  • heritrix
  • ia_archiver

These come from open-source search engines or crawlers, or from the Internet Archive. I don't see any benefit to my site, so I block them as well.
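
If you do decide to block them, they could simply be appended to the pattern list from the sketch above:

    # Optional additions: open-source crawlers and the Internet Archive.
    BLOCKED_PATTERNS += [
        r"Nutch",
        r"larbin",
        r"heritrix",
        r"ia_archiver",
    ]

    # Rebuild the combined expression after extending the list.
    BLOCKED_RE = re.compile("|".join("(?:%s)" % p for p in BLOCKED_PATTERNS))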

Links

The Project Honey Pot statistics are usually worth keeping an eye on.

6 comments

#1 2008-09-25 by F. Andy Seidl

As a webmaster, you definitely *should* use user-agent headers to manage server traffic. But understand that this is purely a pragmatic tactic and not a serious security measure.

I've written more about this here:

Webmaster Tips: Blocking Selected User-Agents
http://faseidl.com/public/item/213126

#2 2008-10-08 by Jack

Just a thought, but is there an online service to identify whether a particular IP address comes from a bot?

e.g. something like:

http://madeupname.findbot.net/api/isbot?ip=123.45.67.89
that could return something like Y or N?

#3 2008-10-19 by Johann

Jack,

not that I know of. There is Botslist, and there are several scripts for the IP delivery/cloaking guys. Also, you might want to check out Project Honey Pot.

#4 2008-12-14 by Mike

Hi Johann,

Just want to mention that botslist.com now supports the service that Jack requested.

The URL http://www.botslist.com/whois?ip=123.45.67.89 will return a 404 Not Found if the IP is not active in the botslist database and a 200 OK otherwise.

In the second case, detailed information from the database record for the IP will also be returned with the 200 OK response.
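
From a script, that check could look roughly like this, assuming the endpoint behaves exactly as described above (the ip_is_bot name is just for illustration):

    import urllib.request
    import urllib.error

    def ip_is_bot(ip):
        """200 OK means the IP is an active bot in the botslist database,
        404 Not Found means it is not."""
        try:
            urllib.request.urlopen("http://www.botslist.com/whois?ip=" + ip)
            return True                    # 200 OK, response body has the details
        except urllib.error.HTTPError as e:
            if e.code == 404:
                return False               # not in the database
            raise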

Cheers
Mike

#5 2009-10-24 by Lassar

I see you have Mozilla listed.

Why would one want to block Firefox?

#6 2009-10-25 by Johann

Lassar,

the regular expression should be /^Mozilla$/, which blocks only the exact string "Mozilla". A real Firefox user agent starts with Mozilla/5.0 followed by platform and version details, so it is not affected. You can also block /^Mozilla\/5\.0$/, a user agent currently very popular with vulnerability scanners.
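
A quick sketch in Python, just to show what the anchors do:

    import re

    exact_mozilla = re.compile(r"^Mozilla$")
    bare_5_0 = re.compile(r"^Mozilla/5\.0$")

    print(bool(exact_mozilla.search("Mozilla")))                       # True:  blocked
    print(bool(bare_5_0.search("Mozilla/5.0")))                        # True:  blocked
    print(bool(exact_mozilla.search("Mozilla/5.0 (Windows NT 6.1)")))  # False: real browsers pass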
