The top 10 spam bot user agents you MUST block. NOW.
Posted 2007-06-25 in Spam by Johann.
Spambots and badly behaving bots seem to be all the rage this year. While it will be very hard to block all of them, you can do a lot to keep most of the comment spammers away from your blog and scrapers from harvesting your site.
In this entry, I assume you know how to read regular expressions. Note that I randomly mix comment spambots, scrapers and email harvesters.
User agent strings to block
- "". That’s right. An empty user agent. If someone can’t be arsed to set a user-agent, why should you serve him anything?
- ^Java. Not necessarily anything containing Java but user agents starting with Java.
- ^Jakarta. Don’t ask. Just block.
- User-Agent. User agents containing User-Agent are most likely spambots operating from Layered Technologies’s network – which you should block as well.
- compatible ;. Note the extra space. Email harvester.
- "Mozilla". Only the string Mozilla of course.
- libwww, lwp-trivial, curl, PHP/, urllib, GT::WWW, Snoopy, MFC_Tear_Sample, HTTP::Lite, PHPCrawl, URI::Fetch, Zend_Http_Client, http client, PECL::HTTP. These are all HTTP libraries I DO NOT WANT.
- panscient.com. Who would forget panscient.com in a list of bad bots?
- IBM EVV, Bork-edition, Fetch API Request.
- [A-Z][a-z]{3,} [a-z]{4,} [a-z]{4,}. This matches nonsense user-agents such as Zobv zkjgws pzjngq. Most of these originate from layeredtech.com. Did I mention you should block Layered Technologies?
- WEP Search, Wells Search II, Missigua Locator, ISC Systems iRc Search 2.1, Microsoft URL Control, Indy Library. Oldies but goldies. Sort of.
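If you would rather do the check in application code than in the web server config, the list above can be folded into one regular expression. What follows is a minimal sketch in Python, not the setup used on this site. The names LITERALS, REGEXES, BLOCKED and is_blocked are made up for illustration, and it assumes that matching is case-sensitive and that patterns without ^ may match anywhere in the string.

```python
import re

# Plain strings from the list above. They are escaped before use, so
# characters like "." and ":" are matched literally.
LITERALS = [
    "User-Agent", "compatible ;",
    "libwww", "lwp-trivial", "curl", "PHP/", "urllib", "GT::WWW", "Snoopy",
    "MFC_Tear_Sample", "HTTP::Lite", "PHPCrawl", "URI::Fetch",
    "Zend_Http_Client", "http client", "PECL::HTTP",
    "panscient.com",
    "IBM EVV", "Bork-edition", "Fetch API Request",
    "WEP Search", "Wells Search II", "Missigua Locator",
    "ISC Systems iRc Search 2.1", "Microsoft URL Control", "Indy Library",
]

# Entries that are already regular expressions, kept as written.
REGEXES = [
    r"^$",          # empty user agent
    r"^Java",       # starts with Java
    r"^Jakarta",    # starts with Jakarta
    r"^Mozilla$",   # only the string Mozilla
    r"[A-Z][a-z]{3,} [a-z]{4,} [a-z]{4,}",  # nonsense words
]

# Assumption: case-sensitive matching. The nonsense-word pattern relies on
# the difference between upper and lower case.
BLOCKED = re.compile("|".join(REGEXES + [re.escape(s) for s in LITERALS]))


def is_blocked(user_agent):
    """Return True if the user agent matches any pattern in the list.

    A missing header (None) is treated like an empty user agent.
    """
    return BLOCKED.search(user_agent or "") is not None


if __name__ == "__main__":
    for ua in ["", "Java/1.6.0_03", "Zobv zkjgws pzjngq",
               "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"]:
        print(repr(ua), is_blocked(ua))
```

Escaping the plain strings and leaving the regex entries alone keeps both kinds of pattern in one place without changing what they match. Run it against your own access logs before turning it on, since the nonsense-word pattern is unanchored and could catch a legitimate agent.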
More stuff to potentially block
You might also block some or all of the following user agents.
Nutch
larbin
heritrix
ia_archiver
These are coming from open source search engines or crawlers. Or the Internet Archive. I’m not seeing any benefit to my site so I block them as well.
Links
The Project Honey Pot statistics are usually worth keeping an eye on.
- Harvester User Agents | Project Honey Pot
- Top Web Robots | Comment Spammer Agents | Project Honey Pot
- Behind the Scenes with Apache's .htaccess – good resource for the Apache users (I know they exist).
6 comments
#1 2008-09-25 by F. Andy Seidl
As a webmaster, you definitely *should* use user-agent headers to manage server traffic. But understand that this is purely a pragmatic tactic and not a serious security measure.
I've written more about this here:
Webmaster Tips: Blocking Selected User-Agents
http://faseidl.com/public/item/213126
#2 2008-10-08 by Jack
Just a thought, but is there an online service to identify whether a particular IP address comes from a bot?
e.g. something like:
http://madeupname.findbot.net/api/isbot?ip=123.45.67.89
that could return something like Y or N?
Jack,
not that I would know. There is Botslist and several scripts for the IP delivery/cloaking guys. Also, you might want to check out Project Honey Pot.
Hi Johann,
Just want to mention that botslist.com now supports the service that Jack requested.
The URL http://www.botslist.com/whois?ip=123.45.67.89 will return a 404 Not Found if the IP is not active in the botslist database and a 200 OK otherwise.
In the second case, detailed information from the database record for the IP will also be returned with the 200 OK response.
Cheers
Mike