The top 10 spam bot user agents you MUST block. NOW.
Posted 2007-06-25 in Spam by Johann.
Spambots and badly behaving bots seem to be all the rage this year. While it will be very hard to block all of them, you can do a lot to keep most of the comment spammers away from your blog and scrapers from harvesting your site.
In this entry, I assume you know how to read regular expressions. Note that I randomly mix comment spambots, scrapers and email harvesters.
User agent strings to block
"". That’s right. An empty user agent. If someone can’t be arsed to set a user-agent, why should you serve him anything?^Java. Not necessarily anything containingJavabut user agents starting withJava.^Jakarta. Don’t ask. Just block.User-Agent. User agents containingUser-Agentare most likely spambots operating from Layered Technologies’s network – which you should block as well.compatible ;. Note the extra space. Email harvester."Mozilla". Only the stringMozillaof course.libwww,lwp-trivial,curl,PHP/,urllib,GT::WWW,Snoopy,MFC_Tear_Sample,HTTP::Lite,PHPCrawl,URI::Fetch,Zend_Http_Client,http client,PECL::HTTP. These are all HTTP libraries I DO NOT WANT.panscient.com. Who would forget panscient.com in a list of bad bots?IBM EVV,Bork-edition,Fetch API Request.[A-Z][a-z]{3,} [a-z]{4,} [a-z]{4,}. This matches nonsense user-agents such asZobv zkjgws pzjngq. Most of these originate from layeredtech.com. Did I mention you should block Layered Technologies?WEP Search,Wells Search II,Missigua Locator,ISC Systems iRc Search 2.1,Microsoft URL Control,Indy Library. Oldies but goldies. Sort of.
More stuff to potentially block
You might also block some or all of the following user agents.
- Nutch
- larbin
- heritrix
- ia_archiver
These come from open source search engines or crawlers. Or the Internet Archive. I’m not seeing any benefit to my site, so I block them as well.
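If you block at the web server rather than in application code, something along these lines would work in an Apache .htaccess file – a sketch only, assuming mod_rewrite is enabled; adjust the pattern list to taste:

```apache
# Refuse requests from the open source crawlers listed above.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (Nutch|larbin|heritrix|ia_archiver) [NC]
RewriteRule .* - [F,L]
```

The [F] flag returns 403 Forbidden; [NC] makes the match case-insensitive.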
Links
The Project Honeypot statistics are usually a good place to keep an eye on.
- Harvester User Agents | Project Honey Pot
- Top Web Robots | Comment Spammer Agents | Project Honey Pot
- Behind the Scenes with Apache's .htaccess – good resource for the Apache users (I know they exist).
6 comments
#1 2008-09-25 by F. Andy Seidl
As a webmaster, you definitely *should* use user-agent headers to manage server traffic. But understand that this is purely a pragmatic tactic and not a serious security measure.
I've written more about this here:
Webmaster Tips: Blocking Selected User-Agents
http://faseidl.com/public/item/213126
#2 2008-10-08 by Jack
Just a thought, but is there an online service to identify if a particular IP address comes from a bot?
e.g. something like:
http://madeupname.findbot.net/api/isbot?ip=123.45.67.89
that could return something like Y or N?
Jack,
not that I would know. There is Botslist and several scripts for the IP delivery/cloaking guys. Also, you might want to check out Project Honey Pot.
Hi Johann,
Just want to mention that botslist.com now supports the service that Jack requested.
The URL http://www.botslist.com/whois?ip=123.45.67.89 will return a 404 Not Found if the IP is not active in the botslist database and a 200 OK otherwise.
In the latter case, detailed information from the database record for the IP is also returned with the 200 OK response.
Cheers
Mike
Lassar,
the regular expression should be /^Mozilla$/ which blocks only the string "Mozilla." You can also block /^Mozilla\/5\.0$/, a user agent currently very popular with vulnerability scanners.
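To illustrate the anchoring – a quick Python sketch, with the /…/-delimited regexes from the reply translated into Python raw strings – these patterns match only the bare strings, never a real browser user agent that merely starts with "Mozilla":

```python
import re

# Anchored patterns: ^ and $ pin the match to the whole string.
bare_mozilla = re.compile(r"^Mozilla$")
bare_mozilla50 = re.compile(r"^Mozilla/5\.0$")

def blocked(ua: str) -> bool:
    """True only for the exact strings "Mozilla" or "Mozilla/5.0"."""
    return bool(bare_mozilla.match(ua) or bare_mozilla50.match(ua))
```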