An introduction to blocking spambots and bad bots
Posted 2007-07-19 in Spam by Johann.
This post explains how to keep unwanted robots from accessing your website.
Why certain robots should be excluded
Not all software on the net is used for activities beneficial to you. Bots are used to
- harvest email addresses (which are then sold and spammed),
- scrape websites (stealing content),
- scan for unlicensed content (MP3 files, movies),
- spam your blog.
Good bots
robots.txt is a file that specifies how robots may interact with your website. It is always placed at the root of a domain (https://johannburkard.de/robots.txt). If a bot respects robots.txt, there really is no need to take any of the measures described below.
Bots from the large search engines respect robots.txt. Some bots used for research do as well, such as IRL Bot.
To prevent a bot that respects robots.txt from crawling a certain area of your website, add an entry to your robots.txt file:
User-agent: <bot name>
Disallow: /private/stuff
Don’t forget to validate your robots.txt.
Bad bots
Bots that don’t respect robots.txt are bad bots. There are several strategies for blocking them.
By user agent
In many cases, bad bots can be addressed by their user agent – a unique identification string sent together with requests. To block a bot based on its user agent, configure your web server accordingly (.htaccess for Apache, lighttpd.conf for lighttpd).
In this example, I configure my web server to block all requests made by user agents containing PleaseCrawl/1.0 (+http://www.pleasefeed.com; FreeBSD) and Fetch API:
$HTTP["useragent"] =~ "(…|PleaseCrawl|Fetch API|…)" {
    url.access-deny = ( "" )
}
Examples where this strategy can be effective:
- Known bad bots.
- HTTP programming libraries such as lwp-trivial or Snoopy.
- Spam software that contains small hints in the user agent string, such as Bork-edition.
By IP address
Companies that perform stealth crawling mostly operate from their own netblocks, and spammers or scrapers may also have their own IP addresses. You can block these addresses on your firewall or in your web server configuration.
Example:
# Symantec
$HTTP["remoteip"] == "65.88.178.0/24" {
    url.access-deny = ( "" )
}
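To drop such traffic at the firewall instead, a rule along the following lines could be used (a sketch assuming iptables on Linux, reusing the example netblock from above):
iptables -A INPUT -s 65.88.178.0/24 -j DROP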
Examples where this strategy can be effective:
- “Brand monitoring” and “rights management” companies such as Cyveillance, NameProtect, BayTSP.
- Spam-friendly hosters, such as Layered Technologies.
By behavior
Blocking bots by their behavior is probably the most effective but also the most complicated strategy. I admit I don’t know of any software for behavior-based blocking. The Bad Behavior software only does this on a per-request basis.
The idea is that multiple requests from one IP address are analyzed. To be effective, this analysis must be performed in real-time or near-real-time.
Some factors that could be analyzed (see the sketch after this list):
- Entry point into the site.
- Referring page. Accessing a page deep in a site without a referer is somewhat suspicious, although some browsers can be configured not to send a referer.
- Which referer is passed on to scripts and images.
- Loading of images, CSS or scripts. Bots rarely request the same files real browsers do.
- Time intervals between requests. Loading multiple pages in very short (or very long) intervals is usually a sign of bots.
- Deviation in request intervals. Human visitors rarely request exactly one page every 10 seconds.
- Incorrect URLs or invalid requests.
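As an illustration, here is a minimal sketch in Python that groups requests by IP address and flags clients whose request intervals are suspiciously regular or that never load images, CSS or scripts. The input format, field names and thresholds are assumptions for this example, not part of any existing tool:

import re
from collections import defaultdict
from statistics import mean, pstdev

# Each entry is assumed to be (ip, unix_timestamp, requested_path),
# e.g. pre-parsed from the web server's access log.
def find_suspects(entries, min_requests=10):
    by_ip = defaultdict(list)
    for ip, ts, path in entries:
        by_ip[ip].append((ts, path))

    suspects = []
    for ip, requests in by_ip.items():
        if len(requests) < min_requests:
            continue
        requests.sort()
        times = [t for t, _ in requests]
        paths = [p for _, p in requests]

        # Intervals between consecutive requests from this IP.
        intervals = [b - a for a, b in zip(times, times[1:])]
        avg = mean(intervals)
        spread = pstdev(intervals)

        # Real browsers also fetch images, CSS and scripts; bots rarely do.
        loads_assets = any(re.search(r"\.(css|js|png|gif|jpe?g)$", p) for p in paths)

        # Example heuristics: very regular intervals or no asset requests at all.
        if (avg > 0 and spread / avg < 0.1) or not loads_assets:
            suspects.append(ip)
    return suspects

The resulting list of IP addresses could then be fed into the firewall or web server rules shown above.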