An introduction to blocking spambots and bad bots
Posted 2007-07-19 in Spam by Johann.
This post explains how to keep unwanted robots from accessing your website.
Why certain robots should be excluded
Not all software on the net is used for activities beneficial to you. Bots are used to
- harvest email addresses (which are then sold and spammed),
- scrape websites (stealing content),
- scan for unlicensed content (MP3 files, movies),
- spam your blog.
Good bots
robots.txt is a file that specifies how robots may interact with your website. It is always placed at the root of a domain (https://johannburkard.de/robots.txt). If a bot respects robots.txt, there really is no need to take any of the measures described below.
Bots from the large search engines respect robots.txt. Some bots used for research do as well, such as IRL Bot.
To prevent a bot that respects robots.txt from crawling a certain area of your website, add an entry to your robots.txt file:
User-agent: <bot name>
Disallow: /private/stuff
Don’t forget to validate your robots.txt.
Bad bots
Bots that don’t respect robots.txt are bad bots. There are several strategies for blocking them.
By user agent
In many cases, bad bots can be addressed by their user agent – a unique identification string sent together with requests. To block a bot based on its user agent, configure your web server accordingly (.htaccess for Apache, lighttpd.conf for lighttpd).
In this example, I configure my web server to block all requests made by user agents containing PleaseCrawl/1.0 (+http://www.pleasefeed.com; FreeBSD) and Fetch API:
$HTTP["useragent"] =~ "(…|PleaseCrawl|Fetch API|…)" {
    url.access-deny = ( "" )
}
Examples where this strategy can be effective:
- Known bad bots.
- HTTP programming libraries such as lwp-trivial or Snoopy.
- Spam software that contains small hints in the user agent string, such as Bork-edition.
By IP address
Companies that perform stealth crawling mostly operate from their own netblocks, and spammers or scrapers may also have their own IP addresses. You can block these addresses on your firewall or in your web server configuration.
Example:
# Symantec
$HTTP["remoteip"] == "65.88.178.0/24" {
    url.access-deny = ( "" )
}
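To drop such traffic at the firewall instead, a rule along the following lines could be used (a sketch assuming iptables on Linux, reusing the example netblock from above):
iptables -A INPUT -s 65.88.178.0/24 -j DROP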
Examples where this strategy can be effective:
- “Brand monitoring” and “rights management” companies such as Cyveillance, NameProtect, BayTSP.
- Spam-friendly hosters, such as Layered Technologies.
By behavior
Blocking bots by their behavior is probably the most effective but also the most complicated strategy. I admit I don’t know of any software for behavior-based blocking. The Bad Behavior software only does this on a per-request basis.
The idea is that multiple requests from one IP address are analyzed. To be effective, this analysis must be performed in real-time or near-real-time.
Some factors that could be analyzed (see the sketch after this list):
- Entry point into the site.
- Referring page. Accessing a page deep in a site without a referer is somewhat suspicious, although some browsers can be configured not to send a referer.
- Which referer is passed on to scripts and images.
- Loading of images, CSS or scripts. Bots rarely request the same files real browsers do.
- Time intervals between requests. Loading multiple pages in very short (or very long) intervals is usually a sign of bots.
- Deviation in request intervals. Human visitors rarely request exactly one page every 10 seconds.
- Incorrect URLs or invalid requests.
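As an illustration, here is a minimal sketch in Python that groups requests by IP address and flags clients whose request intervals are suspiciously regular or that never load images, CSS or scripts. The input format, field names and thresholds are assumptions for this example, not part of any existing tool:

import re
from collections import defaultdict
from statistics import mean, pstdev

# Each entry is assumed to be (ip, unix_timestamp, requested_path),
# e.g. pre-parsed from the web server's access log.
def find_suspects(entries, min_requests=10):
    by_ip = defaultdict(list)
    for ip, ts, path in entries:
        by_ip[ip].append((ts, path))

    suspects = []
    for ip, requests in by_ip.items():
        if len(requests) < min_requests:
            continue
        requests.sort()
        times = [t for t, _ in requests]
        paths = [p for _, p in requests]

        # Intervals between consecutive requests from this IP.
        intervals = [b - a for a, b in zip(times, times[1:])]
        avg = mean(intervals)
        spread = pstdev(intervals)

        # Real browsers also fetch images, CSS and scripts; bots rarely do.
        loads_assets = any(re.search(r"\.(css|js|png|gif|jpe?g)$", p) for p in paths)

        # Example heuristics: very regular intervals or no asset requests at all.
        if (avg > 0 and spread / avg < 0.1) or not loads_assets:
            suspects.append(ip)
    return suspects

The resulting list of IP addresses could then be fed into the firewall or web server rules shown above.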