Attributor: Abuse other people’s resources with confidence
Posted 2008-03-12 in Spam by Johann.
Attributor constantly scans billions of web pages to find copies of your content across the Internet
…in stealth mode, of course.
$ egrep '^64.41.145.' <logfile>
64.41.145.193 … "GET /blog/?flavor=rss2 HTTP/1.1" 301 0 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 1.0.3705)" "-"
64.41.145.193 … "GET /blog/ HTTP/1.1" 200 18934 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 1.0.3705)" "-"
Their other crawler is slightly more “open”:
64.41.145.167 … "GET /software/stringsearch/ HTTP/1.0" 403 4252 "…" "Mozilla/5.0 (compatible; attributor/1.13.2 +http://www.attributor.com)" "-"
Block Attributor
Let’s perform the standard whois lookup to see who owns these addresses.
(Asked whois.arin.net:43 about +64.41.145.193)
…
OrgName:    Attributor Corporation
OrgID:      ATTRI
Address:    1779 Woodside Road Suite 200
Address:    ATTN Adrian McDermott
City:       REDWOOD CITY
StateProv:  CA
PostalCode: 94061
Country:    US
NetRange:   64.41.145.0 - 64.41.145.255
CIDR:       64.41.145.0/24
Attributor’s netblock: 64.41.145.0/24.
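With the netblock known, keeping Attributor out at the web server is straightforward. Here is a sketch for lighttpd, assuming mod_access is loaded (as in the empty user agent entry below); a firewall rule on the same range works just as well.

# Sketch: deny everything to Attributor's netblock (lighttpd, mod_access)
$HTTP["remoteip"] == "64.41.145.0/24" {
    url.access-deny = ( "" )
}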
The top 10 spam bot user agents you MUST block. NOW.
Posted 2007-06-25 in Spam by Johann.
Spambots and badly behaving bots seem to be all the rage this year. While it will be very hard to block all of them, you can do a lot to keep most of the comment spammers away from your blog and scrapers from harvesting your site.
In this entry, I assume you know how to read regular expressions. Note that I randomly mix comment spambots, scrapers and email harvesters.
User agent strings to block
"". That’s right. An empty user agent. If someone can’t be arsed to set a user-agent, why should you serve him anything?^Java. Not necessarily anything containingJavabut user agents starting withJava.^Jakarta. Don’t ask. Just block.User-Agent. User agents containingUser-Agentare most likely spambots operating from Layered Technologies’s network – which you should block as well.compatible ;. Note the extra space. Email harvester."Mozilla". Only the stringMozillaof course.libwww,lwp-trivial,curl,PHP/,urllib,GT::WWW,Snoopy,MFC_Tear_Sample,HTTP::Lite,PHPCrawl,URI::Fetch,Zend_Http_Client,http client,PECL::HTTP. These are all HTTP libraries I DO NOT WANT.panscient.com. Who would forget panscient.com in a list of bad bots?IBM EVV,Bork-edition,Fetch API Request.[A-Z][a-z]{3,} [a-z]{4,} [a-z]{4,}. This matches nonsense user-agents such asZobv zkjgws pzjngq. Most of these originate from layeredtech.com. Did I mention you should block Layered Technologies?WEP Search,Wells Search II,Missigua Locator,ISC Systems iRc Search 2.1,Microsoft URL Control,Indy Library. Oldies but goldies. Sort of.
More stuff to potentially block
You might also block some or all of the following user agents.
- Nutch
- larbin
- heritrix
- ia_archiver
These come from open source search engines or crawlers, or the Internet Archive. I’m not seeing any benefit to my site, so I block them as well.
Links
The Project Honey Pot statistics are usually worth keeping an eye on.
- Harvester User Agents | Project Honey Pot
- Top Web Robots | Comment Spammer Agents | Project Honey Pot
- Behind the Scenes with Apache's .htaccess – a good resource for Apache users (I know they exist).
Block empty User Agent headers
Posted 2008-04-19 in Spam by Johann.
Blocking requests without a user agent header is a simple step to reduce web abuse. I’ve shown before that this can be a significant number.
On this server, 985 requests without a user agent were made in the last four weeks, which is 6 % of the 14388 blocked requests. 6 % might not sound like much, but once I started whitelisting user agents, the percentage of blocked requests went up from less than 1 % to over 4 %. Unless you also whitelist user agents and block entire netblocks as aggressively as I do, your number can be higher.
lighttpd
Blocking empty user agents is simple in lighttpd. Edit your lighttpd.conf as follows:
Ensure that mod_access is loaded:
server.modules = (
"mod_accesslog",
"mod_access",
… other modules …
)
Add the following line:
$HTTP["useragent"] == "" {
url.access-deny = ( "" )
}
Reload the lighttpd configuration and you’re done.
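To verify the rule, request a page without sending a user agent and check for the 403. A quick test, assuming your site answers on example.com:

# Passing a bare "User-Agent:" header tells curl to send no user agent at all;
# the rule above should answer with 403 Forbidden.
curl -i -H 'User-Agent:' http://example.com/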
Apache
Enable mod_rewrite and add the following to your configuration:
RewriteEngine on
RewriteCond %{HTTP_USER_AGENT} ^$
RewriteRule ^.* - [F]
Contributed by Andrew.
If you use a web server other than lighttpd or Apache, please add the configuration to this entry. Thanks.
Mozilla/4.0 (compatible;)
Posted 2007-05-09 in Spam by Johann.
When I first published this entry in May 2007, I thought this was just another web scraper.
… "GET / HTTP/1.1" 200 7518 "-" "Mozilla/4.0 (compatible;)" "-" … "GET /help/copyright.html HTTP/1.1" 200 4127 "-" "Mozilla/4.0 (compatible;)" "-" … "GET /help/sitemap.html HTTP/1.1" 200 4902 "-" "Mozilla/4.0 (compatible;)" "-" … "GET /favicon.ico HTTP/1.1" 200 11502 "-" "Mozilla/4.0 (compatible;)" "-" … "GET /misc/common.css HTTP/1.1" 200 894 "-" "Mozilla/4.0 (compatible;)" "-"
Blue Coat proxies
With a little header analysis, I now know that these requests are caused by Blue Coat’s proxy products. These proxies seem to employ a pre-fetching strategy, meaning they analyze pages as they download them and follow links so that future requests can be served from the proxy cache.
Who uses their proxies? I think Hewlett-Packard do; I know Citigroup and Nokia do. In fact, judging from the entries in my header log file, a lot of companies seem to have their proxies installed.
Blue Coat’s stealth crawling
I could live with the fact that their software makes a ton of highly speculative requests, but Blue Coat have also been stealth-scanning my web site (most likely for malware) – just like Symantec.