Attributor: Abuse other people’s resources with confidence
Posted 2008-03-12 in Spam by Johann.
Attributor constantly scans billions of web pages to find copies of your content across the Internet
…in stealth mode, of course.
$ egrep '^64.41.145.' <logfile>
64.41.145.193 … "GET /blog/?flavor=rss2 HTTP/1.1" 301 0 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 1.0.3705)" "-"
64.41.145.193 … "GET /blog/ HTTP/1.1" 200 18934 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 1.0.3705)" "-"
Their other crawler is slightly more “open”:
64.41.145.167 … "GET /software/stringsearch/ HTTP/1.0" 403 4252 "…" "Mozilla/5.0 (compatible; attributor/1.13.2 +http://www.attributor.com)" "-"
Block Attributor
Let’s look up the address with the standard whois trick.
(Asked whois.arin.net:43 about +64.41.145.193)
…
OrgName:    Attributor Corporation
OrgID:      ATTRI
Address:    1779 Woodside Road Suite 200
Address:    ATTN Adrian McDermott
City:       REDWOOD CITY
StateProv:  CA
PostalCode: 94061
Country:    US
NetRange:   64.41.145.0 - 64.41.145.255
CIDR:       64.41.145.0/24
Attributor’s netblock: 64.41.145.0/24
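One way to act on that is to deny the whole range at the web server. A minimal lighttpd sketch – assuming mod_access is loaded, as in the empty-user-agent configuration elsewhere on this blog:

```
# Deny every request from Attributor's netblock
$HTTP["remoteip"] == "64.41.145.0/24" {
    url.access-deny = ( "" )
}
```

If you prefer to keep them off the box entirely, a firewall rule on the same CIDR works just as well.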
The top 10 spam bot user agents you MUST block. NOW.
Posted 2007-06-25 in Spam by Johann.
Spambots and badly behaving bots seem to be all the rage this year. While it will be very hard to block all of them, you can do a lot to keep most of the comment spammers away from your blog and scrapers from harvesting your site.
In this entry, I assume you know how to read regular expressions. Note that I randomly mix comment spambots, scrapers and email harvesters.
User agent strings to block
""
. That’s right. An empty user agent. If someone can’t be arsed to set a user-agent, why should you serve him anything?^Java
. Not necessarily anything containingJava
but user agents starting withJava
.^Jakarta
. Don’t ask. Just block.User-Agent
. User agents containingUser-Agent
are most likely spambots operating from Layered Technologies’s network – which you should block as well.compatible ;
. Note the extra space. Email harvester."Mozilla"
. Only the stringMozilla
of course.libwww
,lwp-trivial
,curl
,PHP/
,urllib
,GT::WWW
,Snoopy
,MFC_Tear_Sample
,HTTP::Lite
,PHPCrawl
,URI::Fetch
,Zend_Http_Client
,http client
,PECL::HTTP
. These are all HTTP libraries I DO NOT WANT.panscient.com
. Who would forget panscient.com in a list of bad bots?IBM EVV
,Bork-edition
,Fetch API Request
.[A-Z][a-z]{3,} [a-z]{4,} [a-z]{4,}
. This matches nonsense user-agents such asZobv zkjgws pzjngq
. Most of these originate from layeredtech.com. Did I mention you should block Layered Technologies?WEP Search
,Wells Search II
,Missigua Locator
,ISC Systems iRc Search 2.1
,Microsoft URL Control
,Indy Library
. Oldies but goldies. Sort of.
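To try a given user agent against some of these patterns, you can feed it to egrep. A sketch using a handful of entries from the list above – extend the alternation to taste:

```shell
# Match a User-Agent string against a few of the patterns above.
# Exit status 0 means it is on the blocklist.
printf '%s\n' 'Zobv zkjgws pzjngq' | egrep -q \
  '^$|^Java|^Jakarta|User-Agent|libwww|lwp-trivial|curl|panscient\.com|[A-Z][a-z]{3,} [a-z]{4,} [a-z]{4,}' \
  && echo blocked   # prints "blocked"
```

The same alternation drops straight into a server-side user-agent match.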
More stuff to potentially block
You might also block some or all of the following user agents.
Nutch
larbin
heritrix
ia_archiver
These come from open source search engines and crawlers – or the Internet Archive. I see no benefit to my site, so I block them as well.
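If you prefer to turn these away at the server rather than hope they honor robots.txt, a lighttpd sketch – assuming mod_access, same mechanism as the empty-user-agent rule elsewhere on this blog:

```
# Deny the open source crawlers and the Internet Archive by user agent
$HTTP["useragent"] =~ "Nutch|larbin|heritrix|ia_archiver" {
    url.access-deny = ( "" )
}
```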
Links
The Project Honey Pot statistics are usually worth keeping an eye on.
- Harvester User Agents | Project Honey Pot
- Top Web Robots | Comment Spammer Agents | Project Honey Pot
- Behind the Scenes with Apache's .htaccess – good resource for the Apache users (I know they exist).
Block empty User Agent headers
Posted 2008-04-19 in Spam by Johann.
Blocking requests without a user agent header is a simple step to reduce web abuse. I’ve shown before that this can be a significant number.
On this server, 985 requests without a user agent were made in the last four weeks, which constitutes 6 % of the 14388 blocked requests. 6 % might not sound like much, but once I started whitelisting user agents, the percentage of blocked requests went up from less than 1 % to over 4 %. Unless you also whitelist user agents and block entire netblocks as aggressively as I do, your numbers may well be higher.
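To get the equivalent number for your own site, count the requests whose user agent field is empty. A sketch assuming the standard combined log format, where the user agent is the last quoted field and an empty one is logged as "-" (access.log stands in for your own log file):

```shell
# Count requests with an empty ("-") User-Agent in a combined-format log.
# Splitting on double quotes puts the user agent in the sixth field.
awk -F'"' '$6 == "-" { n++ } END { print n+0 }' access.log
```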
lighttpd
Blocking empty user agents is simple in lighttpd. Edit your lighttpd.conf
as follows:
Ensure that mod_access
is loaded:
server.modules = (
    "mod_accesslog",
    "mod_access",
    … other modules …
)
Add the following line:
$HTTP["useragent"] == "" { url.access-deny = ( "" ) }
Reload the lighttpd configuration and you’re done.
Apache
Enable mod_rewrite
and add the following to your configuration:
RewriteEngine on
RewriteCond %{HTTP_USER_AGENT} ^$
RewriteRule ^.* - [F]
Contributed by Andrew.
If you use a web server other than lighttpd or Apache, please add the configuration to this entry. Thanks.
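For what it’s worth, an untested sketch of what the nginx equivalent might look like, inside a server block (nginx is my assumption here, not something this entry covers):

```
# Return 403 when no User-Agent header was sent
if ($http_user_agent = "") {
    return 403;
}
```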
Mozilla/4.0 (compatible;)
Posted 2007-05-09 in Spam by Johann.
When I first published this entry in May 2007, I thought this was just another web scraper.
… "GET / HTTP/1.1" 200 7518 "-" "Mozilla/4.0 (compatible;)" "-" … "GET /help/copyright.html HTTP/1.1" 200 4127 "-" "Mozilla/4.0 (compatible;)" "-" … "GET /help/sitemap.html HTTP/1.1" 200 4902 "-" "Mozilla/4.0 (compatible;)" "-" … "GET /favicon.ico HTTP/1.1" 200 11502 "-" "Mozilla/4.0 (compatible;)" "-" … "GET /misc/common.css HTTP/1.1" 200 894 "-" "Mozilla/4.0 (compatible;)" "-"
Blue Coat proxies
With a little header analysis, I now know that these requests are caused by Blue Coat’s proxy products. These proxies seem to employ a pre-fetching strategy, meaning they analyze pages as they download them and follow links so that future requests can be served from the proxy cache.
Who uses their proxies? I think Hewlett-Packard do; I know Citigroup and Nokia do. In fact, judging from the entries in my header log file, a lot of companies seem to have their proxies installed.
Blue Coat’s stealth crawling
I could live with the fact that their software makes a ton of highly speculative requests but Blue Coat also have been stealth scanning my web site (most likely for malware) – just like Symantec.