Attributor: Abuse other people’s resources with confidence

Posted 2008-03-12 in Spam by Johann.

“Attributor constantly scans billions of web pages to find copies of your content across the Internet”

…in stealth mode, of course.

$ egrep '^64\.41\.145\.' <logfile>
64.41.145.193 … "GET /blog/?flavor=rss2 HTTP/1.1" 301 0 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 1.0.3705)" "-"
64.41.145.193 … "GET /blog/ HTTP/1.1" 200 18934 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 1.0.3705)" "-"

Their other crawler is slightly more “open”:

64.41.145.167 … "GET /software/stringsearch/ HTTP/1.0" 403 4252 "…" "Mozilla/5.0 (compatible; attributor/1.13.2 +http://www.attributor.com)" "-"

Block Attributor

Let’s perform the standard whois lookup.

(Asked whois.arin.net:43 about +64.41.145.193)
…
 OrgName:    Attributor Corporation 
 OrgID:      ATTRI 
 Address:    1779 Woodside Road  Suite 200 
 Address:    ATTN Adrian McDermott 
 City:       REDWOOD CITY 
 StateProv:  CA 
 PostalCode: 94061 
 Country:    US 
 NetRange:   64.41.145.0 - 64.41.145.255 
 CIDR:       64.41.145.0/24 

Attributor’s netblock: 64.41.145.0/24.
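
To lock them out entirely, you can deny the whole range at the web server level. Here is a minimal sketch for Apache 2.2’s mod_authz_host; the directives go into the relevant <Directory> block or .htaccess:

# Deny Attributor's entire netblock, allow everyone else
Order Allow,Deny
Allow from all
Deny from 64.41.145.0/24

lighttpd users can get the same effect by matching $HTTP["remoteip"] against the range.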

The top 10 spam bot user agents you MUST block. NOW.

Posted 2007-06-25 in Spam by Johann.

Spambots and badly behaved bots seem to be all the rage this year. While it will be very hard to block all of them, you can do a lot to keep most comment spammers away from your blog and scrapers from harvesting your site.

In this entry, I assume you know how to read regular expressions. Note that I mix comment spambots, scrapers and email harvesters indiscriminately. A sample Apache translation of some of the patterns follows the list.

User agent strings to block

  • "". That’s right. An empty user agent. If someone can’t be arsed to set a user-agent, why should you serve him anything?
  • ^Java. Not necessarily anything containing Java but user agents starting with Java.
  • ^Jakarta. Don’t ask. Just block.
  • User-Agent. User agents containing User-Agent are most likely spambots operating from Layered Technologies’ network – which you should block as well.
  • compatible ;. Note the extra space. Email harvester.
  • "Mozilla". Only the string Mozilla of course.
  • libwww, lwp-trivial, curl, PHP/, urllib, GT::WWW, Snoopy, MFC_Tear_Sample, HTTP::Lite, PHPCrawl, URI::Fetch, Zend_Http_Client, http client, PECL::HTTP. These are all HTTP libraries I DO NOT WANT.
  • panscient.com. Who would forget panscient.com in a list of bad bots?
  • IBM EVV, Bork-edition, Fetch API Request.
  • [A-Z][a-z]{3,} [a-z]{4,} [a-z]{4,}. This matches nonsense user-agents such as Zobv zkjgws pzjngq. Most of these originate from layeredtech.com. Did I mention you should block Layered Technologies?
  • WEP Search, Wells Search II, Missigua Locator, ISC Systems iRc Search 2.1, Microsoft URL Control, Indy Library. Oldies but goldies. Sort of.
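
As a rough illustration, here is how some of the patterns above could translate into Apache mod_rewrite conditions. Treat this as a sketch rather than a complete rule set, and adapt it to your own logs:

RewriteEngine on
# Empty user agent
RewriteCond %{HTTP_USER_AGENT} ^$ [OR]
# Agents starting with Java or Jakarta
RewriteCond %{HTTP_USER_AGENT} "^(Java|Jakarta)" [OR]
# Common HTTP libraries
RewriteCond %{HTTP_USER_AGENT} "(libwww|lwp-trivial|curl|PHP/|urllib)" [OR]
# Exactly the string Mozilla and nothing else
RewriteCond %{HTTP_USER_AGENT} "^Mozilla$" [OR]
# Nonsense agents such as "Zobv zkjgws pzjngq"
RewriteCond %{HTTP_USER_AGENT} "^[A-Z][a-z]{3,} [a-z]{4,} [a-z]{4,}$"
RewriteRule ^.* - [F]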

More stuff to potentially block

You might also block some or all of the following user agents; a matching rule sketch follows below.

  • Nutch
  • larbin
  • heritrix
  • ia_archiver

These come from open source search engines and crawlers, or from the Internet Archive. I see no benefit to my site, so I block them as well.
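
Assuming RewriteEngine is already on, the same mod_rewrite approach covers these; again, just a sketch:

# Open source crawlers and the Internet Archive
RewriteCond %{HTTP_USER_AGENT} (Nutch|larbin|heritrix|ia_archiver)
RewriteRule ^.* - [F]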

Links

The Project Honeypot statistics are worth keeping an eye on.


Block empty User Agent headers

Posted 2008-04-19 in Spam by Johann.

Blocking requests without a user agent header is a simple step to reduce web abuse. I’ve shown before that this can be a significant number.

On this server, 985 requests without a user agent were made in the last four weeks, roughly 7 % of the 14388 blocked requests. That may not sound like much, but once I started whitelisting user agents, the percentage of blocked requests went up from less than 1 % to over 4 %. Unless you also whitelist user agents and block entire netblocks as aggressively as I do, your number may be higher.

lighttpd

Blocking empty user agents is simple in lighttpd. Edit your lighttpd.conf as follows:

Ensure that mod_access is loaded:

server.modules = (
    "mod_accesslog",
    "mod_access",
    … other modules …
)

Add the following line:

$HTTP["useragent"] == "" {
 url.access-deny = ( "" )
}

Reload the lighttpd configuration and you’re done.

Apache

Enable mod_rewrite and add the following to your configuration:

RewriteEngine on
# Forbid requests with an empty User-Agent header
RewriteCond %{HTTP_USER_AGENT} ^$
RewriteRule ^.* - [F]

Contributed by Andrew.

If you use a web server other than lighttpd or Apache, please add the configuration to this entry. Thanks.


Mozilla/4.0 (compatible;)

Posted 2007-05-09 in Spam by Johann.

When I first published this entry in May 2007, I thought this was just another web scraper.

… "GET / HTTP/1.1" 200 7518 "-" "Mozilla/4.0 (compatible;)" "-"
… "GET /help/copyright.html HTTP/1.1" 200 4127 "-" "Mozilla/4.0 (compatible;)" "-"
… "GET /help/sitemap.html HTTP/1.1" 200 4902 "-" "Mozilla/4.0 (compatible;)" "-"
… "GET /favicon.ico HTTP/1.1" 200 11502 "-" "Mozilla/4.0 (compatible;)" "-"
… "GET /misc/common.css HTTP/1.1" 200 894 "-" "Mozilla/4.0 (compatible;)" "-"

Blue Coat proxies

With a little header analysis, I now know that these requests are caused by Blue Coat’s proxy products. These proxies seem to employ a pre-fetching strategy, meaning they analyze pages as they download them and follow links so that future requests can be served from the proxy cache.

Who uses their proxies? I think Hewlett-Packard do; I know Citigroup and Nokia do. In fact, judging from the entries in my header log file, a lot of companies seem to have their proxies installed.

Blue Coat’s stealth crawling

I could live with the fact that their software makes a ton of highly speculative requests, but Blue Coat have also been stealth-scanning my web site (most likely for malware) – just like Symantec.
