Corporate web abuse: The worst offenders from Cyveillance to PicScout

Posted 2007-10-15 in Spam by Johann.

How to deal with web abusers

Not all abuse on the web is caused by individual spammers and scrapers. Some companies abuse other people’s web servers far more than the average comment spammer does.

Blocking spambots and bad bots by user agent is less effective against corporate abusers who stealth crawl web sites. The best option is to block their networks by IP address.

In this blog entry, I’ll describe seven web abusers, what they do, and how you can keep them off your website.

Cyveillance

Cyveillance does the dirty work for the RIAA and the MPAA.

[They scan and try] to gain unauthorized access to P2P networks, Web servers, IRC servers, FTP servers, and mail servers across the net looking for MP3s and movie titles. They will even go as far as to attempt to break into your systems… After the scanning is done, the site is archived, and the information sold to the RIAA and MPAA, so that they can turn around and sue your 12 year old child, or 80 year old grandmother, and get away with it.

Evidence

38.100.41.100 … "GET …JavaScript... HTTP/1.1" 403 1971 "news:1187732779.931620.254640@a39g2000hsc.googlegroups.com" "Mozilla/4.0 (compatible; MSIE 6.1; Windows XP)" "-"

38.100.41.100 … "GET /help/sitemap.html HTTP/1.1" 403 1971 "…" "Mozilla/4.0 (compatible; MSIE 6.1; Windows XP)" "-"

38.100.41.100 … "GET / HTTP/1.1" 403 1971 "…" "Mozilla/4.0 (compatible; MSIE 6.1; Windows XP)" "-"

38.100.41.100 … "GET /help/copyright.html HTTP/1.1" 403 1971 "…" "Mozilla/4.0 (compatible; MSIE 6.1; Windows XP)" "-"

38.100.41.100 … "GET /blablog/www/2006/06/08/ HTTP/1.1" 403 1971 "…" "Mozilla/4.0 (compatible; MSIE 6.1; Windows XP)" "-"

38.100.41.100 … "GET /misc/handheld.css HTTP/1.1" 403 1971 "…" "Mozilla/4.0 (compatible; MSIE 6.1; Windows XP)" "-"

Netblocks

  • 38.100.19.8/29
  • 38.100.21.0/24
  • 38.100.41.64/26
  • 38.105.71.0/25
  • 38.105.83.0/27
  • 38.112.21.140/30
  • 38.118.42.32/29
  • 65.213.208.128/27
  • 65.222.176.96/27
  • 65.222.185.72/29
  • 151.173.0.0/16

More netblocks might be listed on the Cyveillance Exposed website. I believe Cyveillance frequently obtains new netblocks and drops old ones. I have tried to find them all, but I cannot guarantee the list above is complete.
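If you want to check your own logs against these ranges, CIDR membership tests are easy to script. The following is a minimal sketch using Python’s standard `ipaddress` module with the netblock list above; the function name `is_blocked` is my own:

```python
# Sketch: check a client IP against the Cyveillance netblocks listed above.
import ipaddress

CYVEILLANCE_NETBLOCKS = [
    "38.100.19.8/29", "38.100.21.0/24", "38.100.41.64/26",
    "38.105.71.0/25", "38.105.83.0/27", "38.112.21.140/30",
    "38.118.42.32/29", "65.213.208.128/27", "65.222.176.96/27",
    "65.222.185.72/29", "151.173.0.0/16",
]

NETWORKS = [ipaddress.ip_network(n) for n in CYVEILLANCE_NETBLOCKS]

def is_blocked(ip: str) -> bool:
    """Return True if the IP falls into any of the listed netblocks."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in NETWORKS)

# The IP from the log excerpts above sits inside 38.100.41.64/26:
print(is_blocked("38.100.41.100"))  # True
print(is_blocked("127.0.0.1"))      # False
```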

Websense, Inc.

Websense does stealth crawling and domain scanning with various user agents. I believe they are triggered when you ping Moreover.

Evidence

208.80.193.47 … "GET / HTTP/1.0" 403 345 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; Q312465)" "-"

208.80.193.43 … "GET / HTTP/1.1" 403 345 "-" "Mozilla/5.0 (compatible; Konqueror/3.1-rc1; i686 Linux; 20020921)" "-"

208.80.193.38 … "GET ….html?page=comments HTTP/1.1" 403 3735 "-" "Mozilla/5.0 (compatible; Konqueror/3.1-rc6; i686 Linux; 20021203)" "-"

Netblocks

  • 66.194.6.0/24
  • 208.80.192.0/21
  • 204.15.64.0/21

VeriSign Infrastructure & Operations

I covered the stealth scraping and domain scanning going on in these netblocks before. Basically, a bunch of browser-like user agents request robots.txt and then the root directory.

Evidence

208.17.184.48 … "GET /robots.txt HTTP/1.1" 403 1933 "-" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.12) Gecko/20050922 Fedora/1.0.7-1.1.fc4 Firefox/1.0.7" "-"

208.17.184.48 … "GET / HTTP/1.1" 403 1933 "-" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.12) Gecko/20050922 Fedora/1.0.7-1.1.fc4 Firefox/1.0.7" "-"

69.36.158.62 … "GET /robots.txt HTTP/1.1" 301 0 "-" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.12; ips-agent) Gecko/20050922 Fedora/1.0.7-1.1.fc4 Firefox/1.0.7" "-"

69.36.158.62 … "GET / HTTP/1.1" 301 0 "-" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.12; ips-agent) Gecko/20050922 Fedora/1.0.7-1.1.fc4 Firefox/1.0.7" "-"

69.36.158.63 … "GET /robots.txt HTTP/1.1" 403 345 "-" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.12; ips-agent) Gecko/20050922 Fedora/1.0.7-1.1.fc4 Firefox/1.0.7" "-"

69.36.158.63 … "GET / HTTP/1.1" 403 345 "-" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.12; ips-agent) Gecko/20050922 Fedora/1.0.7-1.1.fc4 Firefox/1.0.7" "-"

Netblocks

Blue Coat Systems, Inc.

Blue Coat builds web proxies with malware and content filtering built-in. Their proxy software pre-fetches content as quickly as possible which makes it look like an abusive scraper. Blue Coat also stealth crawls the web.

Evidence

217.169.46.98 … "GET /software/nativecall HTTP/1.1" 403 3735 "-" "Mozilla/4.0" "-"

217.169.46.98 … "GET /software/nativecall/url(%22/misc/handheld.css%22)%20handheld HTTP/1.1" 403 3735 "-" "Mozilla/4.0" "-"

217.169.46.98 … "GET /blog/music/effects/Kay-Fuzz-Tone.html HTTP/1.1" 403 3735 "-" "Mozilla/4.0" "-"

217.169.46.98 … "GET /blog/music/effects/url(%22/misc/handheld.css%22)%20handheld HTTP/1.1" 403 3735 "-" "Mozilla/4.0" "-"

Netblocks

  • 8.21.4.254 (?)
  • 65.46.48.192/30
  • 65.160.238.176/28
  • 204.246.128.0/20
  • 208.115.138.0/23
  • 216.16.247.0/28
  • 217.169.46.96/28

Secure Computing

Secure Computing scans the web looking for viruses and malware, or so they claim. The problem is that they simply guess URLs, which doesn’t exactly show a lot of cleverness.

Evidence

206.169.110.66 … "GET /main.htm HTTP/1.1" 403 1720 "-" "page_prefetcher" "-"

206.169.110.66 … "GET /main.php HTTP/1.1" 403 1720 "-" "page_prefetcher" "-"

206.169.110.66 … "GET /public.html HTTP/1.1" 403 1720 "-" "page_prefetcher" "-"

(repeat ad nauseam)

Netblocks

  • 206.169.110.0/24

MarkMonitor

MarkMonitor scans the web for phishing websites and counterfeit products. Since I’m not selling anything and don’t want your credit card number, it’s safe from my perspective to block MarkMonitor.

Evidence

They fire a ton of requests at the same file, which indicates either a badly written bot or a desperate attempt to find out whether the page uses cloaking.

64.124.14.109 … "GET …already.html HTTP/1.1" 200 11723 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:0.9.4) Gecko/20011128 Netscape6/6.2.1" "-"

64.124.14.100 … "GET …already.html HTTP/1.1" 200 11723 "-" "Mozilla/5.0 (compatible; Konqueror/3.1; Linux; en)" "-"

64.124.14.100 … "GET …already.html HTTP/1.1" 200 11722 "-" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.6) Gecko/20040218 Galeon/1.3.12" "-"

64.124.14.101 … "GET …already.html HTTP/1.1" 200 11715 "-" "Mozilla/5.0 (Windows; U; Win 9x 4.90; en-US; rv:1.0.1) Gecko/20020823 Netscape/7.0" "-"

64.124.14.101 … "GET …already.html HTTP/1.1" 200 11576 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; DigExt)" "-"

Netblocks

  • 64.124.14.0/25

PicScout

Little is known about PicScout except that they stealth scan the web for unauthorized uses of images belonging to Getty Images and other clients. Generally, whenever a company does the dirty work for other seemingly legitimate companies, I recommend blocking all of them.

Netblocks

  • 82.80.248.0/21. Bezeqint-Hosting. This is where the actual crawler is running from.
  • 62.0.8.0/24. PicScout. Employee office?
  • 206.28.72.0/21. Getty Images.

Evidence

Their crawler changes user agents mid-crawl. Its request rate borders on a DoS attack.

82.80.249.145 … [22/Sep/2007:04:52:23 +0200] "GET …content.html HTTP/1.1" 200 11020 "…" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; (R1 1.1); .NET CLR 1.1.4322)" "JSESSIONID=OMFGBPKMCFEN"

82.80.249.145 … [22/Sep/2007:04:52:23 +0200] "GET …content.html HTTP/1.1" 200 11012 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; .NET CLR 1.1.4322)" "-"

82.80.249.145 … [22/Sep/2007:04:52:26 +0200] "GET …part-1.html HTTP/1.1" 200 9697 "…" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; (R1 1.1); .NET CLR 1.1.4322)" "JSESSIONID=OMFGBPKMCFEN"

82.80.249.145 … [22/Sep/2007:04:52:26 +0200] "GET …part-1.html HTTP/1.1" 200 9697 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; .NET CLR 1.1.4322)" "-"

The verdict

Web abuse is big business. Fortunately, you can control who gets onto your web site and who won’t have access.

If your web host doesn’t let you control access to your web site (via .htaccess, or lighttpd.conf for the lighttpd users), choose another one. Protecting your web site from abuse should be in your interest as well as your host’s.

An introduction to blocking spambots and bad bots

Posted 2007-07-19 in Spam by Johann.

This post explains how to prevent robots from accessing your website.

Why certain robots should be excluded

Not all software on the net is used for activities beneficial to you. Bots are used to

  • harvest email addresses (which are then sold and spammed),
  • scrape websites (stealing content),
  • scan for unlicensed content (MP3 files, movies),
  • spam your blog.

Good bots

robots.txt is a file that specifies how robots may interact with your website. It is always placed in the root directory of a domain (https://johannburkard.de/robots.txt). If a bot respects robots.txt, there really is no need for any of the measures described below.

Bots from the large search engines respect robots.txt. Some bots used for research also do, such as IRL Bot.

To prevent a bot that respects robots.txt from crawling a certain area of your website, add an entry to your robots.txt file.

User-agent: <bot name>
Disallow: /private/stuff

Don’t forget to validate your robots.txt.
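Beyond validating the syntax, you can also check how your rules will actually be interpreted. Here is a small sketch using Python’s standard-library robots.txt parser; the bot name PleaseCrawl and the paths are placeholders of my own:

```python
# Sketch: verify how a robots.txt entry is interpreted, using Python's
# standard-library parser. "PleaseCrawl" is a placeholder bot name.
from urllib import robotparser

rules = """\
User-agent: PleaseCrawl
Disallow: /private/stuff
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# The named bot is kept out of /private/stuff; everything else is allowed.
print(rp.can_fetch("PleaseCrawl", "/private/stuff/secret.html"))  # False
print(rp.can_fetch("PleaseCrawl", "/public/page.html"))           # True
# A bot not named in any rule is unrestricted by this file:
print(rp.can_fetch("SomeOtherBot", "/private/stuff/secret.html")) # True
```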

Bad bots

Bots that don’t respect robots.txt are bad bots. There are several strategies for blocking them.

By user agent

In many cases, bad bots can be identified by their user agent – an identification string sent along with each request. To block a bot based on its user agent, configure your web server accordingly (.htaccess for Apache, lighttpd.conf for lighttpd).

In this example, I configure my web server (lighttpd) to block all requests made by user agents containing PleaseCrawl (as in PleaseCrawl/1.0 (+http://www.pleasefeed.com; FreeBSD)) or Fetch API:

$HTTP["useragent"] =~ "(…|PleaseCrawl|Fetch API|…)" {
 url.access-deny = ( "" )
}

Examples where this strategy can be effective:

  • Known bad bots.
  • HTTP programming libraries such as lwp-trivial or Snoopy.
  • Spam software that contains small hints in the user agent string, such as Bork-edition.
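The lighttpd rule above is just a substring match against the user agent string. The same test can be sketched in a few lines of Python; the pattern list here is illustrative, not exhaustive, and `is_bad_bot` is my own name:

```python
# Sketch: the same substring test the lighttpd rule performs, in Python.
# The pattern list is illustrative, not exhaustive.
import re

BAD_AGENTS = re.compile(r"PleaseCrawl|Fetch API|lwp-trivial|Snoopy")

def is_bad_bot(user_agent: str) -> bool:
    """True if the user agent matches a known bad-bot pattern."""
    return BAD_AGENTS.search(user_agent) is not None

print(is_bad_bot("PleaseCrawl/1.0 (+http://www.pleasefeed.com; FreeBSD)"))  # True
print(is_bad_bot("Fetch API Request"))                                      # True
print(is_bad_bot("Mozilla/5.0 (X11; U; Linux i686; rv:1.7.12) Firefox/1.0.7"))  # False
```

Keep in mind that user agents are trivially forged, which is exactly why the corporate abusers described above require IP-based blocking instead.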

By IP address

Companies that perform stealth crawling mostly operate from their own netblocks. Spammers and scrapers may also have their own IP addresses. You can block these addresses in your firewall or your web server configuration.

Example:

# Symantec

$HTTP["remoteip"] == "65.88.178.0/24" {
 url.access-deny = ( "" )
}

Examples where this strategy can be effective:

  • “Brand monitoring” and “rights management” companies such as Cyveillance, NameProtect, BayTSP.
  • Spam-friendly hosters, such as Layered Technologies.

By behavior

Blocking bots by their behavior is probably the most effective but also the most complicated strategy. I admit I don’t know of any software for behavior-based blocking; the Bad Behavior software only does this on a per-request basis.

The idea is that multiple requests from one IP address are analyzed. To be effective, this analysis must be performed in real-time or near-real-time.

Some factors that could be analyzed:

  • Entry point into the site.
  • Referring page. Accessing a page deep in a site without a referer is somewhat suspicious – some browsers can be configured to suppress the referer, however.
  • Which referer is passed on to scripts and images.
  • Loading of images, CSS or scripts. Bots rarely request the same files real browsers do.
  • Time intervals between requests. Loading multiple pages in very short (or very long) intervals is usually a sign of bots.
  • Variation in request intervals. Human visitors rarely request one page every 10 seconds like clockwork.
  • Incorrect URLs or invalid requests.
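To make the timing factors concrete, here is a minimal sketch of the interval analysis described above: flag IPs whose request timing looks robotic, i.e. many requests at short, near-constant intervals. The thresholds and the function name are my own illustrative choices, not a tuned implementation:

```python
# Sketch: flag IPs whose request timing looks robotic - many requests
# arriving at short, near-constant intervals. Thresholds are illustrative.
from statistics import mean, pstdev

def looks_like_bot(timestamps, min_requests=5, max_mean=12.0, max_jitter=1.0):
    """timestamps: sorted request times (in seconds) from one IP address."""
    if len(timestamps) < min_requests:
        return False
    intervals = [b - a for a, b in zip(timestamps, timestamps[1:])]
    # Short average interval plus almost no variation = metronomic crawler.
    return mean(intervals) <= max_mean and pstdev(intervals) <= max_jitter

# A crawler hitting one page every 10 seconds, like clockwork:
print(looks_like_bot([0, 10, 20, 30, 40, 50]))    # True
# A human browsing with irregular pauses:
print(looks_like_bot([0, 4, 35, 110, 118, 300]))  # False
```

In practice this analysis would run against the live access log (or a tail of it), combined with the other factors above, since timing alone will also catch aggressive but legitimate feed readers.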

Name Intelligence, Inc. won’t stop abuse

Posted 2008-06-01 in Spam by Johann.

Name Intelligence, Inc. is one of the few corporate web abusers that I seriously consider blocking at the firewall. Not because of their stealth crawling but because they assume webmasters are really, really stupid and can’t tell fake bots from real ones.

66.249.16.211 … "GET / HTTP/1.1" 403 345 "http://whois.domaintools.com/…" "Mozilla/5.0 (compatible; YodaoBot/1.0; http://www.yodao.com/help/webmaster/spider/; )"

66.249.16.212 … "GET / HTTP/1.1" 403 4252 "http://whois.domaintools.com/eaio.com" "msnbot/1.1 (+http://search.msn.com/msnbot.htm)"

Forum Scanners - prevent forum abuse

Posted 2008-04-13 in Spam by Johann.

Spammers use forum scanning to find forums that are vulnerable to automated sign-up and posting.

If you are an administrator of a forum and want to prevent automated abuse for your board, then consider blocking the netblocks below.

Forum scanners

10 89.28.14.104    (mail.cigoutlet.us, STARNET S.R.L, MV, 89.28.0.0/17)
10 84.19.180.90    (Keyweb AG, DE, 84.19.160.0/19 and 87.118.64.0/18)
10 77.91.227.113   (WEBALTA, RU, 77.91.224.0/21)
10 77.244.209.198  (gw.fobosnet.ru, RZT Network, RU, 77.244.208.0/20)
10 72.232.162.34   (Layered Technologies, US)
 6 87.118.116.100  (Keyweb AG)
 6 195.210.167.45  (COMSTAR, RU, 195.210.128.0/18)
 5 147.202.28.25   (TEAM Technologies, US, 147.202.0.0/16)
 4 87.118.118.173  (Keyweb AG)
 4 85.255.115.146  (UkrTeleGroup, UA, 85.255.112.0/20)
 2 89.178.154.161  (CORBINA-BROADBAND, RU, Dial-Up)
 2 89.149.226.58   (netdirekt e.K., DE, 89.149.192.0/18)
 2 89.149.208.221  (netdirekt e.K.)
 2 87.99.92.36     (Telenet, LV, Dial-Up)
 2 87.236.29.207   (n207.cpms.ru, CPMS Network, RU, 87.236.24.0/21)
 2 87.118.120.127  (Keyweb AG)
 2 87.118.106.41   (Keyweb AG)
 2 84.200.29.124   (Internet-Homing GmbH, DE, 84.200.29.0/24)
 2 78.157.143.201  (VdHost Ltd, LV, 78.157.143.128/25)
 2 76.120.171.54   (Comcast, US)
 2 75.126.166.122  (Softlayer, US)
 2 72.9.105.42     (Ezzi.net, US, 66.199.224.0/19, 72.9.96.0/20)
 2 72.232.7.10     (Layered Technologies)
 2 69.46.23.155    (Hivelocity Ventures Corporation, US, 69.46.0.0/19)
 2 69.46.16.166    (Hivelocity Ventures Corporation)
 2 69.126.44.157   (Optimum Online, Dial-Up)
 2 60.21.161.73    (CNCGROUP Liaoning province network, CN)
 2 220.130.142.189 (HINET, TW)
 2 200.142.97.194  (Mundivox, BR)
 2 195.2.114.31    (MICROLINK, LV)
 1 61.235.150.228  (China Railway, CN)
 1 60.209.21.101   (China Network, CN)
 1 60.190.79.24    (something in China)
 1 200.27.116.188  (Telmex Chile, CL)

Notes

  • WEBALTA is/was a Russian search engine. Apparently, they also offer hosting.
  • A ton of abuse is coming from UkrTeleGroup, not just forum scanning.
  • Forum scanners seem to like the Keyweb AG.
  • This list was compiled over several weeks.
  • The first number is a rough measure of scanning frequency.
  • I don’t have a forum on johannburkard.de.
