Corporate web abuse: The worst offenders from Cyveillance to PicScout

Posted 2007-10-15 in Spam by Johann.

Not all web abuse is caused by individual spammers and scrapers. Some companies abuse other people’s web servers far more than the average comment spammer does.

Blocking spambots and bad bots by user agent is less effective against corporate abusers who stealth-crawl web sites. The best option is to block their networks by IP address.

In this blog entry, I’ll describe seven web abusers: what they do and how you can keep them off your website.
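Blocking whole netblocks is straightforward if your host allows .htaccess overrides. A minimal sketch in Apache 2.2 mod_authz_host syntax, using two netblocks from the lists below as examples (extend the Deny list with whichever ranges apply to you):

```apache
# Deny selected abusive netblocks; everyone else stays allowed.
# With "Order Allow,Deny", Deny directives are evaluated after Allow,
# so the listed ranges are refused even though "Allow from all" matches.
Order Allow,Deny
Allow from all
Deny from 38.100.41.64/26
Deny from 208.80.192.0/21
```

Denied clients receive a 403, which is exactly what the log excerpts below show.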

Cyveillance

Cyveillance does the dirty work for the RIAA and the MPAA.

[They scan and try] to gain unauthorized access to P2P networks, Web servers, IRC servers, FTP servers, and mail servers, across the net looking for MP3s, and movie titles. They will even go as far as to attempt to break into your systems… After the scanning is done, the site is archived, and the information sold to the RIAA, and MPAA, so that they can turn around and sue your 12 year old child, or 80 year old grandmother, and get away with it.

Evidence

38.100.41.100 … "GET …JavaScript... HTTP/1.1" 403 1971 "news:1187732779.931620.254640@a39g2000hsc.googlegroups.com" "Mozilla/4.0 (compatible; MSIE 6.1; Windows XP)" "-"

38.100.41.100 … "GET /help/sitemap.html HTTP/1.1" 403 1971 "…" "Mozilla/4.0 (compatible; MSIE 6.1; Windows XP)" "-"

38.100.41.100 … "GET / HTTP/1.1" 403 1971 "…" "Mozilla/4.0 (compatible; MSIE 6.1; Windows XP)" "-"

38.100.41.100 … "GET /help/copyright.html HTTP/1.1" 403 1971 "…" "Mozilla/4.0 (compatible; MSIE 6.1; Windows XP)" "-"

38.100.41.100 … "GET /blablog/www/2006/06/08/ HTTP/1.1" 403 1971 "…" "Mozilla/4.0 (compatible; MSIE 6.1; Windows XP)" "-"

38.100.41.100 … "GET /misc/handheld.css HTTP/1.1" 403 1971 "…" "Mozilla/4.0 (compatible; MSIE 6.1; Windows XP)" "-"

Netblocks

  • 38.100.19.8/29
  • 38.100.21.0/24
  • 38.100.41.64/26
  • 38.105.71.0/25
  • 38.105.83.0/27
  • 38.112.21.140/30
  • 38.118.42.32/29
  • 65.213.208.128/27
  • 65.222.176.96/27
  • 65.222.185.72/29
  • 151.173.0.0/16

More netblocks might be listed on the Cyveillance Exposed website. I believe Cyveillance frequently obtains new netblocks and drops old ones. I have tried to find them all, but I cannot guarantee the above list is complete.

Websense, Inc.

Websense does stealth crawling and domain scanning with various user agents. I believe their visits are triggered when you ping Moreover.

Evidence

208.80.193.47 … "GET / HTTP/1.0" 403 345 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; Q312465)" "-"

208.80.193.43 … "GET / HTTP/1.1" 403 345 "-" "Mozilla/5.0 (compatible; Konqueror/3.1-rc1; i686 Linux; 20020921)" "-"

208.80.193.38 … "GET ….html?page=comments HTTP/1.1" 403 3735 "-" "Mozilla/5.0 (compatible; Konqueror/3.1-rc6; i686 Linux; 20021203)" "-"

Netblocks

  • 66.194.6.0/24
  • 208.80.192.0/21
  • 204.15.64.0/21

VeriSign Infrastructure & Operations

I covered the stealth scraping and domain scanning going on in these netblocks before. Basically, a bunch of browser-looking user agents request robots.txt and then the root directory.

Evidence

208.17.184.48 … "GET /robots.txt HTTP/1.1" 403 1933 "-" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.12) Gecko/20050922 Fedora/1.0.7-1.1.fc4 Firefox/1.0.7" "-"

208.17.184.48 … "GET / HTTP/1.1" 403 1933 "-" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.12) Gecko/20050922 Fedora/1.0.7-1.1.fc4 Firefox/1.0.7" "-"

69.36.158.62 … "GET /robots.txt HTTP/1.1" 301 0 "-" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.12; ips-agent) Gecko/20050922 Fedora/1.0.7-1.1.fc4 Firefox/1.0.7" "-"

69.36.158.62 … "GET / HTTP/1.1" 301 0 "-" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.12; ips-agent) Gecko/20050922 Fedora/1.0.7-1.1.fc4 Firefox/1.0.7" "-"

69.36.158.63 … "GET /robots.txt HTTP/1.1" 403 345 "-" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.12; ips-agent) Gecko/20050922 Fedora/1.0.7-1.1.fc4 Firefox/1.0.7" "-"

69.36.158.63 … "GET / HTTP/1.1" 403 345 "-" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.12; ips-agent) Gecko/20050922 Fedora/1.0.7-1.1.fc4 Firefox/1.0.7" "-"

Netblocks

See my earlier post on VeriSign for the netblocks.
Blue Coat Systems, Inc.

Blue Coat builds web proxies with malware and content filtering built in. Their proxy software pre-fetches content as quickly as possible, which makes it look like an abusive scraper. Blue Coat also stealth-crawls the web.

Evidence

217.169.46.98 … "GET /software/nativecall HTTP/1.1" 403 3735 "-" "Mozilla/4.0" "-"

217.169.46.98 … "GET /software/nativecall/url(%22/misc/handheld.css%22)%20handheld HTTP/1.1" 403 3735 "-" "Mozilla/4.0" "-"

217.169.46.98 … "GET /blog/music/effects/Kay-Fuzz-Tone.html HTTP/1.1" 403 3735 "-" "Mozilla/4.0" "-"

217.169.46.98 … "GET /blog/music/effects/url(%22/misc/handheld.css%22)%20handheld HTTP/1.1" 403 3735 "-" "Mozilla/4.0" "-"

Netblocks

  • 8.21.4.254 (?)
  • 65.46.48.192/30
  • 65.160.238.176/28
  • 204.246.128.0/20
  • 208.115.138.0/23
  • 216.16.247.0/28
  • 217.169.46.96/28

Secure Computing

Secure Computing scans the web looking for viruses and malware. Or so they claim. The problem is that they simply guess URLs, which does not exactly show a lot of cleverness.

Evidence

206.169.110.66 … "GET /main.htm HTTP/1.1" 403 1720 "-" "page_prefetcher" "-"

206.169.110.66 … "GET /main.php HTTP/1.1" 403 1720 "-" "page_prefetcher" "-"

206.169.110.66 … "GET /public.html HTTP/1.1" 403 1720 "-" "page_prefetcher" "-"

(repeat ad nauseam)

Netblocks

  • 206.169.110.0/24

MarkMonitor

MarkMonitor scans the web for phishing websites and counterfeit products. Unfortunately for them, I’m not selling anything, nor do I want your credit card number, so it’s safe from my perspective to block MarkMonitor.

Evidence

They send a ton of requests for the same file, which indicates a badly written bot. Or a desperate attempt to find out whether the page uses cloaking.

64.124.14.109 … "GET …already.html HTTP/1.1" 200 11723 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:0.9.4) Gecko/20011128 Netscape6/6.2.1" "-"

64.124.14.100 … "GET …already.html HTTP/1.1" 200 11723 "-" "Mozilla/5.0 (compatible; Konqueror/3.1; Linux; en)" "-"

64.124.14.100 … "GET …already.html HTTP/1.1" 200 11722 "-" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.6) Gecko/20040218 Galeon/1.3.12" "-"

64.124.14.101 … "GET …already.html HTTP/1.1" 200 11715 "-" "Mozilla/5.0 (Windows; U; Win 9x 4.90; en-US; rv:1.0.1) Gecko/20020823 Netscape/7.0" "-"

64.124.14.101 … "GET …already.html HTTP/1.1" 200 11576 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; DigExt)" "-"

Netblocks

  • 64.124.14.0/25

PicScout

Little is known about PicScout, other than that they stealth-scan the web for unauthorized uses of images belonging to Getty and their other clients. Generally, whenever a company does the dirty work for other seemingly legitimate companies, I recommend blocking all of them.

Netblocks

  • 82.80.248.0/21. Bezeqint-Hosting. This is where the actual crawler runs from.
  • 62.0.8.0/24. PicScout. Employee office?
  • 206.28.72.0/21. Getty Images.

Evidence

Their crawler changes user agents mid-crawl. Its crawl rate borders on a DoS attack.

82.80.249.145 … [22/Sep/2007:04:52:23 +0200] "GET …content.html HTTP/1.1" 200 11020 "…" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; (R1 1.1); .NET CLR 1.1.4322)" "JSESSIONID=OMFGBPKMCFEN"

82.80.249.145 … [22/Sep/2007:04:52:23 +0200] "GET …content.html HTTP/1.1" 200 11012 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; .NET CLR 1.1.4322)" "-"

82.80.249.145 … [22/Sep/2007:04:52:26 +0200] "GET …part-1.html HTTP/1.1" 200 9697 "…" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; (R1 1.1); .NET CLR 1.1.4322)" "JSESSIONID=OMFGBPKMCFEN"

82.80.249.145 … [22/Sep/2007:04:52:26 +0200] "GET …part-1.html HTTP/1.1" 200 9697 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; .NET CLR 1.1.4322)" "-"

The verdict

Web abuse is big business. Fortunately, you can control who you want on your web site and who won’t have access.

If your web host doesn’t let you control access to your web site (via .htaccess, or lighttpd.conf for lighttpd users), then choose another one. Protecting your web site from abuse should be in your interest as well as your host’s.

13 comments

#1 2008-01-24 by William Faulkner

Good work Johann...

PicScout and Getty are currently my pet hates, although I've not fallen foul of their scams myself.

Having only recently started blocking malicious bots it's nice to see so much useful information here, hope you keep it up.

#2 2008-01-24 by Simon Potter

Thanks Johann really useful,
having just been caught up in the outrageous stand-over tactics that Getty uses to intimidate small businesses, I would thoroughly recommend people be particularly conscious of PicScout, and if they do try to pin you down with some spurious claim, make sure you do your research before you give them a penny (there is a good thread on FSB: http://www.fsb.org.uk/discuss/forum_posts.asp?TID=194&PN=1).

thanks again.

#3 2008-05-04 by John Kelly

Hi, I came across your site when trying to find out why I get a lot of visits with ips-agent as the user agent. I think I have had visits from all of those you mention. From their behaviour I surmised they were up to no good and have been blocking them (the good old Deny) for some time. It's nice to see that my thoughts were correct.

#4 2008-08-20 by Dave Jones

Same as above posted - your article was first result for 217.169.46.98 so I denied it immediately. When will these guys learn?

#5 2008-12-03 by Pigeon

Thanks for this page, it was the first result when I was trying to find out what "ips-agent" was up to.

Does anyone know anything about, or has experienced any abuse from, 208.43.128.96? I've had this trying to download the same file every ten minutes. It uses the user-agent string "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1" (yes, the close-bracket is missing). I've blocked it with iptables but it's still trying.

#6 2008-12-04 by Johann

Pigeon,

I've seen this incomplete user agent string before. The IP address is from Softlayer which I've blocked as well for past abuse.

#7 2008-12-06 by John

I would like to add attributor.com to the corporate baddies list. This company is a serious bandwidth hog!

Here's the IP block to deny them...

64.41.145.0/24

#8 2008-12-06 by Johann

John,

Attributor is well known around here as well.

#9 2009-01-02 by Rich

Hi Johann, I recently attempted to block scrapers on my home-made site, by blocking anyone who was not a recognised search bot but was making more than two requests per second. My first attempt stored the request times in Session - it didn't work and only managed to block some poor individual who presumably was using one of those 'prefetch links' plug-ins in their browser. I guess this means that these bots do not support Session persistence ie no cookies and no session ID in the URL.

My next attempt stored request times in the database and did sort of work. Anyone making excessive requests per second was presented with a dummy HTML page. However, this seemed to backfire on me rather badly, as almost as soon as the system was in place, traffic to my site practically halved. Quite why I'm not sure - although I know that Google will block sites that it suspects of cloaking, so perhaps somehow I fell foul of that.

I may in the future look again at it, perhaps by trying IP blocking as you suggest, although there is then the issue of keeping the IPs up to date, and also it would be hard to protect against the people I really want to block, namely one-off cowboy outfits scraping content for use for their own dastardly Google-fooling ends.

Good article though, some interesting points there! All the best

#10 2009-01-02 by Rich

oops - I forgot my question I wanted to ask - which was, as the bots do not seem to support Session, do you think that detecting Session support might be a good enough way to universally protect against these kinds of unwelcome visitors?

ie set a session key, and if it's not there on the next visit from that IP, kill them.

Obviously, you'd need to make sure search engines were exempted first!

I suppose you'd also prevent concurrent requests from the same intranet this way though... hmmm

#11 2009-01-02 by Johann

Rich,

the bots from Cyveillance and others will crawl all URLs that you present them. Most of the spambots also deal with cookies just fine.

#12 2009-02-05 by Pigeon

Found another netblock for Verisign and their ips-agent bot.

It downloaded robots.txt and every page on the top level of the site, then started following links on those pages.

69.58.178.29 - - [05/Feb/2009:04:02:09 +0000] "GET http://www.lucy-pinder.tv/rapidshare-downloads/lucy-pinder-mmagm-trailer.avi.html HTTP/1.1" 302 338 "-" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.12; ips-agent) Gecko/20050922 Fedora/1.0.7-1.1.fc4 Firefox/1.0.7"

Netblock is 69.58.176.0/20

(This appeared not to work, hope I'm not posting a duplicate.)

#13 2009-02-05 by Johann

Pigeon,

thanks for the hint. In fact, I had this one blocked already but didn't add it to this posting.
