panscient.com = bad bot

Posted 2007-05-23 in Spam by Johann.

Another one for the garbage can. It sends malformed requests and ignores robots.txt, although they claim to respect it.

38.99.203.110 … "GET / HTTP/1.1" … "panscient.com"
(robots.txt not asked for)
38.99.203.110 … "GET /;5B=/ HTTP/1.1" … "panscient.com" (WTF?)
38.99.203.110 … "GET /<prohibited directory> HTTP/1.1" … "panscient.com"

7 comments

This much Nutch is too much Nutch

Posted 2007-09-17 in Spam by Johann.

Nutch is like giving free TNT sticks to children.

In theory it could be used for useful things.

In reality, most of its users appear to be scrapers and wannabe search engines.

Nutch bot user agents

I searched half a year’s worth of logfiles for Nutch crawlers:

  • //Nutch-0.9-dev (compatible; SynooBot/0.9; http://www.synoo.com/search/bot.html)
  • Bigsearch.ca/Nutch-1.0-dev (Bigsearch.ca Internet Spider; http://www.bigsearch.ca/; info@enhancededge.com)
  • Bloodhound/Nutch-0.9 (Testing Crawler for Research - obeys robots.txt and robots meta tags ; http://balihoo.com/index.aspx; robot at balihoo dot com)
  • CazoodleBot/Nutch-0.9-dev (CazoodleBot Crawler; http://www.cazoodle.com/cazoodlebot; cazoodlebot@cazoodle.com)
  • CS/Nutch-0.9
  • disco/Nutch-0.9 (experimental crawler ... please email imagine@gmail.com if problems observed; imagine@gmail.com)
  • disco/Nutch-0.9 (experimental crawler ... please email imagine@gmail.com if problems observed; nedrocks@gmail.com)
  • Firefox/Nutch-0.8 (Test Robot; winyio@ustc)
  • HD nutch agent/1.0
  • HPL/Nutch-0.9
  • ilial/Nutch-0.9 (Ilial, Inc. is a Los Angeles based Internet startup company. For more information please visit http://www.ilial.com/crawler; http://www.ilial.com/crawler; crawl@ilial.com)
  • ilial/Nutch-0.9 (Ilial, Inc. is a Los Angeles based Internet startup company.; http://www.ilial.com/crawler; crawl@ilial.com)
  • ImageShack/Nutch-1.0-dev
  • infomisa/Nutch-0.9
  • Krugle/Krugle,Nutch/0.8+ (Krugle web crawler; http://corp.krugle.com/crawler/info.html; webcrawler@krugle.com)
  • Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en) Opera 8.01/Nutch-0.8.1 (http://lucene.apache.org/nutch/about.html; http://lucene.apache.org/nutch/bot.html; mail@dev.null)
  • Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en) Opera 8.01/Nutch-0.9 (http://lucene.apache.org/nutch/about.html; http://lucene.apache.org/nutch/bot.html; mail@dev.null)
  • MQBOT/Nutch-0.9-dev (MQBOT Nutch Crawler; http://vwbot.cs.uiuc.edu; mqbot@cs.uiuc.edu)
  • Nutch/Nutch-0.8.1 (Nutch; Nutch; Nutch)
  • Nutch/Nutch-0.9
  • NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
  • NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
  • NutchCVS/Nutch-0.9 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
  • NutchForUW/Nutch-0.9 (Nutch agent; http://lucene.apache.org/nutch; test at nutch dot org)
  • Pluggd/Nutch-0.9 (Pluggd automated crawler; http://www.pluggd.com; support at pluggd dot com)
  • sait/Nutch-0.9 (SAIT TEST AGENT; http://www.sait.samsung.co.kr)
  • tellbaby/Nutch-0.9 (www.tellbaby.com)
  • tellbaby/Nutch-1.0-dev (http://www.tellbaby.com)
  • TestCrawler/Nutch-0.9 (Testing Crawler for Research ; http://chitchit.org/TestCrawler.html; amitjain at spro dot net)
  • WannesWeb-Group Agent/Nutch-0.9 (Agent de recherche pour le moteur WannesWeb-Group; www.wannesweb-Greoup.com; agent@wannesweb-group.com)
  • Webscope/Nutch-0.9-dev (http://www.cs.washington.edu/homes/mjc/agent.html)
  • wectar/Nutch-0.9 (nectar extracted form the glorious web; http://goosebumps4all.net/wectar; see website)
  • yggdrasil/Nutch-0.9 (yggdrasil biorelated search engine; www dot biotec dot tu minus dresden do de slash schroeder; heiko dot dietze at biotec dot tu minus dresden dot de)
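
A list like the one above can be pulled out of combined-format logs with a one-liner (access.log is a placeholder filename; the awk split assumes the user agent is the sixth double-quoted field, as in the standard combined format):

```shell
# Extract distinct Nutch user agents from a combined-format access log.
# In combined format, the user agent is the last double-quoted field.
grep -i 'nutch' access.log | awk -F'"' '{print $6}' | sort -u
```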

Nutch? No, thanks.

Some of these user agent strings do not even contain an email address. Others attempt to impersonate existing user agents.

Stopping Nutch

Nutch-based crawlers can be blocked by filtering out all user agents containing Nutch or nutch. In lighttpd, this looks as follows:

$HTTP["useragent"] =~ "(Nutch|nutch)" {
    url.access-deny = ( "" )
}
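
For Apache, a minimal equivalent sketch (assuming mod_rewrite is enabled; not tested on your setup) would be:

```apache
RewriteEngine On
# Case-insensitive match on "nutch" anywhere in the user agent
RewriteCond %{HTTP_USER_AGENT} nutch [NC]
# Deny with 403 Forbidden
RewriteRule .* - [F]
```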

10 comments

panscient.com = still bad bot

Posted 2007-06-07 in Spam by Johann.

Just as the alleged CEO of panscient.com claims that their crawlers “currently obey the standard robots protocol”, I find the following in my logs:

$ grep "\"panscient.com\"" <logfile>
38.99.203.110 johannburkard.de [02/Jun/2007:06:56:54 +0200] "GET /?johannburkard.com HTTP/1.1" 403 3778 "-" "panscient.com" "-"
$

What were they looking for here? That I registered a .com of my name? I have, thanks. How did they find the domain? Last time I checked there were no links to johannburkard.com.

Of course it still does not ask for robots.txt. Yawn…
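
Whether a client ever fetched robots.txt is easy to check from the logs (access.log is a placeholder filename; the IP is the one from the excerpt above):

```shell
# All requests from the suspect IP, filtered for robots.txt.
# No output means the crawler never asked for it.
grep '^38\.99\.203\.110 ' access.log | grep 'robots\.txt'
```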

15 comments
