panscient.com = bad bot
Posted 2007-05-23 in Spam by Johann.
Another one for the garbage can. It creates bad requests and doesn't respect robots.txt, although they claim to do so.
38.99.203.110 … "GET / HTTP/1.1" … "panscient.com" (robots.txt not asked for)
38.99.203.110 … "GET /;5B=/ HTTP/1.1" … "panscient.com" (WTF?)
38.99.203.110 … "GET /<prohibited directory> HTTP/1.1" … "panscient.com"
This much Nutch is too much Nutch
Posted 2007-09-17 in Spam by Johann.
Nutch is like giving free TNT sticks to children.
In theory it could be used for useful things.
In reality, most of its users appear to be scrapers and wannabe search engines.
Nutch bot user agents
I searched half a year's worth of logfiles for Nutch crawlers:
//Nutch-0.9-dev (compatible; SynooBot/0.9; http://www.synoo.com/search/bot.html)
Bigsearch.ca/Nutch-1.0-dev (Bigsearch.ca Internet Spider; http://www.bigsearch.ca/; info@enhancededge.com)
Bloodhound/Nutch-0.9 (Testing Crawler for Research - obeys robots.txt and robots meta tags ; http://balihoo.com/index.aspx; robot at balihoo dot com)
CazoodleBot/Nutch-0.9-dev (CazoodleBot Crawler; http://www.cazoodle.com/cazoodlebot; cazoodlebot@cazoodle.com)
CS/Nutch-0.9
disco/Nutch-0.9 (experimental crawler ... please email imagine@gmail.com if problems observed; imagine@gmail.com)
disco/Nutch-0.9 (experimental crawler ... please email imagine@gmail.com if problems observed; nedrocks@gmail.com)
Firefox/Nutch-0.8 (Test Robot; winyio@ustc)
HD nutch agent/1.0
HPL/Nutch-0.9
ilial/Nutch-0.9 (Ilial, Inc. is a Los Angeles based Internet startup company. For more information please visit http://www.ilial.com/crawler; http://www.ilial.com/crawler; crawl@ilial.com)
ilial/Nutch-0.9 (Ilial, Inc. is a Los Angeles based Internet startup company.; http://www.ilial.com/crawler; crawl@ilial.com)
ImageShack/Nutch-1.0-dev
infomisa/Nutch-0.9
Krugle/Krugle,Nutch/0.8+ (Krugle web crawler; http://corp.krugle.com/crawler/info.html; webcrawler@krugle.com)
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en) Opera 8.01/Nutch-0.8.1 (http://lucene.apache.org/nutch/about.html; http://lucene.apache.org/nutch/bot.html; mail@dev.null)
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en) Opera 8.01/Nutch-0.9 (http://lucene.apache.org/nutch/about.html; http://lucene.apache.org/nutch/bot.html; mail@dev.null)
MQBOT/Nutch-0.9-dev (MQBOT Nutch Crawler; http://vwbot.cs.uiuc.edu; mqbot@cs.uiuc.edu)
Nutch/Nutch-0.8.1 (Nutch; Nutch; Nutch)
Nutch/Nutch-0.9
NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
NutchCVS/Nutch-0.9 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
NutchForUW/Nutch-0.9 (Nutch agent; http://lucene.apache.org/nutch; test at nutch dot org)
Pluggd/Nutch-0.9 (Pluggd automated crawler; http://www.pluggd.com; support at pluggd dot com)
sait/Nutch-0.9 (SAIT TEST AGENT; http://www.sait.samsung.co.kr)
tellbaby/Nutch-0.9 (www.tellbaby.com)
tellbaby/Nutch-1.0-dev (http://www.tellbaby.com)
TestCrawler/Nutch-0.9 (Testing Crawler for Research ; http://chitchit.org/TestCrawler.html; amitjain at spro dot net)
WannesWeb-Group Agent/Nutch-0.9 (Agent de recherche pour le moteur WannesWeb-Group; www.wannesweb-Greoup.com; agent@wannesweb-group.com)
Webscope/Nutch-0.9-dev (http://www.cs.washington.edu/homes/mjc/agent.html)
wectar/Nutch-0.9 (nectar extracted form the glorious web; http://goosebumps4all.net/wectar; see website)
yggdrasil/Nutch-0.9 (yggdrasil biorelated search engine; www dot biotec dot tu minus dresden do de slash schroeder; heiko dot dietze at biotec dot tu minus dresden dot de)
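A logfile search like the one above can be sketched as a short shell pipeline. The log excerpt below is illustrative (combined log format with the user agent as the last quoted field), not taken from the actual logs:

```shell
# Illustrative log excerpt; the user agent is the final quoted field.
cat > access.log <<'EOF'
1.2.3.4 - - [17/Sep/2007:12:00:00 +0200] "GET / HTTP/1.1" 200 512 "-" "CazoodleBot/Nutch-0.9-dev (CazoodleBot Crawler; http://www.cazoodle.com/cazoodlebot; cazoodlebot@cazoodle.com)"
5.6.7.8 - - [17/Sep/2007:12:00:05 +0200] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (Windows NT 5.1)"
EOF

# Case-insensitive match on "nutch", keep only the user agent field, deduplicate.
grep -i nutch access.log | sed 's/.*"\([^"]*\)"$/\1/' | sort -u
```

The same pipeline, pointed at real logfiles, yields a list like the one above.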
Nutch? No, thanks.
Some of these user agent strings do not even contain an email address. Others are attempting to fake existing user agents.
Stopping Nutch
Nutch-based crawlers can be blocked by filtering out all user agents containing Nutch or nutch. In lighttpd, this looks as follows:
$HTTP["useragent"] =~ "(Nutch|nutch)" {
  url.access-deny = ( "" )
}
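The same idea carries over to other servers. As a sketch for Apache (assuming version 2.x with mod_setenvif and the classic Order/Deny access control loaded; directive names are standard, the env variable name is made up):

```apache
# Mark any client whose User-Agent contains "nutch" (case-insensitive),
# then deny all marked clients site-wide.
SetEnvIfNoCase User-Agent nutch bad_bot
<Location />
    Order Allow,Deny
    Allow from all
    Deny from env=bad_bot
</Location>
```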
panscient.com = still bad bot
Posted 2007-06-07 in Spam by Johann.
Just as the alleged CEO of panscient.com claims that their crawlers currently obey the standard robots protocol, I find the following in my logs:
$ grep "\"panscient.com\"" <logfile>
38.99.203.110 johannburkard.de [02/Jun/2007:06:56:54 +0200] "GET /?johannburkard.com HTTP/1.1" 403 3778 "-" "panscient.com" "-"
$
What were they looking for here? Whether I registered a .com of my name? I have, thanks. And how did they find the domain? Last time I checked, there were no links to johannburkard.com.
Of course it still does not ask for robots.txt. Yawn…