This much Nutch is too much Nutch

Posted 2007-09-17 in Spam by Johann.

Nutch is like giving free TNT sticks to children.

In theory it could be used for useful things.

In reality, most of its users appear to be scrapers and wannabe-search-engines.

Nutch bot user agents

I searched half a year's worth of logfiles for Nutch crawlers:

  • //Nutch-0.9-dev (compatible; SynooBot/0.9; http://www.synoo.com/search/bot.html)
  • Bigsearch.ca/Nutch-1.0-dev (Bigsearch.ca Internet Spider; http://www.bigsearch.ca/; info@enhancededge.com)
  • Bloodhound/Nutch-0.9 (Testing Crawler for Research - obeys robots.txt and robots meta tags ; http://balihoo.com/index.aspx; robot at balihoo dot com)
  • CazoodleBot/Nutch-0.9-dev (CazoodleBot Crawler; http://www.cazoodle.com/cazoodlebot; cazoodlebot@cazoodle.com)
  • CS/Nutch-0.9
  • disco/Nutch-0.9 (experimental crawler ... please email imagine@gmail.com if problems observed; imagine@gmail.com)
  • disco/Nutch-0.9 (experimental crawler ... please email imagine@gmail.com if problems observed; nedrocks@gmail.com)
  • Firefox/Nutch-0.8 (Test Robot; winyio@ustc)
  • HD nutch agent/1.0
  • HPL/Nutch-0.9
  • ilial/Nutch-0.9 (Ilial, Inc. is a Los Angeles based Internet startup company. For more information please visit http://www.ilial.com/crawler; http://www.ilial.com/crawler; crawl@ilial.com)
  • ilial/Nutch-0.9 (Ilial, Inc. is a Los Angeles based Internet startup company.; http://www.ilial.com/crawler; crawl@ilial.com)
  • ImageShack/Nutch-1.0-dev
  • infomisa/Nutch-0.9
  • Krugle/Krugle,Nutch/0.8+ (Krugle web crawler; http://corp.krugle.com/crawler/info.html; webcrawler@krugle.com)
  • Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en) Opera 8.01/Nutch-0.8.1 (http://lucene.apache.org/nutch/about.html; http://lucene.apache.org/nutch/bot.html; mail@dev.null)
  • Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en) Opera 8.01/Nutch-0.9 (http://lucene.apache.org/nutch/about.html; http://lucene.apache.org/nutch/bot.html; mail@dev.null)
  • MQBOT/Nutch-0.9-dev (MQBOT Nutch Crawler; http://vwbot.cs.uiuc.edu; mqbot@cs.uiuc.edu)
  • Nutch/Nutch-0.8.1 (Nutch; Nutch; Nutch)
  • Nutch/Nutch-0.9
  • NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
  • NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
  • NutchCVS/Nutch-0.9 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
  • NutchForUW/Nutch-0.9 (Nutch agent; http://lucene.apache.org/nutch; test at nutch dot org)
  • Pluggd/Nutch-0.9 (Pluggd automated crawler; http://www.pluggd.com; support at pluggd dot com)
  • sait/Nutch-0.9 (SAIT TEST AGENT; http://www.sait.samsung.co.kr)
  • tellbaby/Nutch-0.9 (www.tellbaby.com)
  • tellbaby/Nutch-1.0-dev (http://www.tellbaby.com)
  • TestCrawler/Nutch-0.9 (Testing Crawler for Research ; http://chitchit.org/TestCrawler.html; amitjain at spro dot net)
  • WannesWeb-Group Agent/Nutch-0.9 (Agent de recherche pour le moteur WannesWeb-Group; www.wannesweb-Greoup.com; agent@wannesweb-group.com)
  • Webscope/Nutch-0.9-dev (http://www.cs.washington.edu/homes/mjc/agent.html)
  • wectar/Nutch-0.9 (nectar extracted form the glorious web; http://goosebumps4all.net/wectar; see website)
  • yggdrasil/Nutch-0.9 (yggdrasil biorelated search engine; www dot biotec dot tu minus dresden do de slash schroeder; heiko dot dietze at biotec dot tu minus dresden dot de)

Nutch? No, thanks.

Some of these user agent strings do not even contain an email address. Others are attempting to fake existing user agents.

Stopping Nutch

Nutch-based crawlers can be blocked by filtering out all user agents containing Nutch or nutch. In lighttpd, this looks as follows:

$HTTP["useragent"] =~ "(Nutch|nutch)" {
    url.access-deny = ( "" )
}
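The same filter can be sanity-checked outside the web server. A minimal sketch in Python (used here purely for illustration; the actual blocking happens in lighttpd, and the sample agents are taken from the list above, one of them shortened):

```python
import re

# The pattern from the lighttpd rule: deny any user agent that
# contains "Nutch" or "nutch" anywhere in the string.
nutch_re = re.compile(r"(Nutch|nutch)")

agents = [
    "NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en) Opera 8.01/Nutch-0.9",
    "Mozilla/5.0 (Windows NT 5.1; rv:1.8.1) Gecko/20061010 Firefox/2.0",
]

for ua in agents:
    verdict = "deny" if nutch_re.search(ua) else "allow"
    print(f"{verdict}: {ua}")
```

The first two agents are denied and the plain Firefox one is allowed; note that `re.search` also catches Nutch embedded mid-string, as in the faked Opera agent.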

10 comments

#1 2007-10-11 by Sean Dean

Bigsearch.ca/Nutch-1.0-dev (Bigsearch.ca Internet Spider; http://www.bigsearch.ca/; info@enhancededge.com)

There is nothing "wannabe-search-engine" about my search engine. The index might be smaller than your standard Google and Yahoo index, but it's a work in progress.

Your last comment made sense: if you don't like Nutch, then just block it. But then, if someone really wants to spam-crawl your site, they (if they have any smarts) will remove Nutch altogether from the user agent.

P.S. Nice site, which I found in my legit search index :)

#2 2008-01-24 by Sean Dean

Johann,

I decided to keep Nutch in part of my user agent string as a way to credit the developers who worked hard to code it. I also try to give back to the Nutch community with support on the mailing lists, and bug reports.

In response to your other observation, you're right that I don't have this page indexed in my live search index (the one used on the site), but I have it sitting in another (bigger) index that's waiting for deployment. That's being held back due to the lack of hard disk space currently on my servers.

#3 2008-01-24 by Johann

Sean,

if you look carefully, I wrote most of its users. If you are not a wannabe-search-engine, congratulations. But then why send that Nutch user agent with your requests anyway?

#4 2008-01-24 by Letyton Jay

I don't really have anything against Nutch-driven crawlers. I have found a lot of them honour the robots.txt standard.
But I have noticed a couple that don't, and a few that really take the piss!

I am currently building a bot-trap to catch dishonourable bots and publish them in a public name-and-shame list.

#5 2008-01-24 by Major Chai

Instead of blindly blocking Nutch, why not block based on the merit of each, e.g. how often it crawls your site and how many threads it uses?

Why send that Nutch user agent with your requests anyway?

Because my idea of Nutch.biz is to use the official Nutch user agent

#6 2008-01-24 by Johann

Major,

do you think I have the time to maintain an up-to-date list of Nutch crawlers and see what each one of these is doing?

I have a much simpler solution, and it's powered by the regular expression (Nutch|nutch). And unless Nutch changes its name, this regexp will work forever.

#7 2008-05-01 by Johann

tom,

no, I just block them all.

#8 2008-05-01 by tom

Hi

Just block the bad ones by IP address and regular expression!

#9 2008-05-13 by Tim

Nutch has no configuration settings to turn off its 'niceness', like obeying robots.txt or not hitting a server multiple times per second, which means that anyone using it to be rude is modifying the source code to do so. It's much easier to change the source to remove 'nutch' from the agent string (yes, you have to modify the source code to do that also) than it is to cause Nutch to behave rudely. So in essence, by banning Nutch with a regex, you're mostly banning people who don't want to go out of their way to _not_ give credit back to the open-source community that developed Nutch.

The jerks who misuse the crawler will keep on doing so. As soon as they're being blocked by too many sites for their agent string, they'll just change it. You aren't stopping anyone; you're just preventing the acknowledgment of people's hard work for free.

What percentage of agent strings that have hit your site in the last 6 months have Nutch in them? 0.01% or so? It's easy to claim Nutch is bad because it's easy to search your logs for a single string; it takes more work to actually identify problems. I'll bet more of the crawlers hitting your site are showing up as Firefox or IE and you can't even tell.

#10 2010-01-31 by Gianpaolo

Just a question (and I won't ask you why you removed all my comments in other posts): which script would be a good one to start a search engine with? I don't want to be the next Google, but I want to index all the websites on my server.
