panscient.com = still bad bot

Posted 2007-06-07 in Spam by Johann.

Just when the alleged CEO of panscient.com says their crawlers currently obey the standard robots protocol, I find the following in my logs:

$ grep "\"panscient.com\"" <logfile>
38.99.203.110 johannburkard.de [02/Jun/2007:06:56:54 +0200] "GET /?johannburkard.com HTTP/1.1" 403 3778 "-" "panscient.com" "-"
$

What were they looking for here? That I registered a .com of my name? I have, thanks. How did they find the domain? Last time I checked there were no links to johannburkard.com.

Of course it still does not ask for robots.txt. Yawn…
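To make the "does it ever ask for robots.txt" check repeatable, here is a minimal Python sketch. It assumes the log format of the line shown above (IP, virtual host, timestamp, request, status, size, referer, user agent); the regular expression and the function names are my own and would need adjusting for other log formats.

```python
import re

# Assumed layout, taken from the sample line in this post:
# ip vhost [time] "METHOD path PROTO" status size "referer" "agent" ...
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) (?P<vhost>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def requests_by_agent(lines, agent):
    """Return the parsed fields of every line sent with exactly this user agent."""
    hits = []
    for line in lines:
        match = LOG_LINE.match(line)
        if match and match.group("agent") == agent:
            hits.append(match.groupdict())
    return hits

def asked_for_robots(hits):
    """True if any of the requests was for /robots.txt (query string ignored)."""
    return any(hit["path"].split("?", 1)[0] == "/robots.txt" for hit in hits)

# The one log line quoted in this post.
sample = [
    '38.99.203.110 johannburkard.de [02/Jun/2007:06:56:54 +0200] '
    '"GET /?johannburkard.com HTTP/1.1" 403 3778 "-" "panscient.com" "-"',
]

hits = requests_by_agent(sample, "panscient.com")
print(len(hits), asked_for_robots(hits))
```

Matching on the exact user agent string will miss a robots.txt request sent under a different agent, so grouping by IP address instead may give a fairer picture.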

15 comments

#1 2007-07-26 by Dan

What is their purpose? Are they hired to collect data about particular domains? I have many domains they've never looked at, but two similar domains of mine were both visited by them.

#2 2007-10-29 by jasper

What is interesting is that the bots first try running under "Java/1.6.0_02" (or similar iteration of java). If they are denied, the bots switch to "panscient.com" as their agent header.

#3 2008-01-03 by Jeremy Daley

Yeah, my site is not linked from anywhere either, yet somehow they found it and are sending blank POSTs to a PHP script of mine.

#4 2008-01-24 by Jonathan Baxter

Hi, the panscient bots had a "feature" whereby the original robots.txt request was sent with a "java" user-agent instead of panscient.com (a different piece of code handles the robots parsing, and it was not updated to use the panscient.com user-agent when the bots came out of beta).

This had the unfortunate effect of making it look like we ignored robots.txt, when in fact we obey all robot directives whether in meta-tags or in robots.txt. The problem has now been fixed.

If you have any problems at all with our bots, please email crawler@panscient.com. Your complaint will be addressed (e.g., the robots.txt user-agent issue was fixed as a result of an email to crawler@panscient.com).

Thanks,

Jonathan Baxter
CEO, Panscient

#5 2008-01-24 by Johann

And that took you five months to do, Jonathan?

Yeah, right.

Plus other people still say panscient does not obey robots.txt.

#6 2008-01-24 by Johann

Dan,

I think they are just looking for people's names and what they do. Basically.

Oh, and maybe some brand monitoring: Panscient primarily crawls the web looking for corporate information, such as company names.

#7 2008-01-24 by Dan

Thanks for the info.
Dan

#8 2008-07-25 by Dave84620

Hi,
I also found this entry in our web logs, but I can say that this crawler did download the robots.txt and nothing else after that.
Perhaps it was different in 2007, but I can't say anything bad about it.

Dave

#9 2008-08-23 by Bob Kaufman



Oh, and it never asked for robots.txt

#10 2008-10-08 by Johann

sagor,

was that with a Disallow entry for panscient in robots.txt? Interesting...
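For reference, here is a sketch of what such a Disallow entry looks like and what a compliant crawler should conclude from it, using Python's standard urllib.robotparser. The robots.txt content, the helper name and the example paths are assumptions for illustration, not anyone's actual file. Obeying robots.txt is voluntary, so this only shows what a well-behaved bot would decide, not what panscient actually does.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: shut out panscient.com entirely,
# and keep every other crawler away from /login.
ROBOTS_TXT = """\
User-agent: panscient.com
Disallow: /

User-agent: *
Disallow: /login
"""

def allowed(agent, path, robots_txt=ROBOTS_TXT):
    """Return True if a compliant crawler with this agent may fetch the path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)

print(allowed("panscient.com", "/index.html"))
print(allowed("SomeOtherBot", "/index.html"))
print(allowed("SomeOtherBot", "/login"))
```

If the bot's requests show up in the log for paths where this returns False, it is ignoring the file.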

#11 2008-10-08 by sagor

I see as of today, they tend to get robots.txt and stop there, but not really. They seem to be fishing for my phpBB3 as well. Why would a bot look for a forum if it was legit?

38.104.58.118 - - [04/Oct/2008:05:15:21 -0400] "GET /phpBB3/ HTTP/1.1" 200 21345 "-" "panscient.com"

#12 2009-08-25 by Greg B

Yeah, I saw some pretty questionable stuff from panscient.com myself recently: a lot of 404 errors from requests for paths like:

/text/javascript
/text/html
/Opera/
/ f!==
/application/x-www-form-urlencoded

At first glance, it looks like a hacker attack looking for ways into my web system.

Their spider is obviously not just "crawling links" if it's attempting to go to pages that I *know* do not exist and generating all of those 404 errors. *VERY* suspicious in my book.

#13 2009-11-04 by Desmond O'Toole

Hi, I got a visit from panscient.com on Sunday, 1st Nov 2009. They hit every single page and ignored my robots file.

#14 2009-11-18 by Steve Moss

The panscient bot also ignored my robots.txt. The file disallows a login page, which is not worth crawling, but panscient went and crawled it anyway.

#15 2009-11-18 by Johann

Guys,

please no more "I saw panscient.com in my logs" comments. Everybody knows it's a bad bot and should be blocked. If you haven't blocked panscient yet, go do it.
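How you block it depends on your server. As one possible sketch, assuming a Python WSGI application (this post does not say how the 403 in the log above was produced), a small middleware can refuse any request whose user agent contains a blocked token. The function name and the token list are hypothetical.

```python
def block_bad_agents(app, blocked=("panscient.com",)):
    """Wrap a WSGI app and answer 403 to any user agent containing a blocked token."""
    def middleware(environ, start_response):
        agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(token in agent for token in blocked):
            body = b"Forbidden\n"
            start_response("403 Forbidden", [
                ("Content-Type", "text/plain"),
                ("Content-Length", str(len(body))),
            ])
            return [body]
        # Everything else goes through to the wrapped application.
        return app(environ, start_response)
    return middleware

def hello_app(environ, start_response):
    """Stand-in application used only to demonstrate the wrapper."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello\n"]

application = block_bad_agents(hello_app)
```

Since the bot has reportedly also shown up with a "Java/..." user agent, blocking by user agent alone may not be enough.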
