May 26, 2008

Wow another corp snoop bot. see


It does not follow the robots.txt file, so you have to email someone to tell them to stop burning up your bandwidth. Huh?

I really hate these corp PR snoops that think you have to serve content to them.
I wonder if they ever thought about the fact that taking my content and serving it up to subscribers (charging for it) without my permission is a criminal copyright violation.


Anonymous said...

These guys are insidious, blowing right past my robots exclusion file - and with FAQs on their website that I have to email them to get excluded?

Hello, isn't that why we have a robots.txt that reads:

User-agent: *
Disallow: /

Anonymous said...

This robot also comes from these IPs:

Anonymous said...

Mean Dean:

I know I am late to the party and that this post is months old, but I am at war with Radian6 right now and researching them. Mean Dean, Radian6 DOES NOT follow robots.txt so that would do no good, and it would shut out ALL crawlers. Most of us do not want that to happen.

Bottom line: Radian6 bot/crawler/spider is non-compliant and I am convinced it is purposely set that way.
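
To make the robots.txt point concrete, here is a rough Python sketch using the standard library's robots.txt parser. It only shows what a *compliant* crawler would conclude from the rules; it does nothing against a bot that never reads the file. The targeted rule set and the `allowed` helper are illustrative assumptions, and the agent name R6_FeedFetcher is simply the one reported further down in these comments.

```python
from urllib.robotparser import RobotFileParser

# The blanket rule quoted above: every compliant crawler is shut out.
BLOCK_ALL = ["User-agent: *", "Disallow: /"]

# Hypothetical targeted rule: only a bot identifying as R6_FeedFetcher
# is asked to stay away. A non-compliant bot will simply ignore it.
TARGETED = ["User-agent: R6_FeedFetcher", "Disallow: /"]


def allowed(rules, user_agent, path="/index.html"):
    """Return True if a well-behaved crawler with this user agent may fetch path."""
    parser = RobotFileParser()
    parser.parse(rules)
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    for label, rules in (("block all", BLOCK_ALL), ("targeted", TARGETED)):
        for agent in ("Googlebot", "R6_FeedFetcher"):
            print(label, agent, allowed(rules, agent))
```

The blanket rule turns away the search engines too, which is the problem described above; the targeted rule spares them but still depends entirely on the bot choosing to obey.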

agent_buzz said...

I just did this in ModSecurity:

$ sudo vi modsecurity_crs_35_bad_robots.conf

Put this rule in:

SecRule REQUEST_HEADERS:User-Agent "radian6_default" "phase:2,t:none,t:lowercase,deny,log,auditlog,status:404,msg:'Rogue web site crawler',id:'990012',tag:'AUTOMATION/MALICIOUS',severity:'2'"

Restart Apache:

$ sudo /usr/local/apache2/bin/apachectl restart

There is a how-to for ModSecurity on my blog, if anybody needs it.
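
For anyone unsure what that rule actually matches, here is a small Python sketch of its logic as I understand it: the User-Agent header is lowercased (the `t:lowercase` transformation) and then searched for the pattern `radian6_default`. The function name and the sample header values are made up for illustration; this is not ModSecurity code.

```python
import re

# Pattern taken from the SecRule above; matched anywhere in the header.
RULE_PATTERN = re.compile(r"radian6_default")


def rule_matches(user_agent_header):
    """Mimic t:lowercase followed by a regex search on the User-Agent header."""
    transformed = user_agent_header.lower()
    return RULE_PATTERN.search(transformed) is not None


if __name__ == "__main__":
    # Sample header values are hypothetical.
    print(rule_matches("Radian6_Default (crawler)"))   # mixed case still matches
    print(rule_matches("Mozilla/5.0 (Windows; U)"))    # ordinary browser does not
```

Because of the lowercasing step, the rule catches the agent string regardless of how the bot capitalizes it.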

Pop Adrian said...

Stealing data is a criminal copyright violation, but crawling isn't an offense. In fact, many companies use crawlers, and some companies build nothing but crawlers.

There are a lot of reasons for crawling: indexing for a search engine (the main one), grabbing a company's ads, and including them on other sites. For example, I've got a heavily visited site with real estate ads. A real estate company wanted to publish its ads on my site, so I made the company an offer that includes a regular crawler that runs on its site and collects its ads. There are many other reasons for crawling too.

Here I am speaking as a data provider whose main activity is crawling.
Of course, when a robot disrupts the normal operation of your site, you can hold the robot's owner responsible.

tmaster said...

It's clear you don't understand what's going on.

If a site is put up for users and authorized bots only, and you pretend to be a user to gain access, you are in violation of US law because you have used false information to gain access to a computer system.

It has nothing to do with copyright law.

Copyright violation comes in when you republish the data as the Corp Snoops are doing. So they are in fact violating 2 laws.

Anonymous said...

I've got the latest spider on my blog too, called R6_FeedFetcher.

So what's the best robots.txt to protect against them?

Anonymous said...


Yep, we've also seen this. I must confess that I haven't tried robots.txt, as most people say the bot ignores it, so we use the "user agent" string to serve up an error page (403).

We use Apache 1.3, so you have to amend or create the .htaccess file in the root of the directory you are interested in (probably your HTML root).

Example .htaccess:

# The rewrite rules are ignored unless the engine is switched on
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^R6_
RewriteRule .* - [F]

This basically says: where the user agent starts with "R6_", serve up an error page (403 Forbidden).

You can use regular expressions, but use them sparingly, as you don't want to evict the good web-browsing folk!

Hope this helps!
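
If you want to check a pattern before putting it on a live site, a quick Python sketch like the one below can help. It assumes the same anchored, case-sensitive pattern `^R6_` as the .htaccess example; the `is_blocked` helper and the browser string are made up for illustration.

```python
import re

# Same pattern as the RewriteCond above: user agent must START with "R6_".
# The match is case-sensitive, as RewriteCond is without the [NC] flag.
BLOCK_PATTERN = re.compile(r"^R6_")


def is_blocked(user_agent):
    """Return True if this user agent would get the 403 page."""
    return BLOCK_PATTERN.search(user_agent) is not None


if __name__ == "__main__":
    samples = [
        "R6_FeedFetcher",                        # reported earlier in this thread
        "radian6_default",                       # agent from the ModSecurity rule
        "Mozilla/5.0 (compatible; SomeBrowser)", # hypothetical ordinary visitor
    ]
    for ua in samples:
        print(ua, "->", "blocked" if is_blocked(ua) else "allowed")
```

Note that this pattern would let an agent named radian6_default straight through, so if both names show up in your logs you would need a second condition for it.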

Bodzio vel Ten-co-wie-jak-każdy-pączek-smakuje. said...

I've read on the net that this may not work either: they will just change their signature to some meaningless value. How about putting their IP range into a s...t hole with
Deny from
in your .htaccess?
This should put them at "ease".

Bodzio vel Ten-co-wie-jak-każdy-pączek-smakuje. said...

Update: After a few days of being rejected from their 142.166.x.x range, they have now moved to a new one:
Anyway, if you want to block them, update your .htaccess with the following:

deny from
deny from
These bastards eat up a LOT of bandwidth.
I am thinking of rerouting them to some porn site instead of simple blocking.
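
For anyone who wants to test an address against a range before touching .htaccess, here is a rough Python sketch. It assumes the 142.166.x.x range mentioned above can be written as the network 142.166.0.0/16; the individual addresses in the example and the `is_denied` helper are made up, and the newer range is not included because I have not listed it here.

```python
import ipaddress

# Assumption: "142.166.x.x" is treated as the /16 network below.
BLOCKED_NETWORKS = [ipaddress.ip_network("142.166.0.0/16")]


def is_denied(ip):
    """Return True if the address falls inside any blocked network."""
    addr = ipaddress.ip_address(ip)
    return any(addr in network for network in BLOCKED_NETWORKS)


if __name__ == "__main__":
    # Both sample addresses are hypothetical.
    for ip in ("142.166.3.122", "142.167.0.1"):
        print(ip, "->", "denied" if is_denied(ip) else "allowed")
```

Adding another range is just one more entry in the list, which matters given how quickly they seem to move.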