Nov 24, 2006

www.exalead.com Violates robots.txt

The www.exalead.com website runs a robot that ignores your robots.txt file, takes a snapshot of your website, and then posts it as a thumbnail on its site.
It doesn't matter if you block all images from bots, like this:

User-agent: *
Disallow: /images/

Exalead.com refuses to abide by the commands in robots.txt.
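
For reference, here is a minimal sketch of the check a well-behaved crawler would be expected to make before fetching anything under /images/, using Python's standard urllib.robotparser module. The bot name "ExampleBot" and the example.com URLs are placeholders chosen for illustration; they are not anything Exalead actually uses.

```python
from urllib.robotparser import RobotFileParser

# The same rules quoted in the post above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /images/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # "ExampleBot" and the URLs are hypothetical placeholders.
    for url in ("http://www.example.com/images/logo.png",
                "http://www.example.com/index.html"):
        print(url, is_allowed(ROBOTS_TXT, "ExampleBot", url))
```

Under these rules the image URL is refused and the page URL is allowed, so a compliant thumbnail bot would have to render the page without its images.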

5 comments:

ExaleadGuy said...

A lot of webmasters do not want their images to be indexed by search engines but do want them to be included in the thumbnails of their pages.
In most cases, webmasters actually use robots.txt to prevent indexing, not downloading.
On the other hand, we understand that some webmasters do not want their images used even for thumbnails, which is why we have decided to give them a way to opt out by using:
<meta name="robots" content="nothumbnail">
If thumbnails become common in search engines, webmasters will certainly take them into account when writing their robots.txt files, and we will be able to rely on robots.txt again. For the moment, though, doing so would be disappointing for both webmasters and end users.
Hope you will understand our point of view.
Regards

tm said...

No, we do not understand.

The existing standard allows webmasters to allow your bot to load images and tell everyone else not to. I see no confusion.
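
As a rough illustration of that point, the sketch below feeds Python's standard urllib.robotparser a robots.txt that exempts one named bot and blocks /images/ for everyone else. The agent names "ExampleThumbBot" and "OtherBot" and the example.com URL are made up for the example; an empty Disallow line is the original convention for "nothing is off limits".

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: one named bot may load everything,
# every other bot is kept out of /images/.
PER_AGENT_ROBOTS_TXT = """\
User-agent: ExampleThumbBot
Disallow:

User-agent: *
Disallow: /images/
"""


def can_load(user_agent: str, url: str) -> bool:
    """Check url against PER_AGENT_ROBOTS_TXT for the given agent."""
    parser = RobotFileParser()
    parser.parse(PER_AGENT_ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    image = "http://www.example.com/images/photo.jpg"
    for agent in ("ExampleThumbBot", "OtherBot"):
        print(agent, can_load(agent, image))
```

A webmaster who wants thumbnails from one service can therefore say so explicitly, without any new directive.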

What you are trying to do is force your thumbnail system on everyone and require them to opt out, even if they have already blocked bots from loading images.

You are violating the robots.txt rules.

You know very well you can add your own commands like 'nothumbnail' to the robots.txt file.

ExaleadGuy said...

I would totally agree with you if every webmaster in the world knew about Exalead. I would like that, but unfortunately it is not yet the case.

Let's take a very basic situation. Mr Foo, a fan of Exalead, really likes the site of Mr Bar, who does not know Exalead. Mr Bar has copyrighted images on his site and would not like them to be accessible through Google. However, he doesn't mind his pages being thumbnailed; he even prefers it, because it makes his site more fun and more attractive to his visitors.

He is not really familiar with search engines, but a friend told him that he should certainly prevent these images from being indexed by adding
User-Agent: *
Disallow: /images
Mr Bar thought it was a good idea and did it, but he didn't know that by doing so he was also preventing the thumbnail bot from generating a "correct" thumbnail of his pages.
And so Mr Bar made Mr Foo angry and lost a visitor.

And there are a lot of Mr Bars and Mr Foos...

To summarize: in the common understanding, robots.txt applies to indexing.
Thumbnails are a new application on the web, which is why we use an opt-out, as Google did with the noarchive attribute. That is also an opt-out, because the concept of archiving was new and only a few webmasters knew about it at the time.

If we made it opt-in, we would receive far more feedback from angry webmasters who found that we had made a very poor thumbnail of their pages because we did not include their images, even though THEY had specified it, simply because they did not understand all the implications when they wrote their robots.txt file.

Thanks for having read this long post :)

Is our point clearer to you?

tm said...

I fail to see how Google using the NOARCHIVE meta tag violates the robots.txt format.

NOARCHIVE is a data storage command; it has nothing to do with allowing a bot to read data.

Anyway, why would I want to use robots=nothumbnail when I might want thumbnails on other sites but not on Exalead?

I fail to see what any of this has to do with why you don't allow nothumbnail in robots.txt.

François said...

@ExaleadGuy

We have better things to do than deal with your useless bot.

I am not going to add a line in my pages to tell your robot to go elsewhere.

It is your job to follow the rules in robots.txt.

Until you understand that and comply, the only way to deal with you is to ban your IP addresses:

deny from 193.47.80.0/24
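
To make clear what that /24 range covers, here is a small sketch using Python's standard ipaddress module. It only tests whether an address falls inside the block quoted above; the sample addresses are arbitrary, and nothing here confirms which addresses the crawler really uses.

```python
import ipaddress

# The range from the deny rule above: 193.47.80.0 through 193.47.80.255.
BLOCKED_NETWORK = ipaddress.ip_network("193.47.80.0/24")


def is_blocked(address: str) -> bool:
    """Return True if address falls inside BLOCKED_NETWORK."""
    return ipaddress.ip_address(address) in BLOCKED_NETWORK


if __name__ == "__main__":
    # Arbitrary sample addresses, for illustration only.
    for addr in ("193.47.80.17", "193.47.81.1"):
        print(addr, is_blocked(addr))
```

The rule matches 256 addresses, so a crawler operating from a neighbouring range such as 193.47.81.x would not be caught by it.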