I'm wondering about the syntax. The example file from Drupal uses the format
Disallow: /aggregator
However, it says in the comments that only the root /robots.txt file is valid.
From my understanding of the syntax, /aggregator does not
block /foo/aggregator, so I need to either prepend "/foo" to everything, or use wildcards per the new Google/webcrawler extensions to the protocol.
If anybody can cite an on-line example that explains this, I'd be grateful.
Let me explain a bit.
To exclude all robots that respect the robots.txt file:

User-agent: *
Disallow: /

To exclude just one directory and its subdirectories, say, the /aggregator/ directory:

User-agent: *
Disallow: /aggregator/

To disallow a specific robot you need to know what it calls itself; ia_archiver is the Wayback Machine. To allow the Internet Archive bot you'd make a section like this:

User-agent: ia_archiver
Disallow:

To block ia_archiver from visiting:

User-agent: ia_archiver
Disallow: /
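If you want to sanity-check rules like these, here's a minimal sketch using Python's standard urllib.robotparser (the robot names and example.org URLs are just placeholders for the illustration above, and robotparser implements the original prefix-matching rules, not the newer wildcard extensions):

from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /

User-agent: ia_archiver
Disallow:
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# Everything is off limits to an arbitrary robot...
print(rp.can_fetch("SomeOtherBot", "http://example.org/page.html"))  # False
# ...but the empty Disallow lets ia_archiver in everywhere.
print(rp.can_fetch("ia_archiver", "http://example.org/page.html"))   # True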
You can have as many groups like these as you want. So you can disallow all robots from everywhere and then allow only the ones you want, block certain robots from certain parts of the site, or block directories, subdirectories, or individual files. If you have numerous "aggregator" directories in various subdirectories you want to block, you need to list them all.
Like this:
User-agent: *
Disallow: /aggregator/
Disallow: /foo/aggregator/
...
Disallow: /hidden/aggregator/
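To convince yourself that every path really does have to be listed, here's a rough check with the same urllib.robotparser sketch as above (the /bar/aggregator/ path is just an invented example of one you forgot to list):

from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /aggregator/
Disallow: /foo/aggregator/
Disallow: /hidden/aggregator/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("*", "http://example.org/aggregator/"))      # False, listed
print(rp.can_fetch("*", "http://example.org/foo/aggregator/"))  # False, listed
print(rp.can_fetch("*", "http://example.org/bar/aggregator/"))  # True, never listed, so never blocked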
Your syntax looks wonky, missing the final "/". User-agent says which robot a group of rules applies to, and Disallow says what to block. This all assumes well-behaved robots; the file is useless against those that ignore it. It is not a security device, just a polite sticky note.
You might go here for more detailed info. I'm no expert for sure.
http://www.robotstxt.org/orig.html
Jack
Thanks for at least trying. Yes, I'm aware of robotstxt.org. My question was not regarding user-agents, but about whether a string such as /foo blocks only http://website.org/foo, or if it would block http://website.org/blah/foo. I think it would not, but I'm looking for documentation, references, or examples that specifically address that.
From reading Google's reference on robots.txt as extended by Google,
Microsoft, and Yahoo, the most obvious interpretation would be that I need to block /*foo, but again, no specific confirmation.
As for the trailing slash, again, no specific reference to that format so far.
BTW for anybody interested in the legal relevance of a /robots.txt file (basically zilch) please see http://www.robotstxt.org/faq/legal.html
Before anyone gets too cocky about the legal ramifications of robots.txt files, let me just warn you: the person who wrote that piece is not a lawyer, and it would still be irrelevant if he were. There is nothing preventing legal authorities from charging you under the CFAA because you violated a robots.txt file. What a judge will decide is yet another unknown.
Now you may say, or experts may say, that they'd have no case. That may all be true, but if you get arrested, you will: have an arrest record, lose time fighting an open-and-shut case, spend a bunch of money on an attorney to defend yourself, most likely not have the same kind of lawyers as Google, and possibly lose the first case and go to jail. Not to mention a whole lot of stress and other intangibles.
Do not get your legal advice from friends and strangers on the Internet. Talk to a lawyer if you have questions.
Jack
Actually, I was wrong in my previous email: the syntax without the final "/" isn't broken, it just matches more.
A Disallow value is a path prefix measured from the site root. "Disallow: /aggregator" should block anything whose path starts with /aggregator, so /aggregator itself, /aggregator.html, and everything under /aggregator/. "Disallow: /aggregator/" with the final slash should block only what is under the /aggregator/ directory.
Without the slash you block directories and html pages by the name; with the final slash you block just the directory. Either way it is a prefix from the root, so /aggregator/ does not block /foo/aggregator/ or /long/deep/path/to/obscure/folder/aggregator/. To catch those you either list each path, or use a wildcard like Disallow: /*aggregator/, which is the Google/Microsoft/Yahoo extension and not part of the original protocol.
There's no reason to add more than one robots.txt file. You should only have one, at the site root, and put all your rules in there.
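Here's a rough way to see the prefix behavior for yourself, again using Python's standard urllib.robotparser (it implements the original prefix rules only, so it will not understand the wildcard extension; the URLs are invented for the example):

from urllib.robotparser import RobotFileParser

def blocked(rules, url):
    # True if the "*" user-agent is disallowed from url under these rules.
    rp = RobotFileParser()
    rp.parse(rules.splitlines())
    return not rp.can_fetch("*", url)

no_slash = "User-agent: *\nDisallow: /aggregator"
with_slash = "User-agent: *\nDisallow: /aggregator/"

print(blocked(no_slash, "http://example.org/aggregator.html"))    # True, prefix match on the name
print(blocked(no_slash, "http://example.org/aggregator/feed"))    # True
print(blocked(with_slash, "http://example.org/aggregator.html"))  # False, only the directory is covered
print(blocked(with_slash, "http://example.org/aggregator/feed"))  # True
print(blocked(with_slash, "http://example.org/foo/aggregator/"))  # False, a prefix never matches deeper paths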
You can test all of this for yourself: use wget to download from your site as a robot. You can also make wget ignore the robots.txt file, have it pretend to be any robot you like, or even make it your own robot that you allow to mirror your page.
Caveat: if you make wget ignore the robots.txt file, you should also add a pause so you don't hammer the site you are downloading or mirroring. Some sites specifically disallow wget in recursive mode to keep from getting hammered by downloads.
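Something along these lines works with GNU wget (example.org and the agent string are placeholders; check your wget man page for the exact options your version supports):

# Crawl as a specific robot; wget honors robots.txt in recursive mode by default.
wget --recursive --wait=2 --user-agent="ia_archiver" http://example.org/

# Ignore robots.txt, but keep the pauses so you don't hammer the server.
wget --recursive --wait=2 --random-wait -e robots=off http://example.org/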
Jack