robots.txt question

Mon Jan 17 11:14:43 CST 2011

Actually, I was wrong in my previous email.

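robots.txt rules are matched as prefixes of the URL path, starting from the site root. They are not matched against directory names at arbitrary depth.

"Disallow: /aggregator/" blocks the top-level /aggregator/ directory and everything under it.

"Disallow: /aggregator" blocks every path that begins with /aggregator, so it covers the /aggregator/ directory and also files such as /aggregator.html.

Without the final slash you block directories and pages whose path starts with that name. With the final slash you block just the directory and its contents.

There's no reason to add more than one robots.txt file. Crawlers only read the one at the site root, so put all your rules in there.

/aggregator/ blocks:
/aggregator/,
/aggregator/anything/below/it,
etc.

It does not block:
/foo/aggregator/,
/long/deep/path/to/obscure/folder/aggregator/,
etc.

For those you need one rule per prefix (for example "Disallow: /foo/aggregator/"), or a wildcard rule such as "Disallow: /*/aggregator/" for crawlers that support the wildcard extension.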
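
A quick way to check how a rule matches, without touching a live site, is Python's standard-library robots.txt parser. This is only a sketch: urllib.robotparser implements plain prefix matching and does not understand wildcards, so it shows how the original protocol behaves, not how any particular crawler behaves. The helper name blocked_paths and the test path /aggregator/feed are made up for illustration.

```python
from urllib.robotparser import RobotFileParser


def blocked_paths(rules, paths, agent="*"):
    """Return the subset of paths that the given robots.txt text disallows."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return [p for p in paths if not parser.can_fetch(agent, p)]


# Sample paths; /aggregator/feed is a hypothetical page under the directory.
PATHS = [
    "/aggregator/",
    "/aggregator/feed",
    "/aggregator.html",
    "/foo/aggregator/",
    "/long/deep/path/to/obscure/folder/aggregator/",
]

WITH_SLASH = "User-agent: *\nDisallow: /aggregator/\n"
WITHOUT_SLASH = "User-agent: *\nDisallow: /aggregator\n"

if __name__ == "__main__":
    print("with slash:   ", blocked_paths(WITH_SLASH, PATHS))
    print("without slash:", blocked_paths(WITHOUT_SLASH, PATHS))
```

With this parser, only paths that start with the rule text are reported as blocked, so the nested /foo/aggregator/ paths come back as allowed under both rules.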

You can test all of this for yourself. You can use wget to download from your site the way a robot would. You can also make wget ignore the robots.txt file, have it pretend to be any robot you like, or even make it your own robot that you allow to mirror your page.

One caveat: if you make wget ignore the robots.txt file, you should also add a pause between requests so you don't hammer the site you are downloading or mirroring. Some sites specifically disallow wget in recursive mode to keep the site from getting hammered by downloads.
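
As a rough sketch of the wget options involved, the helper below builds a command line for a recursive download with a pause between requests. The function name wget_mirror_command, the URL, and the agent string MyTestBot/1.0 are placeholders, not anything from your site. The flags themselves (--recursive, --wait, --random-wait, --user-agent, and -e robots=off) are standard wget options.

```python
import shlex


def wget_mirror_command(url, user_agent=None, obey_robots=True, wait_seconds=2):
    """Build a wget argument list for a polite recursive download."""
    # --wait pauses between requests; --random-wait varies that pause.
    cmd = ["wget", "--recursive", f"--wait={wait_seconds}", "--random-wait"]
    if user_agent:
        # Sets the User-Agent header that wget sends with each request.
        cmd.append(f"--user-agent={user_agent}")
    if not obey_robots:
        # Tells wget to skip robots.txt entirely; keep the wait in place.
        cmd += ["-e", "robots=off"]
    cmd.append(url)
    return cmd


if __name__ == "__main__":
    # Placeholder URL and agent name; substitute your own site.
    cmd = wget_mirror_command(
        "http://www.example.com/",
        user_agent="MyTestBot/1.0",
        obey_robots=False,
    )
    print(shlex.join(cmd))
```

The script only prints the command. To actually run it, pass the list to subprocess.run(cmd, check=True), and only against a site you are allowed to mirror.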

Jack

--- On Sun, 1/16/11, Jonathan Hutchins <hutchins at tarcanfel.org> wrote:

From: Jonathan Hutchins <hutchins at tarcanfel.org>
Subject: robots.txt question
To: "KCLUG (E-mail)" <kclug at kclug.org>
Date: Sunday, January 16, 2011, 12:53 PM

I'm wondering about the syntax.  The example file from drupal uses the format

Disallow: /aggregator

However, it says in the comments that only the root /robots.txt file is valid.  

From my understanding of the syntax, /aggregator does not
block /foo/aggregator, so I need to either prepend "/foo" to everything, or
use wildcards per the new google/webcrawler extensions to the protocol.

If anybody can cite an on-line example that explains I'd be grateful.