robots.txt question

Jack quiet_celt at yahoo.com
Mon Jan 17 06:26:59 CST 2011


Let me explain a bit.

To exclude all robots that respect the robots.txt file:

User-agent: *
Disallow: /

To exclude just one directory and its subdirectories, say, the /aggregator/ directory:

User-agent: *
Disallow: /aggregator/

To disallow a specific robot, you need to know what it calls itself; ia_archiver, for example, is the Internet Archive's Wayback Machine crawler.

To allow the Internet Archive bot, you'd add a record like this:

User-agent: ia_archiver
Disallow:

To block ia_archiver from visiting:

User-agent: ia_archiver
Disallow: /
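
If you want to check how a rule set like this gets read, here is a minimal sketch using Python's standard urllib.robotparser module. The rule text combines the "allow ia_archiver" and "exclude everyone" records from above; "SomeOtherBot" and the /index.html path are made-up names for illustration, and other parsers may not behave identically.

```python
from urllib.robotparser import RobotFileParser

# Allow the Internet Archive bot everywhere, block every other robot.
RULES = """\
User-agent: ia_archiver
Disallow:

User-agent: *
Disallow: /
"""


def build_parser(text):
    """Parse robots.txt text held in a string (no network access needed)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(RULES)

# ia_archiver matches its own record, whose empty Disallow permits everything.
print(parser.can_fetch("ia_archiver", "/index.html"))   # True

# "SomeOtherBot" is a hypothetical name; it falls through to the "*" record.
print(parser.can_fetch("SomeOtherBot", "/index.html"))  # False
```

Note that the User-agent and Disallow lines of one record sit on consecutive lines; blank lines are what separate one record from the next.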

You can have as many lines like this as you want. You can disallow all robots from everywhere and then allow only those you want, or block certain robots from certain parts of the site. You can block directories and their subdirectories, or individual files. If you have numerous "aggregator" paths in various subdirectories that you want to block, you need to list them all.

Like this:

User-agent: *
Disallow: /aggregator/
Disallow: /foo/aggregator/
...
Disallow: /hidden/aggregator/
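
If the list gets long, it may be easier to generate it. The following is only a sketch: disallow_record is a helper name I made up, the prefix list is taken from the example above, and "ExampleBot" and the /bar/ path are placeholders. It builds one record and then checks it with Python's standard urllib.robotparser.

```python
from urllib.robotparser import RobotFileParser


def disallow_record(agent, prefixes):
    """Build one robots.txt record: a User-agent line plus one Disallow per prefix."""
    lines = [f"User-agent: {agent}"]
    lines += [f"Disallow: {prefix}" for prefix in prefixes]
    return "\n".join(lines) + "\n"


# Every location of an aggregator directory has to be listed explicitly.
PREFIXES = ["/aggregator/", "/foo/aggregator/", "/hidden/aggregator/"]
record = disallow_record("*", PREFIXES)
print(record)

parser = RobotFileParser()
parser.parse(record.splitlines())

# A listed prefix is blocked for any robot that honors the file.
print(parser.can_fetch("ExampleBot", "/foo/aggregator/feed"))  # False

# /bar/aggregator/ was never listed, so it is still allowed.
print(parser.can_fetch("ExampleBot", "/bar/aggregator/feed"))  # True
```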

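Your syntax looks wonky to me, missing the final "/". Without it, the rule is a plain prefix match, so "Disallow: /aggregator" also covers paths such as /aggregator.html.
User-agent says who to block, and Disallow says what to block. All of this assumes well-behaved robots; the file does nothing against robots that ignore it. It is not a security device, just a polite sticky note.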
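
On the trailing slash: under the original robots.txt rules, a Disallow value is matched as a plain prefix of the URL path. Here is a small sketch of what that means in practice, again using Python's standard urllib.robotparser. The helper name allowed, the robot name "ExampleBot", and the test paths are my own placeholders, and crawlers that support wildcard extensions may add behavior this parser does not model.

```python
from urllib.robotparser import RobotFileParser


def allowed(rules, path, agent="ExampleBot"):
    """Return True if `agent` may fetch `path` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, path)


NO_SLASH = "User-agent: *\nDisallow: /aggregator\n"
WITH_SLASH = "User-agent: *\nDisallow: /aggregator/\n"

# Without the final "/", anything that merely starts with /aggregator is blocked.
print(allowed(NO_SLASH, "/aggregator.html"))       # False
print(allowed(WITH_SLASH, "/aggregator.html"))     # True

# Both forms block what lives inside the directory.
print(allowed(NO_SLASH, "/aggregator/sources"))    # False
print(allowed(WITH_SLASH, "/aggregator/sources"))  # False

# Neither form reaches a copy under another directory.
print(allowed(NO_SLASH, "/foo/aggregator/"))       # True
print(allowed(WITH_SLASH, "/foo/aggregator/"))     # True
```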

You might go here for more detailed info. I'm no expert for sure.

http://www.robotstxt.org/orig.html

Jack

--- On Sun, 1/16/11, Jonathan Hutchins <hutchins at tarcanfel.org> wrote:

From: Jonathan Hutchins <hutchins at tarcanfel.org>
Subject: robots.txt question
To: "KCLUG (E-mail)" <kclug at kclug.org>
Date: Sunday, January 16, 2011, 12:53 PM

I'm wondering about the syntax.  The example file from Drupal uses the format

Disallow: /aggregator

However, it says in the comments that only the root /robots.txt file is valid.  

From my understanding of the syntax, /aggregator does not block /foo/aggregator, so I need to either prepend "/foo" to everything, or use wildcards per the new Google/webcrawler extensions to the protocol.

If anybody can cite an on-line example that explains this, I'd be grateful.
