Robots, also called spiders, visit web sites to discover and analyse their content. Depending on who operates the robot, the content discovered may be indexed for use by search engines or harvested for e-mail addresses.
Search engines will look in the root directory of your domain for a file called robots.txt. This file tells the robot which files it is allowed to index. Entries may refer to individual files or to whole directories. The problem with this mechanism is that an entry which says "don't index this directory" is too much of a temptation for the curious: it's a good place to start for those interested in causing mischief on your site. Compliance is also voluntary, so not all robots obey the rules in the robots.txt file.
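As a rough illustration of where a robot looks, the sketch below derives the robots.txt address from any page address on the same site. The function name robots_url and the example.com address are made up for this example; only the standard urllib.parse module is used.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep only the scheme and host; the path is always /robots.txt,
    # and any query string or fragment is dropped.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Hypothetical page address, used only for illustration.
print(robots_url("http://www.example.com/articles/seo/robots.html"))
# prints http://www.example.com/robots.txt
```

Whatever page the robot starts from, the file it asks for is the one at the top level of the host.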
The robots.txt file consists of records, each made up of two parts: a User-agent line and one or more Disallow lines. Every line has the format:
<field>: <value>
User-agent
The User-agent line specifies the robot, for example:
User-agent: googlebot
The wildcard character * may also be used to specify all robots:
User-agent: *
Disallow
The second part of a record consists of the disallow directive lines. These lines specify files and/or directories which are not to be indexed, for example:
Disallow: /cgi-bin/
This stops spiders from indexing the cgi-bin directory. If the value of the Disallow line is left blank, then all files on the server may be retrieved.
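To see how a well-behaved robot might apply these rules, here is a minimal sketch using Python's standard urllib.robotparser module. The rule set, the robot name otherbot, and the file names are invented for the example; a real robot would load the rules from the site instead of a string.

```python
from urllib.robotparser import RobotFileParser

# Example rules: googlebot is kept out of /cgi-bin/,
# every other robot may retrieve everything (blank Disallow).
RULES = """\
User-agent: googlebot
Disallow: /cgi-bin/

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

print(parser.can_fetch("googlebot", "/cgi-bin/search.pl"))  # False
print(parser.can_fetch("googlebot", "/index.html"))         # True
print(parser.can_fetch("otherbot", "/cgi-bin/search.pl"))   # True
```

Note that the check is made by the robot itself. Nothing on the server enforces the answer, which is why a badly behaved robot can simply ignore it.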
Comments
Any line in the robots.txt file that begins with # is treated as a comment and is ignored by robots.
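The line format and the comment rule can be sketched together in a few lines of Python. This is an illustrative reading of the format described here, not a complete parser; the function name parse_line is hypothetical, and lower-casing the field name is an assumption made so that User-agent and user-agent compare equal.

```python
def parse_line(line: str):
    """Return (field, value) for a directive line, or None for comments and blanks."""
    stripped = line.strip()
    # Blank lines and lines beginning with # carry no directive.
    if not stripped or stripped.startswith("#"):
        return None
    field, sep, value = stripped.partition(":")
    if not sep:
        return None  # no colon, so not a "field: value" line
    # Assumption: field names are compared without regard to case.
    return field.strip().lower(), value.strip()


sample = [
    "# keep robots out of the scripts directory",
    "User-agent: *",
    "Disallow: /cgi-bin/",
]
for line in sample:
    print(parse_line(line))
```

The comment line yields None, while the other two lines yield the field and value pairs that make up a single record.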
Copyright © 2004-2007 Janet Systems Ltd.