Janet
Systems
 Web Robot Articles

Robots, or spiders, visit web sites to discover and analyse their content. Depending on who operates the robot, the discovered content may be indexed for use by search engines or mined for e-mail addresses.

Search engines look in the root of your domain for a file called robots.txt. This file tells the robot which files it may and may not index. Entries may refer to individual files or to whole directories. The problem with this mechanism is that an entry saying "don't index this directory" is a temptation for the curious: it is a good place to start for anyone interested in causing mischief on your site. In addition, not all robots obey the rules in the robots.txt file.
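
As a rough sketch of where a robot looks, the location of robots.txt can be derived from any page address on the site. The function name robots_url and the example.com address are illustrative assumptions, not part of the article:

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt address for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path with /robots.txt,
    # and drop any query string or fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("http://www.example.com/shop/item.html?id=3"))
# prints http://www.example.com/robots.txt
```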

The robots.txt file consists of records. Each record has two parts: a User-agent line followed by one or more Disallow lines. Each line has the format:

<field>: <value>

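A minimal sketch of reading one such line, assuming the field name ends at the first colon and that field names are compared without regard to case. The helper name parse_line is hypothetical:

```python
from typing import Optional, Tuple


def parse_line(line: str) -> Optional[Tuple[str, str]]:
    """Split one robots.txt line into (field, value).

    Returns None for lines that contain no colon, such as blank lines.
    """
    if ":" not in line:
        return None
    # Split only at the first colon so that values may contain colons.
    field, value = line.split(":", 1)
    return field.strip().lower(), value.strip()


print(parse_line("User-agent: googlebot"))  # ('user-agent', 'googlebot')
print(parse_line("Disallow: /cgi-bin/"))    # ('disallow', '/cgi-bin/')
```
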
User-agent
The User-agent line specifies the robot, for example:

User-agent: googlebot

The wildcard character * may also be used to specify all robots:

User-agent: *

Disallow
The second part of a record consists of the Disallow directive lines. These lines specify files and/or directories which are not to be indexed, for example:

Disallow: /cgi-bin/

This stops spiders from indexing the cgi-bin directory. If the value of the Disallow line is left blank, all files on the server may be retrieved.
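
The effect of these rules can be checked with Python's standard urllib.robotparser module. This is only an illustration: the sample file contents, the robot name otherbot, and the example.com addresses are made up for the purpose.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents: googlebot is kept out of /cgi-bin/,
# and every other robot gets a blank Disallow (nothing is off limits).
SAMPLE_ROBOTS_TXT = """\
User-agent: googlebot
Disallow: /cgi-bin/

User-agent: *
Disallow:
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Load robots.txt text into a parser without fetching anything."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)
print(parser.can_fetch("googlebot", "http://www.example.com/cgi-bin/search.cgi"))  # False
print(parser.can_fetch("googlebot", "http://www.example.com/index.html"))          # True
print(parser.can_fetch("otherbot", "http://www.example.com/cgi-bin/search.cgi"))   # True
```

A well-behaved robot would run a check like this before requesting each page. Nothing forces a robot to do so, which is why the file offers no real protection.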

Comments
Any line in the robots.txt file that begins with # is treated as a comment.
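
Putting the pieces together, a whole file might be read into records along these lines. This is a sketch only. It assumes that a blank line ends a record, which comes from the original robots exclusion convention rather than from this article. The function name parse_records and the dictionary layout are hypothetical.

```python
def parse_records(text: str) -> list:
    """Group robots.txt lines into records of user agents and disallowed paths."""
    records = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("#"):
            continue  # whole-line comment: ignore it
        if not line:
            current = None  # assumed: a blank line ends the current record
            continue
        if ":" not in line:
            continue  # not a field/value line
        field, value = line.split(":", 1)
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            # Start a new record unless we are still listing agents for one.
            if current is None or current["disallow"]:
                current = {"user_agents": [], "disallow": []}
                records.append(current)
            current["user_agents"].append(value)
        elif field == "disallow" and current is not None:
            current["disallow"].append(value)
    return records


sample = """\
# Keep googlebot out of the scripts directory
User-agent: googlebot
Disallow: /cgi-bin/

User-agent: *
Disallow:
"""
for record in parse_records(sample):
    print(record)
```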



Copyright © 2004-2007 Janet Systems Ltd.

Copyright 2002-2008 Janet Systems Ltd. Thursday, July 24, 2008