Use of Robots.txt

Category : SEO / SEM Views : 2547


It is nice to see a spider picking up your changes frequently, but there are cases where what appears in a search engine's index is not what you want. Take an example: imagine your page comes in two versions, one for viewing and one for printing. You need to exclude the printing version from crawling, otherwise you risk a duplicate content penalty. There are also cases where you do not want others to see sensitive data (for example, a company budget meant only for your directors) or anything else that is highly sensitive. That is data you do not want indexed by any search engine. Of course, sensitive data can also be kept offline on another machine in the data center. There may be other reasons too: if you want to save bandwidth by keeping images, style sheets and scripts out of the index, you must tell the spider to stay away from those items.

One way to tell a search engine which files and folders not to index is to use Robots Meta tags. However, not all search engines read Meta tags; some simply ignore them. The better way to tell a spider not to index your page or website is to use robots.txt.

The robots.txt file is an ASCII text file that you put on your site. It contains instructions that tell search engine robots which pages you would like them not to visit or index. Robots.txt is by no means binding on search engines (they may ignore it), but search engines normally obey what they are asked not to do. Keep in mind that robots.txt is not a firewall: installing the file does not actually prevent a search engine from crawling your site. It is like putting a note on your door that says "Please do not enter". Just as a note will not stop a thief in real life, a robot can still enter and index your pages without regard for your note (robots.txt). If you have really sensitive data, do not rely on robots.txt to keep it from being indexed or displayed in search results.

Even more important is the location of robots.txt. Put it in the main (root) directory of your site; otherwise search engines will not find it and will not know that your domain has one.

To learn more, including how the structure of the file is defined, visit www.robotstxt.org.

Using robots.txt is simple, but you do need access to your server's root location. For example, if below is the path of your file:
http://yourdomain.com/your-site/index.html

You will need to be able to create a file located here:
http://yourdomain.com/robots.txt

If you cannot access your root location, you will not be able to use robots.txt to exclude pages from the index.
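As a rough illustration of the rule above, the robots.txt address can be derived from any page address on the same host. This is a minimal sketch in Python; the function name `robots_txt_url` is made up for this example and is not part of any tool mentioned in this article.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the root-level robots.txt address for the host of page_url.

    Hypothetical helper for illustration: it keeps the scheme and host,
    and replaces the path with /robots.txt.
    """
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


if __name__ == "__main__":
    print(robots_txt_url("http://yourdomain.com/your-site/index.html"))
```

Whatever subdirectory the page lives in, the result always points at the root of the domain, which is why access to the root location is required.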

Consider another example:
user-agent: Free-Find
disallow: /your-site/test/
disallow: /your-site/cgi-bin/post.cgi?action=reply
disallow: /a

The following addresses would be ignored by the spider, because each one begins with one of the disallowed paths:
http://yourdomain.com/your-site/test/index.html
http://yourdomain.com/your-site/cgi-bin/post.cgi?action=reply&id=1
http://yourdomain.com/your-site/cgi-bin/post.cgi?action=replytome
http://yourdomain.com/abc.html

The following would be allowed:
http://yourdomain.com/your-site/test.html
http://yourdomain.com/your-site/cgi-bin/post.cgi?action=edit
http://yourdomain.com/your-site/cgi-bin/post.cgi
http://yourdomain.com/uut.html
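The matching in this example can be sketched as a simple prefix test: an address is skipped when its path (plus query string) begins with any disallowed value. The Python below is only an illustration under that assumption; `is_disallowed` is a hypothetical helper, and real spiders may apply additional rules.

```python
from urllib.parse import urlsplit

# Disallow values taken from the Free-Find example above.
DISALLOWED = [
    "/your-site/test/",
    "/your-site/cgi-bin/post.cgi?action=reply",
    "/a",
]


def is_disallowed(url: str, prefixes=DISALLOWED) -> bool:
    """Return True if the path and query of url start with a disallowed prefix."""
    parts = urlsplit(url)
    target = parts.path + ("?" + parts.query if parts.query else "")
    return any(target.startswith(prefix) for prefix in prefixes)


if __name__ == "__main__":
    for url in (
        "http://yourdomain.com/your-site/test/index.html",
        "http://yourdomain.com/your-site/cgi-bin/post.cgi?action=replytome",
        "http://yourdomain.com/your-site/test.html",
        "http://yourdomain.com/your-site/cgi-bin/post.cgi?action=edit",
    ):
        print(url, "->", "ignored" if is_disallowed(url) else "allowed")
```

Note that "disallow: /a" only matches paths that begin with "/a", so /your-site/test.html and /uut.html are not affected by it.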

It is also possible to use "allow" in addition to disallows. For example:
user-agent: Free-Find
disallow: /cgi-bin/
allow: /cgi-bin/Ultimate.cgi
allow: /cgi-bin/forumdisplay.cgi
You may not want engines to browse your whole directory. This robots.txt file prevents the spider from accessing any cgi-bin address except Ultimate.cgi and forumdisplay.cgi.
Using "allow" can often simplify your robots.txt file.
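How "allow" and "disallow" combine depends on the spider. The sketch below assumes that the longest matching rule wins, which is one common interpretation; some parsers instead apply the first matching rule in file order, which would give a different answer for this file. The helper `is_allowed` is hypothetical and only illustrates the idea.

```python
# Rules from the example above, as (kind, path prefix) pairs.
RULES = [
    ("disallow", "/cgi-bin/"),
    ("allow", "/cgi-bin/Ultimate.cgi"),
    ("allow", "/cgi-bin/forumdisplay.cgi"),
]


def is_allowed(path: str, rules=RULES) -> bool:
    """Decide access assuming the longest matching prefix wins.

    Assumption for illustration only: ties between an allow and a
    disallow of equal length are resolved in favour of allow.
    """
    matches = [
        (len(prefix), kind == "allow")
        for kind, prefix in rules
        if path.startswith(prefix)
    ]
    if not matches:
        return True  # no rule mentions this path
    # max() compares length first, then prefers allow (True) on a tie.
    return max(matches)[1]


if __name__ == "__main__":
    for path in ("/cgi-bin/Ultimate.cgi", "/cgi-bin/post.cgi", "/index.html"):
        print(path, "->", "allowed" if is_allowed(path) else "blocked")
```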

The following example shows a robots.txt with two sections in it. One is for "all" robots, and the other is for the Free-Find spider:
user-agent: *
disallow: /cgi-bin/

user-agent: Free-Find
disallow:
In this example all robots except the Free-Find spider will be prevented from accessing files in the cgi-bin directory. Free-Find will be able to access all files (a disallow with nothing after it means "allow everything").
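The section selection described here can be sketched as follows: split the file into sections by user-agent, then let a robot use its own section if one exists and fall back to the "*" section otherwise. This is a simplified Python illustration; `parse_groups` and `disallowed_prefixes` are hypothetical names, and the parser assumes one user-agent line per section and ignores "allow" lines.

```python
ROBOTS_TXT = """\
user-agent: *
disallow: /cgi-bin/

user-agent: Free-Find
disallow:
"""


def parse_groups(text: str) -> dict:
    """Map each lower-cased user-agent name to its list of disallowed prefixes."""
    groups = {}
    current = None
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            current = groups.setdefault(value.lower(), [])
        elif field == "disallow" and current is not None and value:
            # An empty disallow adds nothing, meaning "allow everything".
            current.append(value)
    return groups


def disallowed_prefixes(groups: dict, robot_name: str) -> list:
    """Use the robot's own section if present, else the '*' section."""
    return groups.get(robot_name.lower(), groups.get("*", []))


if __name__ == "__main__":
    groups = parse_groups(ROBOTS_TXT)
    print("Free-Find:", disallowed_prefixes(groups, "Free-Find"))
    print("Any other robot:", disallowed_prefixes(groups, "OtherBot"))
```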

Many tools are available to help you generate a robots.txt file for your main directory. If you know the basic syntax of robots.txt, you can always read the file yourself to check that everything is fine, but an easier way is to use a validator such as http://tool.motoricerca.info/robots-checker.phtml. These tools help you catch common errors, such as a missing colon or hyphen. Let us examine the following example. You have typed:
User agent: *
Disallow: /temp/
This is wrong because there is no hyphen between "User" and "agent", so the syntax is invalid.

There are also cases where you have a complex robots.txt file, that is, one that gives different instructions to different user agents or has a long list of directories and subdirectories to exclude. Writing such a file by hand is painful. Many tools can do this work for you and save you time. Better still, there are visual tools that help you select which files and folders to exclude. If you are not interested in purchasing a graphical tool to generate robots.txt, you can use online tools instead. http://www.submitcornet.com/tools/robots.txt/Server.shtml is one example: it presents a drop-down menu of user agents and lets you choose which files and folders you do not want included. It is not much help, though, unless you want to set specific rules for different engines.
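The kind of check a validator performs, such as catching the "User agent" mistake described above, can be sketched in a few lines of Python. This is not how any of the tools named here actually work; `find_syntax_problems` and the short list of known fields are assumptions made for this illustration.

```python
# Field names this sketch accepts; real validators know more.
KNOWN_FIELDS = {"user-agent", "disallow", "allow"}


def find_syntax_problems(text: str) -> list:
    """Return (line number, message) pairs for lines that look malformed."""
    problems = []
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # ignore comments and blanks
        if not line:
            continue
        if ":" not in line:
            problems.append((number, "missing colon"))
            continue
        field = line.split(":", 1)[0].strip().lower()
        if field not in KNOWN_FIELDS:
            problems.append((number, f"unknown field '{field}'"))
    return problems


if __name__ == "__main__":
    # The faulty example from the text: no hyphen in "User agent".
    for number, message in find_syntax_problems("User agent: *\nDisallow: /temp/"):
        print(f"line {number}: {message}")
```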

Engines normally look in the main directory first (http://www.yourdomainname.com/robots.txt), and if the file is not there, they assume that you do not have a robots.txt.

If you are interested in viewing the history for robots.txt, please follow the link below.
http://www.whitehouse.gov/robots.txt



Copyright © 2007 NR Concepts Ltd. All rights reserved.