How To Use The Robots.txt File?

The robots.txt file is sometimes ignored, yet it is an important factor in getting webpages indexed properly and it is very easy to set up.

I know that robots.txt is nothing new. But I’ve been preparing an SEO sheet for a while and wanted to share this small & useful portion with you.

What is robots.txt?

Robots.txt is a plain text file used to exclude content from the crawling process of search engine spiders / bots. The convention it follows is also called the Robots Exclusion Protocol.

Why use robots.txt?

In general, we want our webpages to be indexed by search engines. But there may be some content that we don’t want crawled & indexed: a personal images folder, a website administration folder, a web developer’s customer test folder, folders with no search value such as cgi-bin, and many more. The main idea is that we don’t want them indexed.
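For instance, assuming hypothetical folder names such as /personal-images/, /admin/, /test/ and /cgi-bin/, a robots.txt along these lines would keep well-behaved bots out of all of them:

User-agent: *
Disallow: /personal-images/
Disallow: /admin/
Disallow: /test/
Disallow: /cgi-bin/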

Is the robots.txt file a guaranteed solution?

No. Standards-based bots like Google’s, Yahoo’s or the other big search engines’ robots obey your robots.txt file because they are programmed to. A bot can just as easily be configured to ignore the robots.txt file. Result: there is no guarantee.

How to use the robots.txt file?

The robots.txt file has some simple directives that manage how bots crawl the site. These are:

  • User-agent: this parameter defines which bots the following rules apply to. * is a wildcard meaning all bots; a specific bot such as Googlebot (Google’s crawler) can also be named.
  • Disallow: defines which folders or files will be excluded. An empty value means nothing is excluded, / means the whole site is excluded, and /foldername/ or /filename can be used to exclude specific paths. A value with a trailing slash like /foldername/ excludes everything inside that folder, while /foldername without the trailing slash excludes every path that starts with it, e.g. /foldername.html as well as /foldername/ (see the snippet after this list).
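To illustrate the difference, assuming a hypothetical folder named /private/, the two alternative rules below match different sets of URLs (standards-based bots treat the value as a path prefix):

User-agent: *
Disallow: /private/ #excludes everything under the /private/ folder only

User-agent: *
Disallow: /private #excludes /private/ and also any path starting with /private, e.g. /private-notes.html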

There are also some other parameters which are supported only by some crawlers, as they are not part of the original standard (a combined example follows the list). These are:

  • Allow: this parameter works as the opposite of Disallow. You can specify which content is allowed to be crawled, which is useful for re-allowing a path inside a disallowed folder. * is a wildcard.
  • Request-rate: defines the pages/seconds crawl ratio. 1/20 means 1 page every 20 seconds.
  • Crawl-delay: defines how many seconds the bot should wait between successive requests.
  • Visit-time: defines between which hours you want your pages to be crawled. Example usage: 0100-0330, which means pages will be crawled between 01:00 AM and 03:30 AM GMT.
  • Sitemap: this parameter points to where your sitemap file is. You must use the complete URL address of the file.
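As a rough sketch, assuming a hypothetical site at http://www.example.com, these non-standard parameters could be combined like this (only bots that understand them will honor the extra lines):

User-agent: *
Disallow: /admin/
Allow: /admin/public/ #re-allow one subfolder inside the disallowed folder
Request-rate: 1/20 #crawl at most 1 page every 20 seconds
Crawl-delay: 20 #wait 20 seconds between requests
Visit-time: 0100-0330 #crawl only between 01:00 and 03:30 GMT
Sitemap: http://www.example.com/sitemap.xml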

Robots.txt example:

User-agent: * #the rules below apply to all search engine spiders.
Disallow: /secretcontent/ #do not let them crawl the /secretcontent/ folder.

Resources:
http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=40360
http://www.robotstxt.org/
http://www.searchtools.com/robots/robots-txt.html
http://en.wikipedia.org/wiki/Robots.txt

  • http://www.creativebrain.web.id creativebrain

    Hehe, I only knew the Disallow and Allow parameters, but there are others too. Thanks for your post 😀

    Btw, how do you use the Request-rate, Crawl-delay, Visit-time, and Sitemap parameters? Can you give me some examples?

  • http://franksite1.com Franksite1

    Hey thanks for this information. It has really helped me out in figuring out the first steps in removing my old website. Thanks again! :)

  • http://www.bestalljobs.com Ganeshbabu

    Can you give some more examples of allowed and disallowed content? That would be very useful to robots.txt beginners.

  • http://puneetq3tech.wordpress.com John Chris

    How do you generate robots.txt, and where should the file be placed in the site?

  • http://www.webresourcesdepot.com Umut M.

    @John,
    It is just a text file that you create, name robots.txt, and place in the website’s root folder.

  • http://www.jrrmaster.com/ JrrMaster

    Here is an example robots.txt file:

    User-agent: *
    Disallow: /
    Allow: /private
    Request-rate: 1/10
    Crawl-delay: 10 #after each successful request, the bot should wait 10 seconds before crawling the next page.
    Visit-time: 0100-0330
    Sitemap: http://show/your/url/to/sitemap.php

  • http://sklepkarm.pl karma dla psow

    hi!
    I have one question: how can I disallow a shop’s category pages but not the products inside those categories?
