How to implement a robots.txt file for better crawling?
There is an out-of-sight, persistent force that permeates the web and
its countless pages and files, unbeknownst to most of us. We are talking
about search engine crawlers and robots. Every day thousands of them go
out and scour the web, whether it's a search engine trying to index the
entire web or a spam bot grabbing any email address it can find for less
than worthy intentions. As a web developer, what little control you have
over what robots are permitted to do when they visit your site lives in
a small but powerful file called "robots.txt".
Robots.txt is a text file that search engines read while crawling your
site. Ordinary visitors rarely see it, though it is publicly readable.
Its purpose is to give instructions to robots: you can state which parts
of the site may be crawled and which may not. Sometimes there are pages
you do not want search engines such as MSN, Google and Yahoo to fetch.
By defining a few rules in this text file, you can instruct robots not
to crawl and index certain files or folders within your site, or the
site as a whole. For example, you may not want a search engine to crawl
the “images” folder of your website, as it's both of no value to you and
a waste of your site's bandwidth. "Robots.txt" lets you tell search
engines just that.
"robots.txt" file creation and implementation
Create a regular text file called "robots.txt", and make sure it's named
exactly that. This file must be uploaded to the root directory of your
website, not a subfolder. Only when both of these rules are followed
will search engines interpret the instructions contained in the file.
Deviate from them, and "robots.txt" becomes nothing more than an
ordinary text file.
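To make the "root directory" rule concrete, here is a minimal Python sketch of where a crawler would look for the file, given any page on a site. The function name robots_txt_url and the example.com address are placeholders chosen for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt location for the site that hosts page_url."""
    parts = urlsplit(page_url)
    # Keep only the scheme and host; the path is always /robots.txt,
    # no matter how deep inside the site the page lives.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("https://www.example.com/blog/2020/post.html"))
# prints https://www.example.com/robots.txt
```

A file uploaded to, say, /blog/robots.txt would never be requested at that address, which is why the subfolder placement has no effect.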
Now that you have learned what to name your text file and where it
should be uploaded, you need to learn what to actually put in it to send
commands to search engines that follow this protocol. The structure is
straightforward for most intents and purposes: a User-agent: line to
identify the crawler in question, followed by one or more Disallow:
lines to disallow it from crawling certain parts of your website.
1) A basic "robots.txt":
User-agent: *
Disallow: /
The symbol ‘*’ is a wildcard that here means “all robots”. In the above
2 lines, the first line says that the rule applies to every crawler, and
the second line instructs those crawlers to crawl nothing on the
website.
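If you want to check how a well-behaved crawler would read the two lines of example 1, Python's standard urllib.robotparser module can evaluate them. This is only a sketch: the helper name can_crawl, the bot name "ExampleBot" and the example.com address are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# The two lines from example 1, unchanged.
BLOCK_ALL = """\
User-agent: *
Disallow: /
"""


def can_crawl(agent: str, url: str) -> bool:
    """Return True if the rules above let this agent fetch the URL."""
    parser = RobotFileParser()
    parser.parse(BLOCK_ALL.splitlines())
    return parser.can_fetch(agent, url)


for agent in ("Googlebot", "ExampleBot"):
    print(agent, can_crawl(agent, "https://www.example.com/index.html"))
# prints "Googlebot False" and then "ExampleBot False"
```

Every robot is refused on every path, so this pair of lines closes the whole site to crawlers that honour the protocol.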
2) Let us get a little more selective now. While everyone likes Google,
you may not want Google's image robot crawling your site's images and
making them searchable online, if only to save bandwidth. The following
code will do the trick:
User-agent: Googlebot-Image
Disallow: /
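A quick way to see that a rule like this is aimed at one robot only is to test it against two different user agents. The sketch below uses Python's urllib.robotparser and the token Googlebot-Image, which is the name Google documents for its image crawler; the helper name and the example.com URL are illustrative assumptions.

```python
from urllib.robotparser import RobotFileParser

IMAGE_BOT_RULES = """\
User-agent: Googlebot-Image
Disallow: /
"""


def agent_allowed(agent: str, url: str) -> bool:
    """Return True if the rules above let this agent fetch the URL."""
    parser = RobotFileParser()
    parser.parse(IMAGE_BOT_RULES.splitlines())
    return parser.can_fetch(agent, url)


photo = "https://www.example.com/photos/cat.jpg"
print(agent_allowed("Googlebot-Image", photo))  # prints False
print(agent_allowed("Googlebot", photo))        # prints True
```

The image robot is shut out, while the regular Googlebot and every other crawler are left untouched because no group in the file names them.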
3) The following code disallows all search engines and robots from
crawling select directories and pages:
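One way such a file might look is sketched below, under stated assumptions. The three paths in the first group (/cgi-bin/, /private/ and /drafts/old-page.html) are hypothetical, and the Googlebot group, which relies on the widely supported Allow directive, is only one possible way of limiting Google to the Images and uploads folders. The rules are wrapped in Python so they can be checked with the standard urllib.robotparser module.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rule set: the paths in the first group are invented for
# illustration, and the Googlebot group is one possible arrangement.
SELECTIVE_RULES = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /private/
Disallow: /drafts/old-page.html

User-agent: Googlebot
Allow: /Images/
Allow: /uploads/
Disallow: /
"""


def is_allowed(agent: str, path: str) -> bool:
    """Return True if the rules above let this agent fetch the path."""
    parser = RobotFileParser()
    parser.parse(SELECTIVE_RULES.splitlines())
    return parser.can_fetch(agent, "https://www.example.com" + path)


print(is_allowed("ExampleBot", "/index.html"))           # prints True
print(is_allowed("ExampleBot", "/private/report.html"))  # prints False
print(is_allowed("Googlebot", "/Images/logo.png"))       # prints True
print(is_allowed("Googlebot", "/index.html"))            # prints False
```

Each Disallow line blocks one directory or one page, so you can list as many as you need under a single User-agent line.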
Rules written this way can be very precise. Here, Google reads only the
folders Images and uploads.
Meta tag instructions for robots:
If you cannot place a robots.txt file in the root path, you can control
access page by page using meta tags instead. A robots meta tag placed in
the head section of a page tells crawlers how to treat that page.
1. <meta name="robots" content="index">
This allows the crawler to index this page.
2. <meta name="robots" content="noindex">
This tells the crawler not to index this page.
3. <meta name="robots" content="follow">
This allows the crawler to follow all the links on this page.
4. <meta name="robots" content="nofollow">
This tells the crawler not to follow any of the links on this page.
5. <meta name="robots" content="none">
This tells the crawler neither to index this page nor to follow any of
the links on it.
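To see how a crawler might interpret these tags, here is a small Python sketch built on the standard html.parser module. The class and function names are invented for illustration, and the sketch assumes the usual default that a page with no robots meta tag may be both indexed and followed.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect index/follow permissions from <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        # Assumed default when no tag is present: index and follow.
        self.index = True
        self.follow = True

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        # Several values can be combined, separated by commas.
        for token in (attributes.get("content") or "").lower().split(","):
            token = token.strip()
            if token in ("noindex", "none"):
                self.index = False
            if token in ("nofollow", "none"):
                self.follow = False


def robots_directives(html: str) -> tuple:
    """Return (may_index, may_follow) for the given HTML source."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.index, parser.follow


page = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
print(robots_directives(page))  # prints (False, True)
```

Note that the value "none" switches off both permissions at once, which is why it appears in both checks.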