Web robots are automated programs that crawl the Internet, finding and processing data from the sites they visit. The most prevalent examples of these robots are search engine crawlers. However, for privacy or other reasons, webmasters can hide some of their sites’ content from these robots. This is done through a robots.txt file, also known as the “robots exclusion protocol”, which you can create with the Robots.txt Generator Tool.
When a robot wishes to visit a site, one of the first things it looks for is the robots.txt file, such as one made with the Robots.txt Generator Tool. The file sits at the root of the site’s hierarchy and lists the files and directories that crawlers are not allowed to process.
The file syntax goes like this:
User-agent: *
Disallow: /
The User-agent line defines which robots the exclusion applies to. Here it is set to a wildcard (*), meaning every robot. The Disallow line defines which pages, files, or directories the robot is not allowed to visit. Here, the “/” denotes all files and directories -- in short, the crawler will not be allowed to visit any page at all.
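If you want to test how a compliant crawler would read these rules, Python’s standard urllib.robotparser module can evaluate them. The following is only a minimal sketch of that idea; the bot name and URLs are placeholders, not real sites.

from urllib import robotparser

rules = [
    "User-agent: *",
    "Disallow: /",
]

parser = robotparser.RobotFileParser()
parser.parse(rules)  # parse() accepts the file's lines as a list of strings

# With "Disallow: /" applied to the wildcard user-agent, no path may be fetched.
print(parser.can_fetch("AnyBot", "https://www.example.com/"))           # False
print(parser.can_fetch("AnyBot", "https://www.example.com/blog/post"))  # False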
First things first -- despite being called a “protocol”, following the directions in the robots.txt file is not mandatory. The protocol is a de facto convention that is not owned by any standards body, so universal compliance cannot be expected. Malware robots, for example, scan the Internet for vulnerabilities and may even use the robots.txt file made by the Robots.txt Generator Tool as a directory of which pages to visit first. Email harvesters and spambots will also ignore the robots.txt file.
Likewise, a robots.txt file made via the Robots.txt Generator Tool is not an effective way to hide information. The file is publicly available and can be read by anyone.
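You can confirm this yourself: a robots.txt file is served like any other page, so a few lines of Python are enough to read one (the domain below is only a placeholder):

import urllib.request

# Fetch and print a site's robots.txt; anyone on the web can do the same.
with urllib.request.urlopen("https://www.example.com/robots.txt") as response:
    print(response.read().decode("utf-8", errors="replace"))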
If you really want to hide a page without exposing it in the robots.txt file, you can use the following workaround:
1. Place all files you want excluded from crawls in a separate sub-directory, and configure your server so that this directory cannot be listed on the web.
2. Fill in the necessary fields in the Robots.txt Generator Tool and list only the name of this directory in the robots.txt file (see the sketch after this list). This way, even malicious robots will not be able to locate the files, unless you or another user places a direct link to them somewhere on the web.
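As a rough sketch of this setup, assuming an Apache server where .htaccess overrides are permitted and a hypothetical sub-directory named /private/, the pieces might look like this:

# .htaccess file placed inside the /private/ sub-directory: disable directory listings
Options -Indexes

The robots.txt file would then mention only the directory name, never the individual files inside it:

User-agent: *
Disallow: /private/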
To create a file with the Robots.txt Generator Tool, just fill in the necessary fields, provide your sitemap URL, and then hit the create robots.txt button.
The robots.txt file itself is a single text file, usually containing a single record. It follows the same syntax discussed above, but each exclusion has to be placed on its own Disallow line. For example:
User-agent: *
Disallow: /~me/
Disallow: /tmp/
Disallow: /cgi-bin/
You should not leave any blank lines in the record, since such a line is used to mark boundaries between multiple records.
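For example, a file with two records, using a hypothetical crawler name of ExampleBot, could look like this; the blank line separates the record for ExampleBot from the record that applies to every other robot:

User-agent: ExampleBot
Disallow: /tmp/

User-agent: *
Disallow: /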
The file can be created with any software that saves plain text, such as Notepad or WordPad on Windows, or TextEdit on Mac (saved as Western encoding and formatted with “Make Plain Text”). On Linux, several editors are available, including vi and emacs.
Simply put, the robots.txt file that you created and downloaded from the Robots.txt Generator Tool must be placed in the top-level directory of your web server. When a robot looks for the file, it strips the path component from the URL and replaces it with “/robots.txt”. As the webmaster, this means you need to place the file where it will work, which is the same place where the index.html file is located. Also, remember that you need to name the file correctly (all lowercase). Done correctly, this can go a long way toward helping you control what search engines see when they index your site.
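To make that lookup concrete, here is a small Python sketch of how a crawler might turn any page URL into the matching robots.txt URL; the page URL below is a placeholder.

from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url):
    # Keep the scheme and host, drop the path, query, and fragment,
    # and point at /robots.txt instead.
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("https://www.example.com/blog/posts/hello.html?id=7"))
# Prints: https://www.example.com/robots.txt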