Today we will take a closer look at the robots.txt file – what it is, why it is needed, and how to work with it. The term robots.txt is described on many sites and blogs, yet the articles on this topic differ significantly from one another, and as a result readers get tangled up in them like a fish in a net.
The robots.txt file – what kind of beast is it?
Robots.txt is a file: a standard text document saved in UTF-8 encoding. It is created specifically to work with protocols such as HTTP, HTTPS, and FTP.
The file performs an important function: it shows the search robot exactly what needs to be scanned and what is closed to scanning.
All the rules, requirements, and recommendations specified in robots.txt apply only to the specific host, protocol, and port number where the file itself is located.
By the way, robots.txt itself sits in the root directory of the site and is a standard text document. Its address is https://admin.com/robots.txt, where admin.com is the name of your site.
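If you want to check how robots will interpret your file, Python's standard urllib.robotparser module offers a quick way to test it. Here is a minimal sketch, assuming the example domain admin.com from above and a purely hypothetical /catalog/page.html URL:

import urllib.robotparser

# Point the parser at robots.txt in the site root (admin.com is just the example domain used above)
rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://admin.com/robots.txt")
rp.read()

# Ask whether any robot ("*") may fetch a hypothetical page
print(rp.can_fetch("*", "https://admin.com/catalog/page.html"))

If this prints True, the page is open for scanning under the rules in the file.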
In other files, a special mark called the Byte Order Mark (BOM for short) is placed at the beginning. This mark is the Unicode character U+FEFF – it is required to establish the byte order in which the information is read.
At the beginning of robots.txt, however, this mark is ignored.
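If you want to make sure your own file was saved in plain UTF-8 without a BOM, here is a minimal sketch in Python (the local file name robots.txt is just an assumption):

import codecs

# Read the raw bytes of a locally saved robots.txt
with open("robots.txt", "rb") as f:
    data = f.read()

# codecs.BOM_UTF8 is the byte sequence EF BB BF, i.e. U+FEFF encoded in UTF-8
if data.startswith(codecs.BOM_UTF8):
    print("The file starts with a BOM (U+FEFF)")
else:
    print("Plain UTF-8, no BOM")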
As for the technical characteristics of robots.txt, it is worth noting that its format is described in BNF notation and follows the rules of RFC 822.
What exactly does the file process, and how?
By reading the commands specified in the file, search robots receive one of the following instructions:
- scanning only individual pages – this is called partial access;
- scanning the entire site as a whole – full access;
- ban on scanning.
When processing the site, robots receive certain responses, which may be as follows:
- 2xx – the site was scanned successfully;
- 3xx – the robot follows the redirect until it receives a different response. In most cases it takes up to five attempts to get a response other than 3xx; if no such response is received within five attempts, a 404 error is recorded;
- 4xx – the robot assumes there are no restrictions and that it may scan the entire site;
- 5xx – such a response is treated as a temporary server error, and scanning is prohibited. The search robot will keep "knocking" on the file until it receives another answer. Googlebot also evaluates whether the responses are configured correctly: if a 5xx response is returned instead of the traditional 404 error, the robot will process that page as if it had returned 404.
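To see which of these response codes your own robots.txt returns, you can check it with Python's standard urllib module – a minimal sketch, again using the example domain admin.com:

import urllib.request
import urllib.error

url = "https://admin.com/robots.txt"  # substitute your own domain
try:
    # urlopen follows 3xx redirects automatically
    with urllib.request.urlopen(url) as response:
        print(response.status)  # 2xx – the file was fetched successfully
except urllib.error.HTTPError as e:
    print(e.code)  # 4xx or 5xx – see the list above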
Robots.txt file directives – what are they needed for?
For example, there are situations when it is necessary to restrict robots' access to certain pages (a sample set of rules is shown after the list below):
- pages containing personal information of the owner;
- pages containing forms for submitting information;
- site mirrors;
- pages that display search results, etc.
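A sample set of rules for such cases might look like this (the paths here are purely hypothetical – substitute the ones your site actually uses):

User-agent: *
# hypothetical paths for the cases listed above
Disallow: /admin/
Disallow: /feedback-form/
Disallow: /search/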
How to create a robots txt file: detailed instructions
You can use virtually any text editor to create such a file, for example:
- Notepad;
- Notebook;
- Sublime et al.
This “document” contains the User-agent instruction and the Disallow rule, as well as other, less important but still necessary rules and instructions for search robots.
User-agent: who is allowed in and who is not
The most important part of the “document” is the User-agent. It indicates exactly which search robots should “look” at the instructions described in the file itself.
There are currently 302 known robots. To avoid listing each individual robot in the document, specify the following entry in the file:
User-agent: *
This mark indicates that the rules in the file are oriented to all search robots.
Google's main search robot is Googlebot. For the rules to apply only to it, write the following in the file:
User-agent: Googlebot
If there is such an entry in the file, all other search robots will evaluate the site according to their own default directives, as if robots.txt were empty.
Yandex's main search robot is simply called Yandex, and for it the entry in the file will look like this:
User-agent: Yandex
The same applies here: with such an entry, all other search robots will evaluate the site according to their default directives, as if robots.txt were empty.
Other special search robots
- Googlebot-News – used to scan news posts;
- Mediapartners-Google – designed specifically for the Google AdSense service;
- AdsBot-Google – evaluates the overall quality of a specific landing page;
- YandexImages – indexes Yandex images;
- Googlebot-Image – scans images;
- YandexMetrika – the Yandex.Metrica service robot;
- YandexMedia – a robot that indexes multimedia;
- YaDirectFetcher – the Yandex.Direct robot;
- Googlebot-Video – indexes videos;
- Googlebot-Mobile – created specifically for the mobile versions of sites;
- YandexDirectDyn – a robot for generating dynamic banners;
- YandexBlogs – a blog-search robot that scans not only posts but also comments;
- YandexDirect – analyzes the content of Advertising Network partner sites in order to determine each site's topic and select relevant ads more efficiently;
- YandexPagechecker – a micro-markup validator.
We will not list the other robots here but, again, there are more than 300 of them in total, and each is focused on certain parameters.
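To address any one of these robots individually, put its name in the User-agent line. For example, to keep a hypothetical /images/ folder away from Googlebot-Image only:

User-agent: Googlebot-Image
Disallow: /images/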
What is Disallow?
Disallow – indicates what must not be scanned on the site. To leave the entire site open for scanning by search robots, insert the entry:
User-agent: *
Disallow:
And if you want the entire site to be closed for scanning by search robots, enter the following “command” in the file:
User-agent: *
Disallow: /
Such an entry is useful when the site is not yet completely ready and you still plan to make changes to it, but do not want it to appear in search results in its current state.
Here are a few more examples of how to write particular commands in the robots.txt file.
To prevent robots from viewing a specific folder on the site:
User-agent: *
Disallow: /papka/
To block a specific URL from crawling:
User-agent: *
Disallow: /private-info.html
To close a specific file from scanning:
User-agent: *
Disallow: /image/file-name.extension
To close all files with a specific extension from scanning:
User-agent: *
Disallow: /*.extension$ (written with no spaces)
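For instance, to close all PDF files on the site from scanning (a hypothetical case), the entry would be:

User-agent: *
Disallow: /*.pdf$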
Allow – a command for guiding robots
Allow – this command gives permission to scan certain data:
- file;
- directories;
- pages, etc.
As an example, consider a situation where it is important that robots can only view pages that start with /catalog, while all other content on the site must be closed. The command in the robots.txt file will look like this:
User-agent: *
Allow: /catalog
Disallow: /
Host in the robots.txt file, or how to choose a mirror for your site
Adding the Host directive to the robots.txt file is one of the first things you should do. It is there so that the search robot understands which mirror of the site should be indexed and which should not be taken into account when scanning the site's pages.
Such a command lets the robot avoid confusion when it detects mirrors and tells it which mirror of the resource is the main one – the one specified in the robots.txt file.
The site address is specified without the "http://" prefix; however, if your resource runs on HTTPS, the corresponding "https://" prefix must be included.
This rule is written as follows:
User-agent: * (or the name of a specific robot)
Allow: /catalog
Disallow: /
Host: site.ua
If the site is using HTTPS, the command will be written as follows:
User-agent: * (or the name of a specific robot)
Allow: /catalog
Disallow: /
Host: https://site.ua
Sitemap – what is it and how to work with it?
The Sitemap directive is needed to tell search bots that a map of all the site URLs open for crawling and indexing is located at https://site.ua/sitemap.xml.
During each visit and crawl of the site, the search robot will study exactly what changes have been made to this file, thereby updating the information about the site in its database.
Here is how these "commands" are written in the robots.txt file:
User-agent: *
Allow: /catalog
Disallow: /
Sitemap: https://site.ua/sitemap.xml
Crawl-delay – if the server is weak
Crawl-delay is a necessary parameter for sites hosted on weak servers. It lets you set the minimum interval between requests for pages of your resource.
Weak servers do respond with delays when search robots access them too often, which is why this interval is specified in seconds.
Here is an example of how this command is written:
User-agent: *
Allow: /catalog
Disallow: /
Crawl-delay: 3
Clean-param – if the site has duplicate content
Clean-param is designed to deal with GET parameters. It is needed to prevent duplication of content that would otherwise be available to search robots at various dynamic addresses. Such addresses appear when the resource uses different sorting options and the like.
For example, a specific page may be available at the following addresses:
- www.vip-site.com/foto/tele.ua?ref=page_1&tele_id=1
- www.vip-site.com/foto/tele.ua?ref=page_2&tele_id=1
- www.vip-site.com/foto/tele.ua?ref=page_3&tele_id=1
In a similar situation, the following command will be present in the robots.txt file:
User-agent: Yandex
Disallow:
Clean-param: ref /foto/tele.ua
In this case, the ref parameter only shows where the link came from, so it is listed first, followed by the path to which the rule applies.
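If several parameters create duplicates at once, they can be listed in a single rule separated by the & sign. A sketch, assuming hypothetical ref and sort parameters in the /foto/ section:

User-agent: Yandex
Disallow:
Clean-param: ref&sort /foto/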
What characters are used in robots.txt
In order not to be mistaken when writing a file, you should know all the characters that are used, and also understand their meaning.
Here are the main characters:
/ – used to close something off from scanning by search robots. For example, /catalog/ – with a slash at both the beginning and the end – closes that entire folder of the site from scanning. If the command looks like /catalog (no trailing slash), then every link on the site that begins with /catalog will be closed.
* – denotes any sequence of characters and is implied at the end of every rule.
For example, the entry:
User-agent: *
Disallow: /catalog/*.gif$
Such an entry says that all robots are prohibited from scanning and indexing files with the .gif extension located in the /catalog folder of the site.
$ – used to restrict the effect of the * sign. For example, if you want to prohibit the /catalog page itself but not URLs that merely contain /catalog followed by other characters, you make the following entry:
User-agent: *
Disallow: /catalog$
# – this symbol is intended for comments and notes that the webmaster leaves for himself or for other webmasters who will also work with the site. Search robots ignore everything that follows it on the line.
The record will look like this (for example):
User-agent: *
Allow: /catalog
Disallow: /
Sitemap: https://site.ua/sitemap.xml
# instructions
Perfect robots.txt file: what is it?
Here is an example of a virtually perfect robots.txt file, which is suitable, if not for everyone, then for many sites.
User-agent: *
Disallow:
User-agent: GoogleBot
Disallow:
Host: https://site.ua
Sitemap: https://site.ua/sitemap.xml
Let's analyze what this robots.txt file does. It allows all pages of the site and all the content posted there to be indexed. It also specifies the host and the sitemap, so search engines will see all the addresses that are open for indexing.
In addition, recommendations for Googlebot are specified separately.
However, you should not simply copy this file for your own site. First, each resource needs its own rules and recommendations, which depend directly on the platform the site is built on. So keep all the rules for filling out the file in mind.
Other errors
- Errors in the file name. The name is only robots.txt – not Robots.txt, not ROBOTS.TXT, and not anything else!
- The User-agent rule must be filled in – specify either which particular robot should follow the rules, or * for all of them.
- The presence of extra characters.
- The presence in the file of pages that should not be in it.
What we learned about the robots.txt file
The robots.txt file plays an important role for every individual site. In particular, it is needed to set certain rules for search robots and to help promote your site and your company.