Blog Technical SEO What Is Robots.txt File and How to Configure It Correctly

What Is Robots.txt File and How to Configure It Correctly

Roman Rohoza

Dec 8, 2023 [Updated on Jan 31, 2024]

What Is Robots.txt File and How to Configure It Correctly

Free Complete Site Audit

Access a full website audit with over 300 technical insights.

Trusted by

Free Website SEO Checker & Audit Tool

Scan the site for 300+ technical issues
Monitor your site health 24/7
Track website rankings in any geo

Get started

What is Robots.txt File

Robots.txt file serves to provide valuable data to the search systems scanning the web. Before examining the pages of your site, the searching robots perform verification of this file. Due to such procedure, they can enhance the efficiency of scanning. This way you help searching systems to perform the indexation of the most important data on your site first. But this is only possible if you have correctly configured robots.txt.

Just like the directives of robots.txt file, the noindex instruction in the meta tag robots is no more than just a recommendation for robots. That is the reason why they cannot guarantee that the closed pages will not be indexed and will not be included in index. Guarantees in this concern are out of place. If you need to close for indexation some part of your site, you can use a password to close the directories.

Important! For the noindex directive to be effective, the page must not be blocked by a robots.txt file. If the page is blocked by a robots.txt file, the crawler will never see the noindex directive, and the page can still appear in search results, for example if other pages link to it.
Google Search Console Help

If your website has no robot txt file, your website will be crawled entirely. It means that all website pages will get into the search index which can cause serious problems for SEO.

Robots.txt Syntax

User-Agent: the robot to which the following rules will be applied (for example, “Googlebot”). The user-agent string is a parameter which web browsers use as their name. But it contains not only the browser’s name but also the version of the operating system and other parameters. Due to user agent you can determine a lot of parameters: the name of operating system, its version; check the device on which the browser is installed; define the browser’s functions.

Disallow: the pages you want to close for access (when beginning every new line you can include a large list of the directives alike). Every group User-Agent / Disallow should be divided with a blank line. But non-empty strings should not occur within the group (between User-Agent and the last directive Disallow). Use the directive carefully because some important pages can be Disallowed by robots.txt by mistake.

Hash mark (#) can be used when needed to leave commentaries in the robots.txt file for the current line. Anything mentioned after the hash mark will be ignored. This comment is applicable both for the whole line and at the end of it after the directives. Catalogues and file names are sensible of the register: the searching system accepts «Catalog», «catalog», and «CATALOG» as different directives.

Host: is used for Yandex to point out the main mirror site. That is why if you perform 301 redirect per page to stick together two sites, there is no need to repeat the procedure for the file robots.txt (on the duplicate site). Thus, Yandex will detect the mentioned directive on the site which needs to be stuck.

Crawl-delay: you can limit the speed of your site traversing which is of great use in case of high attendance frequency on your site. Such an option is enabled due to avoid problems with an extra load of your server caused by the diverse searching systems processing information on the site.

Regular expressions: to provide more flexible settings of directives, you can use two symbols mentioned below:
* (star) – signifies any sequence of symbols,
$ (dollar sign) – stands for the end of the line.

Useful links: Google Guideline on Creating Robots.txt and Guide on Full Robots.txt Syntax

How to Configure Robots.txt: Rules and Examples

Ban on the entire site scanning

User-agent: *

Disallow: /

This instruction needs to be applied when you create a new site and use subdomains to provide access to it.
Very often when working on a new site, web developers forget to close some part of the site for indexation and, as a result, index systems process a complete copy of it. If such a mistake took place, your master domain needs to undergo 301 redirect per page.

Permission to crawl the entire site

User-agent: *

Disallow:

Ban on the crawling of a particular folder

User-agent: Googlebot

Disallow: /no-index/

Ban on the crawling page for the certain bot

User-agent: Googlebot

Disallow: /no-index/this-page.html

Ban on the crawling of a certain type of files

User-agent: *

Disallow: /*.pdf$

Permission to crawl a page for the certain bot

User-agent: *

Disallow: /no-bots/block-all-bots-except-rogerbot-page.html

User-agent: Yandex

Allow: /no-bots/block-all-bots-except-Yandex-page.html

Website link to sitemap

User-agent: *

Disallow:

Sitemap: http://www.example.com/none-standard-location/sitemap.xml

In general it is considered to be an issue when Robots.txt file does not contain a link to XML sitemap file.

Peculiarities to take into consideration when using this directive if you are constantly filling your site with unique content:

do not add a link to your sitemap in robots text file;
choose some unstandardized name for the sitemap.xml file (for example, my-new-sitemap.xml and then add this link to the searching systems using webmasters).

Great many unfair webmasters parse the content from other sites but their own and use them for their own projects.

Check your website pages for indexation status

Detect all noindexed URLs on and find out what site pages are allowed to be crawled by search engine bots

Disallow or Noindex

If you don’t want some pages to undergo indexation, noindex in meta tag robots is more advisable. To implement it, you need to add the following meta tag in the section of your page:

<meta name="robots" content="noindex, follow">

Using this approach, you will:

avoid indexation of certain page during the web robot’s next visit (you will not need then to delete the page manually using webmasters);
manage to convey the link juice of your page.

Robots.txt is better to close such types of pages:

administrative pages of your site;
search data on the site;
pages of registration/authorization/password reset.

Also, you can check out the robots.txt tutorial by Darren Taylor, founder of SEM academy.

How Robots.txt Can Help Your SEO Strategy

First of all, it’s all about crawling budget. Each site has own crawling budget which is estimated by search engines personally. Robots.txt file prevents your website from crawling by search bots unnecessary pages, like duplicate pages, junk pages and not quality pages. The main problem is that the index of search engines gets something that should not be there – pages that do not carry any benefit to people and just litter the search. You can easily find out how to check duplicate content on website with our guides.

But how can it harm SEO? The answer is easy enough. When search bots are getting to the website for crawling, they are not programmed to explore the most important pages. Often they scan the entire website with all its pages. So the most important pages can be simply not scanned due to the limited crawling budget. Therefore Google or any other search engine starts to range your website regarding information it has received. This way, your SEO strategy is in danger to be failed because of not relevant pages.

Check if This URL is Blocked by Robots.txt File with Sitechecker’s Tool

The Robots.txt Tester by SiteChecker is crucial for website developers and SEO professionals. It analyzes and tests robots.txt files, key in directing search engine bots on which site parts to crawl and index. Proper configuration of this file is essential for SEO, ensuring correct search engine access and interpretation of site content. This tool aids in optimizing website visibility and preventing indexing issues.

Tool goes beyond basic testing by highlighting syntax errors and offering optimization recommendations. It helps users understand how search engine bots interpret their robots.txt file, ensuring alignment with SEO goals. The tool also suggests improvements to prevent errors that could impact search performance, making it an invaluable resource for maintaining a website’s search engine friendliness and overall health.

Validate Your Robots.txt!

Test and fine-tune your robots.txt with our easy-to-use tool.

Conclusion

The robots.txt file is a critical component in managing how search engines crawl and index a website. It provides instructions to search bots, allowing them to prioritize important pages and avoid unnecessary ones, thus optimizing the crawling budget. Correctly configuring robots.txt helps in preventing search engines from indexing duplicate, administrative, or low-quality pages, ensuring only beneficial content appears in search results. Misconfiguration, however, can lead to important pages being overlooked, negatively impacting SEO. It’s essential for webmasters to use robots.txt wisely, balancing the need to block certain pages while ensuring valuable content is accessible for indexing, thereby supporting an effective SEO strategy.

FAQ

What is the primary purpose of a robots.txt file?

Robots.txt guides search engine bots on which pages or sections of a website they should or shouldn't access, ensuring efficient indexing.

Can I use the robots.txt file to prevent specific pages from appearing in search results?

Not directly. While robots.txt can block bots from accessing pages, it doesn't guarantee non-indexing. For that, the noindex meta tag is recommended.

What happens if my website doesn't have a robots.txt file?

Without a robots.txt file, search engine bots may crawl and index all pages of your website, which can lead to SEO issues.

Is the robots.txt file case-sensitive?

Yes, directives in robots.txt are case-sensitive. For example, "Catalog" and "catalog" are considered different by search engines.

How can I comment within a robots.txt file?

Use the hash mark (#). Anything following this mark on the same line will be ignored by bots.

What is the Crawl-delay directive used for?

It limits the speed at which bots crawl a site, helping avoid server overloads.

Can I specify the main version of my website in the robots.txt file?

Yes, using the Host directive, mainly for Yandex, to point out the primary version of a site, especially useful for mirrored sites.

What is the importance of a sitemap link in the robots.txt file?

It helps search engine bots locate the sitemap of your site, aiding efficient indexing. However, it's recommended to use non-predictable sitemap names to prevent content scraping.

What's the difference between using Disallow in robots.txt and noindex in meta tags?

Disallow in robots.txt prevents bots from accessing specified pages, while noindex in meta tags advises bots not to index the page. Both serve different purposes in controlling indexing.

Why is the crawling budget crucial for SEO?

Crawling budget determines how many pages bots will crawl on a site during a visit. If irrelevant pages consume the budget, essential pages might go uncrawled and unindexed, impacting SEO.