What is robots.txt?

Key idea:

robots.txt is a plain-text file at the root of a domain (/robots.txt) that tells search-engine crawlers which URLs they may crawl and which they should skip. The format is defined by the Robots Exclusion Protocol (REP), formalized in RFC 9309 (2022). Important: robots.txt is a *recommendation*, not an access control. Well-behaved bots follow it; malicious bots ignore it. To actually prevent access, use authentication or a firewall.

Below: details, an example, and FAQ.

Details

  • User-agent: * — rules apply to every bot
  • Disallow: /admin/ — disallow a path
  • Allow: /admin/public/ — explicit allow inside a disallowed directory
  • Sitemap: https://example.com/sitemap.xml — sitemap pointer
  • Crawl-delay: 5 — request at most one page every 5 seconds (Google ignores this directive; Bing honors it, and Yandex has deprecated it)
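As a quick sketch of how these directives combine, Python's standard-library urllib.robotparser can evaluate them. Note two limitations of that parser: it applies the first matching rule in file order (not the longest match, as major crawlers do), and it does not implement the * and $ wildcards, so the Allow line is listed first here:

```python
from urllib import robotparser

# Minimal ruleset using the directives above; Allow comes first
# because urllib.robotparser applies the first matching rule.
RULES = """\
User-agent: *
Allow: /admin/public/
Disallow: /admin/
"""

parser = robotparser.RobotFileParser()
parser.parse(RULES.splitlines())

print(parser.can_fetch("MyBot", "https://example.com/admin/secret"))    # False
print(parser.can_fetch("MyBot", "https://example.com/admin/public/a"))  # True
print(parser.can_fetch("MyBot", "https://example.com/blog/post"))       # True
```

In a real crawler you would call set_url("https://example.com/robots.txt") and read() instead of parse(), letting the library fetch the live file.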

Example

User-agent: *
Disallow: /admin/
Disallow: /*.pdf$
Allow: /admin/public/
Sitemap: https://example.com/sitemap.xml

Frequently Asked Questions

What if robots.txt is unavailable?

If robots.txt returns 404 (or any other 4xx), Google and Yandex crawl the site as if there were no restrictions. But if it returns a 5xx server error, Google halts crawling of the site for 12 hours while retrying the file. So make sure /robots.txt answers with a 200 or a clean 404, never a server error.
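That handling can be summarized as a tiny decision function. This is a sketch of the behavior described above, not any vendor's actual code, and it assumes redirects have already been followed:

```python
def robots_fetch_policy(status: int) -> str:
    """Map the HTTP status of a robots.txt fetch to crawl behavior,
    following the handling described above."""
    if status == 200:
        return "parse"       # file exists: obey its rules
    if 400 <= status < 500:
        return "allow-all"   # 404 etc.: crawl with no restrictions
    return "pause"           # 5xx: halt crawling, retry the file later

print(robots_fetch_policy(200))  # parse
print(robots_fetch_policy(404))  # allow-all
print(robots_fetch_policy(503))  # pause
```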

Does robots.txt hide a page from the index?

No. Disallow blocks *crawling*, not *indexing*: if external links point to a blocked URL, it can still appear in the index without content. To keep a page out of the index, put <meta name="robots" content="noindex"> on the page itself, and note that the page must remain crawlable for bots to see that tag.

Do wildcards work?

Yes, in major crawlers: Disallow: /*.pdf$ blocks any URL whose path ends in .pdf. The * matches any run of characters and $ anchors the end of the URL. RFC 9309 formalized both wildcards.
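To illustrate the wildcard semantics, here is a sketch that translates a robots.txt path pattern into a regular expression (* becomes "any run of characters", a trailing $ pins the end of the path; the function name is mine, not part of any library):

```python
import re

def robots_pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a regex matched
    from the start of the URL path."""
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    parts = (".*" if ch == "*" else re.escape(ch) for ch in body)
    return re.compile("".join(parts) + ("$" if anchored else ""))

rule = robots_pattern_to_regex("/*.pdf$")
print(bool(rule.match("/files/report.pdf")))   # True  -> blocked
print(bool(rule.match("/files/report.pdfx")))  # False -> not blocked
print(bool(rule.match("/about.html")))         # False -> not blocked
```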