What is robots.txt

Key idea:

robots.txt is a plain-text file at the domain root (/robots.txt) that tells crawlers which URLs they may crawl and which to skip. It implements the Robots Exclusion Protocol (REP), formalized in RFC 9309 (2022). Important: robots.txt is a *recommendation*, not a hard block. Well-behaved crawlers honor it; malicious bots ignore it. To actually prevent access, use authentication or a firewall.

Below: details, example, related terms, FAQ.

Details

  • User-agent: * — rules apply to every bot
  • Disallow: /admin/ — disallow a path
  • Allow: /admin/public/ — explicit allow inside a disallowed directory
  • Sitemap: https://example.com/sitemap.xml — sitemap pointer
  • Crawl-delay: 5 — at most 1 request per 5 seconds; a non-standard directive that is not part of RFC 9309 (Google ignores it, Bing honors it, Yandex stopped supporting it in 2018)
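
To see how a client might read these directives, here is a minimal sketch using Python's standard urllib.robotparser module. The robots.txt content, the ExampleBot user agent, and the summarize helper are assumptions made for illustration. Two caveats: this parser applies the first matching rule in file order and does not interpret wildcards, so Allow is listed before Disallow here, and site_maps() needs Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt assembled from the directives listed above.
# Allow comes before Disallow because this parser uses the first rule
# that matches, in file order.
ROBOTS_TXT = """\
User-agent: *
Allow: /admin/public/
Disallow: /admin/
Crawl-delay: 5
Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def summarize(agent: str = "ExampleBot") -> dict:
    """Collect what the parsed rules say for one (hypothetical) user agent."""
    base = "https://example.com"
    return {
        "admin": parser.can_fetch(agent, base + "/admin/"),
        "admin_public": parser.can_fetch(agent, base + "/admin/public/"),
        "blog": parser.can_fetch(agent, base + "/blog/"),
        "crawl_delay": parser.crawl_delay(agent),
        "sitemaps": parser.site_maps(),
    }


if __name__ == "__main__":
    print(summarize())
```

A crawler that honors Crawl-delay would sleep for the returned number of seconds between requests; one that ignores it would simply skip that field.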

Example

User-agent: *
Disallow: /admin/
Disallow: /*.pdf$
Allow: /admin/public/
Sitemap: https://example.com/sitemap.xml
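
How a crawler might evaluate the example above can be sketched in Python. This is a simplified illustration of the longest-match rule described in RFC 9309 (the most specific matching pattern wins, and Allow wins a tie). It is not any search engine's actual implementation. The function names pattern_to_regex and is_allowed are made up for this sketch, pattern length is counted in characters, and percent-encoding normalization is left out.

```python
import re


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a regex.

    '*' matches any run of characters; a trailing '$' anchors the end of the URL.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape literal parts and join them with '.*' where the pattern had '*'.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


# (directive, pattern) pairs from the example, all under "User-agent: *".
RULES = [
    ("disallow", "/admin/"),
    ("disallow", "/*.pdf$"),
    ("allow", "/admin/public/"),
]


def is_allowed(path: str, rules=RULES) -> bool:
    """Return True if `path` may be crawled under longest-match evaluation."""
    best_len, best_allow = -1, True  # no matching rule means allowed
    for directive, pattern in rules:
        # re.match anchors at the start, so patterns act as path prefixes.
        if pattern_to_regex(pattern).match(path):
            allow = directive == "allow"
            # The longer pattern wins; on equal length, prefer allow.
            if len(pattern) > best_len or (len(pattern) == best_len and allow):
                best_len, best_allow = len(pattern), allow
    return best_allow


if __name__ == "__main__":
    for p in ["/index.html", "/admin/settings", "/admin/public/info.html",
              "/docs/report.pdf", "/docs/report.pdf?download=1"]:
        print(p, is_allowed(p))
```

Under these assumptions, /admin/public/info.html is allowed because the Allow pattern is longer than /admin/, and /docs/report.pdf?download=1 is allowed because the $ anchor requires the URL to end in .pdf.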

Related Terms

Understanding the Syntax of robots.txt

Common Misconfigurations in robots.txt

Testing and Validating Your robots.txt File

Frequently Asked Questions

What if robots.txt is unavailable?

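If robots.txt is missing (404), Google and Yandex crawl as if there were no restrictions. If it returns a 5xx error, Google halts crawling of the site for 12 hours. Serve either a 200 with a valid file or a 404, never a server error.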
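
The general rule behind this can be sketched as a small Python function. It follows the status-code handling described in RFC 9309 (4xx means the file is unavailable and crawling may proceed; 5xx means it is unreachable and the crawler must assume a complete disallow). The function name and the returned labels are assumptions for illustration, and individual crawlers differ in the details, such as how long they wait before retrying.

```python
def crawl_policy_for_status(status: int) -> str:
    """Map the HTTP status of a /robots.txt fetch to a crawl policy.

    Returns one of "parse", "follow_redirect", "allow_all", "disallow_all".
    """
    if 200 <= status < 300:
        return "parse"            # read the rules from the response body
    if 300 <= status < 400:
        return "follow_redirect"  # fetch the redirect target instead
    if 400 <= status < 500:
        return "allow_all"        # file unavailable: no restrictions apply
    if 500 <= status < 600:
        return "disallow_all"     # file unreachable: assume everything is blocked
    raise ValueError(f"unexpected HTTP status: {status}")


if __name__ == "__main__":
    for code in (200, 301, 404, 503):
        print(code, crawl_policy_for_status(code))
```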

Does robots.txt hide a page from the index?

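No. Disallow blocks *crawling*, not *indexing*: an external link can still put the URL in the index, just without its content. To keep a page out of the index, use a noindex robots meta tag, and leave the page crawlable so bots can actually see the tag.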

Do wildcards work?

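Yes, in major bots. For example, <code>Disallow: /*.pdf$</code> blocks every URL that ends in .pdf: <code>*</code> matches any sequence of characters and <code>$</code> anchors the end of the URL. RFC 9309 formalized both.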
