What is robots.txt?

Key idea:

robots.txt is a plain-text file at the root of a domain (/robots.txt) that tells search-engine crawlers which URLs they may crawl and which they should skip. The format is defined by the Robots Exclusion Protocol (REP), formalized in RFC 9309 (2022). Important: robots.txt is a *recommendation*, not an access control. Well-behaved bots follow it; malicious bots ignore it. To actually prevent access, use authentication or a firewall.

Below: details, an example, and FAQ.

Details

  • User-agent: * — rules apply to every bot
  • Disallow: /admin/ — disallow a path
  • Allow: /admin/public/ — explicit allow inside a disallowed directory
  • Sitemap: https://example.com/sitemap.xml — sitemap pointer
  • Crawl-delay: 5 — request at most one page every 5 seconds (Google ignores this directive; Bing honors it, and Yandex has deprecated it)
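As a quick sketch of how these directives combine, Python's standard-library urllib.robotparser can evaluate them. Note two limitations of that parser: it applies the first matching rule in file order (not the longest match, as major crawlers do), and it does not implement the * and $ wildcards, so the Allow line is listed first here:

```python
from urllib import robotparser

# Minimal ruleset using the directives above; Allow comes first
# because urllib.robotparser applies the first matching rule.
RULES = """\
User-agent: *
Allow: /admin/public/
Disallow: /admin/
"""

parser = robotparser.RobotFileParser()
parser.parse(RULES.splitlines())

print(parser.can_fetch("MyBot", "https://example.com/admin/secret"))    # False
print(parser.can_fetch("MyBot", "https://example.com/admin/public/a"))  # True
print(parser.can_fetch("MyBot", "https://example.com/blog/post"))       # True
```

In a real crawler you would call set_url("https://example.com/robots.txt") and read() instead of parse(), letting the library fetch the live file.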

Example

User-agent: *
Disallow: /admin/
Disallow: /*.pdf$
Allow: /admin/public/
Sitemap: https://example.com/sitemap.xml

Frequently Asked Questions

What if robots.txt is unavailable?

If robots.txt returns 404 (or any other 4xx), Google and Yandex crawl the site as if there were no restrictions. But if it returns a 5xx server error, Google halts crawling of the site for 12 hours while retrying the file. So make sure /robots.txt answers with a 200 or a clean 404, never a server error.
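That handling can be summarized as a tiny decision function. This is a sketch of the behavior described above, not any vendor's actual code, and it assumes redirects have already been followed:

```python
def robots_fetch_policy(status: int) -> str:
    """Map the HTTP status of a robots.txt fetch to crawl behavior,
    following the handling described above."""
    if status == 200:
        return "parse"       # file exists: obey its rules
    if 400 <= status < 500:
        return "allow-all"   # 404 etc.: crawl with no restrictions
    return "pause"           # 5xx: halt crawling, retry the file later

print(robots_fetch_policy(200))  # parse
print(robots_fetch_policy(404))  # allow-all
print(robots_fetch_policy(503))  # pause
```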

Does robots.txt hide a page from the index?

No. Disallow blocks *crawling*, not *indexing*: if external links point to a blocked URL, it can still appear in the index without content. To keep a page out of the index, put <meta name="robots" content="noindex"> on the page itself, and note that the page must remain crawlable for bots to see that tag.

Do wildcards work?

Yes, in major crawlers: Disallow: /*.pdf$ blocks any URL whose path ends in .pdf. The * matches any run of characters and $ anchors the end of the URL. RFC 9309 formalized both wildcards.
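To illustrate the wildcard semantics, here is a sketch that translates a robots.txt path pattern into a regular expression (* becomes "any run of characters", a trailing $ pins the end of the path; the function name is mine, not part of any library):

```python
import re

def robots_pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a regex matched
    from the start of the URL path."""
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    parts = (".*" if ch == "*" else re.escape(ch) for ch in body)
    return re.compile("".join(parts) + ("$" if anchored else ""))

rule = robots_pattern_to_regex("/*.pdf$")
print(bool(rule.match("/files/report.pdf")))   # True  -> blocked
print(bool(rule.match("/files/report.pdfx")))  # False -> not blocked
print(bool(rule.match("/about.html")))         # False -> not blocked
```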