The Complete Guide to robots.txt for SEO and Crawl Control

The robots.txt file is a plain text file at your website's root that tells search engine crawlers which URLs they may and may not crawl. Despite its simplicity, misconfigurations can cause serious SEO damage: accidentally blocking your entire site from being crawled is more common than you'd think.

How robots.txt Works

When a search engine crawler visits your site, it first checks https://example.com/robots.txt. The file contains directives that specify which paths are allowed or disallowed for each crawler (user-agent). Crawlers follow these rules voluntarily — robots.txt is a protocol, not a security measure.
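A minimal sketch of how a well-behaved crawler might consult these rules, using Python's standard-library urllib.robotparser. The rules, the ExampleBot user-agent name, and the URLs are assumptions made up for illustration, and the file is parsed from a string so the example runs without network access.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules, parsed from a string instead of fetched over HTTP.
RULES = """\
User-agent: *
Disallow: /admin/
"""


def is_allowed(rules_text: str, user_agent: str, url: str) -> bool:
    """Return True if the rules permit user_agent to crawl url."""
    parser = RobotFileParser()
    parser.parse(rules_text.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(RULES, "ExampleBot", "https://example.com/admin/users"))  # False
print(is_allowed(RULES, "ExampleBot", "https://example.com/blog/post"))    # True
```

Nothing here enforces the result: a crawler that never calls a check like this can still request /admin/users, which is why robots.txt is not a security measure.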

Basic Syntax

# Allow all crawlers access to everything
User-agent: *
Allow: /

# Alternative: block all crawlers from /admin/
User-agent: *
Disallow: /admin/

# Block Googlebot from a specific directory.
# A crawler with its own group follows only that group, not the * group.
User-agent: Googlebot
Disallow: /private/

# Sitemap location
Sitemap: https://example.com/sitemap.xml

Key Directives

  • User-agent: Specifies which crawler the rules apply to. * means all crawlers.
  • Disallow: Blocks crawling of specified paths. Disallow: / blocks everything.
  • Allow: Explicitly permits crawling. Useful for overriding broader Disallow rules.
  • Sitemap: Points crawlers to your XML sitemap. Can include multiple sitemaps.
  • Crawl-delay: Requests a delay (in seconds) between requests. Not part of the robots.txt standard: Bing honors it, Google ignores it, and Yandex dropped support in 2018.
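As a rough illustration of how a client could read the Sitemap and Crawl-delay directives, here is a sketch using Python's urllib.robotparser. The rules and the ExampleBot name are assumptions; site_maps() requires Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content used only for this example.
RULES = """\
User-agent: *
Crawl-delay: 5
Disallow: /admin/

Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/news-sitemap.xml
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# Crawl-delay is read per user-agent; ExampleBot falls back to the * group.
delay = parser.crawl_delay("ExampleBot")

# Sitemap lines are independent of any user-agent group.
sitemaps = parser.site_maps()

print(delay)     # 5
print(sitemaps)  # ['https://example.com/sitemap.xml', 'https://example.com/news-sitemap.xml']
```

Whether a crawler acts on the delay value is up to that crawler; the parser only reports what the file asks for.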

Pattern Matching

Major crawlers such as Googlebot and Bingbot support two special characters in paths:

  • * matches any sequence of characters: Disallow: /*.pdf$ blocks every URL that ends in .pdf
  • $ matches the end of the URL: Disallow: /page$ blocks /page but not /page/subpage
  • Paths are case-sensitive: /Admin/ is different from /admin/
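The matching rules above can be sketched as a small translation from a robots.txt pattern to a regular expression. This is an illustrative approximation, not any search engine's actual implementation, and the function names pattern_to_regex and matches are hypothetical.

```python
import re


def pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a robots.txt path pattern into a compiled regex.

    '*' matches any run of characters and a trailing '$' anchors the end
    of the URL. Everything else is matched literally, as a prefix.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece, then join the pieces with '.*' for each '*'.
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""))


def matches(pattern: str, path: str) -> bool:
    """Return True if path (including any query string) matches pattern."""
    # re.match anchors at the start, which gives prefix semantics.
    return pattern_to_regex(pattern).match(path) is not None


print(matches("/*.pdf$", "/docs/report.pdf"))     # True
print(matches("/*.pdf$", "/docs/report.pdf?v=2")) # False: URL does not end in .pdf
print(matches("/page$", "/page/subpage"))         # False
print(matches("/Admin/", "/admin/"))              # False: case-sensitive
```

Note that Python's own urllib.robotparser does not interpret * or $ in paths, which is why this sketch uses a hand-written matcher.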

Common Patterns

Block Internal Search Results

User-agent: *
Disallow: /search
Disallow: /*?q=
Disallow: /*?s=

Block URL Parameters

User-agent: *
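# ?name= matches a first parameter, &name= matches any later one
Disallow: /*?sort=
Disallow: /*&sort=
Disallow: /*?filter=
Disallow: /*&filter=
Disallow: /*?page=
Disallow: /*&page=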

Block Development/Staging Areas

User-agent: *
Disallow: /staging/
Disallow: /dev/
Disallow: /test/

Allow CSS/JS for Rendering

User-agent: *
Allow: /assets/css/
Allow: /assets/js/
Allow: /assets/images/

robots.txt vs meta robots vs X-Robots-Tag

Method        | Scope                     | Prevents Crawling | Prevents Indexing
robots.txt    | Entire paths/directories  | Yes               | No (indirectly)
meta robots   | Individual pages          | No                | Yes (noindex)
X-Robots-Tag  | Any URL (header)          | No                | Yes (noindex)

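Important: robots.txt prevents crawling, not indexing. If other sites link to a disallowed page, search engines may still index the URL (without content). To keep a page out of the index, use a noindex meta tag or X-Robots-Tag header and leave the page crawlable: a crawler that is blocked by robots.txt never fetches the page, so it never sees the noindex.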
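To make the two noindex mechanisms concrete, here is a simplified sketch that checks a response for either signal. The has_noindex function and its inputs are hypothetical; it takes already-fetched headers and HTML, and it does a loose substring check rather than parsing crawler-specific directives.

```python
from html.parser import HTMLParser


class MetaRobotsParser(HTMLParser):
    """Collects the content of every <meta name="robots"> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            self.directives.append(attr.get("content", "").lower())


def has_noindex(headers: dict, html: str) -> bool:
    """Return True if the headers or the HTML ask not to be indexed."""
    # 1. X-Robots-Tag HTTP header (works for any file type, e.g. PDFs).
    for name, value in headers.items():
        if name.lower() == "x-robots-tag" and "noindex" in value.lower():
            return True
    # 2. <meta name="robots" content="..."> inside the HTML document.
    parser = MetaRobotsParser()
    parser.feed(html)
    return any("noindex" in directive for directive in parser.directives)


page = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
print(has_noindex({}, page))                                  # True
print(has_noindex({"X-Robots-Tag": "noindex"}, "<html></html>"))  # True
print(has_noindex({}, "<html><head></head></html>"))          # False
```

A crawler can only run a check like this on a page it is allowed to fetch, which is the reason a noindex page should not also be disallowed in robots.txt.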

Testing robots.txt

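  • Google Search Console: The robots.txt report shows the robots.txt files Google fetched and any parsing problems, and the URL Inspection tool shows whether a specific URL is blocked. The older robots.txt Tester tool has been retired.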
  • Bing Webmaster Tools: Similar testing functionality
  • Online validators: Multiple tools that parse and test your robots.txt rules
  • Browser: Simply visit https://yoursite.com/robots.txt to verify it's accessible
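Alongside these tools, a small scripted check can catch regressions before a new file goes live. The sketch below is one possible approach using Python's urllib.robotparser; the rules, URLs, and the check_expectations helper are assumptions for illustration. Keep in mind that this parser does not interpret * or $ in paths and applies rules in file order, so it may disagree with Google on files that rely on wildcards or on Allow overrides.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical candidate file, using only plain prefix rules.
RULES = """\
User-agent: *
Disallow: /admin/
Disallow: /search
"""

# (URL, should it be crawlable?) pairs that describe the intended behavior.
EXPECTATIONS = [
    ("https://example.com/", True),
    ("https://example.com/products/shoes", True),
    ("https://example.com/admin/settings", False),
    ("https://example.com/search?q=shoes", False),
]


def check_expectations(rules_text, user_agent, expectations):
    """Return the (url, expected, actual) triples that disagree."""
    parser = RobotFileParser()
    parser.parse(rules_text.splitlines())
    failures = []
    for url, expected in expectations:
        actual = parser.can_fetch(user_agent, url)
        if actual != expected:
            failures.append((url, expected, actual))
    return failures


failures = check_expectations(RULES, "ExampleBot", EXPECTATIONS)
print(failures)  # [] when every expectation holds
```

Running a check like this in a deployment pipeline is one way to notice a stray Disallow: / before crawlers do.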

Common Mistakes

  • Blocking CSS/JS: Prevents crawlers from rendering your page. Google needs these to understand your content.
  • Blocking the entire site: Disallow: / under User-agent: * blocks every crawler from every URL. Often left over from staging environments.
  • Using robots.txt for sensitive content: It's public and not a security measure. Anyone can read it.
  • Blocking sitemap.xml: Don't disallow your sitemap path.
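  • Conflicting rules: When an Allow and a Disallow rule both match a URL, Google and the current standard (RFC 9309) apply the most specific (longest) rule, but some parsers resolve conflicts differently. Keep one rule set per user-agent so the outcome is easy to predict.
  • Trailing slash confusion: Rules are prefix matches, so /admin matches /admin, /admin/, and /admin-panel. Use /admin/ for directories.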
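The "blocking the entire site" mistake is easy to detect automatically. The following is a rough heuristic sketch, not a full robots.txt parser: the blocks_entire_site function is hypothetical, and it flags any bare Disallow: / in a group that applies to all crawlers, even if Allow rules in the same group carve out exceptions.

```python
def blocks_entire_site(rules_text: str) -> bool:
    """Return True if a 'User-agent: *' group contains a bare 'Disallow: /'."""
    agents = []        # user-agents named by the group currently being read
    in_rules = False   # True once the current group has seen a rule line
    for raw in rules_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a user-agent line after rules starts a new group
                agents, in_rules = [], False
            agents.append(value)
        elif field in ("allow", "disallow"):
            in_rules = True
            if field == "disallow" and value == "/" and "*" in agents:
                return True
    return False


staging_leftover = "User-agent: *\nDisallow: /\n"
only_one_bot = "User-agent: BadBot\nDisallow: /\n\nUser-agent: *\nDisallow:\n"

print(blocks_entire_site(staging_leftover))  # True
print(blocks_entire_site(only_one_bot))      # False
```

A warning from a check like this is a prompt to look at the file, not proof of a problem, since a site may block everything on purpose.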

Best Practices

  • Keep robots.txt at the root of each host: https://example.com/robots.txt. The file applies only to the host it is served from, so each subdomain needs its own.
  • Always include a Sitemap directive
  • Test changes before deploying
  • Don't use robots.txt to hide sensitive content — use authentication instead
  • Monitor crawl errors in Search Console after changes
  • Review robots.txt quarterly — rules can become outdated

Conclusion

robots.txt is small but powerful. A well-configured file helps search engines crawl your site efficiently, focusing on valuable pages while skipping duplicates and internal tools. Always test changes, and remember: robots.txt controls crawling, not indexing.
