robots.txt and AI Bots: GPTBot, ClaudeBot, Google-Extended

Anatoly Oshmanovsky

Published: 15.06.2026 · ~4 min read

Short answer. The major AI crawlers (OpenAI's GPTBot, Anthropic's ClaudeBot, PerplexityBot, CCBot) and Google's Google-Extended token are controlled through robots.txt, just like search bots. With User-agent plus Allow/Disallow directives, you decide which bots may read your content for training and AI answers, and which are blocked. The choice depends on your strategy: openness for citations, or protection of your content.

What AI bots are and why they want your site

AI crawlers gather content for two purposes: training models and building answers in real time (RAG, AI search). Open access increases the chance your brand gets cited in ChatGPT, Claude or Perplexity. Closed access protects unique content from being used without attribution.

robots.txt is an agreement, not a technical barrier. Well-behaved bots (GPTBot, ClaudeBot) honor it. For hard blocking you need server-side rules or a WAF.
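To make the difference between an agreement and a barrier concrete, here is a minimal sketch of a server-side check in Python. The blocklist contents and the function name status_for_request are assumptions for illustration, not part of any framework. User-Agent headers can be spoofed, so a real WAF would typically combine this with IP-based verification.

```python
import re

# Hypothetical blocklist: user-agent tokens to refuse at the server level.
BLOCKED_AGENT_TOKENS = ("GPTBot", "CCBot")

# One case-insensitive pattern that matches any blocked token.
_BLOCKED_RE = re.compile(
    "|".join(re.escape(token) for token in BLOCKED_AGENT_TOKENS),
    re.IGNORECASE,
)


def status_for_request(user_agent_header: str) -> int:
    """Return 403 for a blocked agent and 200 otherwise."""
    if user_agent_header and _BLOCKED_RE.search(user_agent_header):
        return 403
    return 200
```

A request handler or middleware could call status_for_request with the incoming User-Agent header and refuse the request when it returns 403, regardless of what robots.txt says.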

Table: major AI crawlers

User-agent	Who	Purpose
GPTBot	OpenAI	Training GPT models
OAI-SearchBot	OpenAI	Indexing for ChatGPT Search
ClaudeBot	Anthropic	Training and indexing for Claude
PerplexityBot	Perplexity	AI search and answers
Google-Extended	Google	Gemini training (does not affect Search)
CCBot	Common Crawl	Open dataset used by many models

Example: open everything to AI bots

If your strategy is maximum citability, allow crawlers access to all public content:

User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: CCBot
Allow: /

Sitemap: https://enterno.io/sitemap.xml

Don't forget the Sitemap: line. It points bots to the list of pages you want discovered.
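As a quick sanity check, the open policy can be parsed with Python's standard urllib.robotparser module. This is a sketch only: the shortened rule set and the /blog/ test URL are assumptions, and Python's parser is just one implementation, so real crawlers may interpret edge cases differently.

```python
from urllib.robotparser import RobotFileParser

# Shortened version of the open policy above (two bots instead of five).
OPEN_POLICY = """\
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

Sitemap: https://enterno.io/sitemap.xml
"""


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


open_rules = load_rules(OPEN_POLICY)

# Ask whether each bot may fetch a sample URL (hypothetical path).
allowed = {
    bot: open_rules.can_fetch(bot, "https://enterno.io/blog/")
    for bot in ("GPTBot", "ClaudeBot")
}

# site_maps() returns the Sitemap entries found in the file (Python 3.8+).
sitemaps = open_rules.site_maps()
```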

Example: block training, allow search

A common strategy: close training crawlers but keep live AI-search bots open so you still get cited.

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

Sitemap: https://enterno.io/sitemap.xml

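Here GPTBot, Google-Extended and CCBot (training) are blocked, while OAI-SearchBot and PerplexityBot (live search) are allowed. Bots that have no group of their own in this file, such as ClaudeBot, stay allowed by default.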
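The effect of this policy can be checked offline with Python's standard urllib.robotparser. The sketch below is illustrative: access_report is a hypothetical helper name and the /blog/ URL is an assumed test path. Python's parser matches user agents by substring and applies the first matching rule, which may differ from how a particular crawler resolves conflicts.

```python
from urllib.robotparser import RobotFileParser

POLICY = """\
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /
"""


def access_report(robots_txt: str, agents: list[str], url: str) -> dict[str, bool]:
    """Map each user agent to True (may fetch url) or False (blocked)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


report = access_report(
    POLICY,
    ["GPTBot", "Google-Extended", "CCBot", "OAI-SearchBot", "PerplexityBot", "ClaudeBot"],
    "https://enterno.io/blog/",  # hypothetical test URL
)
```

ClaudeBot is included on purpose: it has no group in the file, so the report shows it as allowed.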

The Content-Signal directive

A newer initiative, introduced by Cloudflare as the Content Signals Policy, is the Content-Signal directive, which declares permitted uses of your content: search, AI training, and AI input. It is more granular than a blunt Disallow.

User-agent: *
Content-Signal: search=yes, ai-train=no, ai-input=yes
Allow: /

In this example search is allowed, model training is not, and use as context for an AI answer (ai-input) is. Support depends on the bot.
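Standard robots.txt parsers generally skip directives they do not recognize, so reading the signal yourself takes a few lines. The following is a minimal sketch that assumes the comma-separated key=yes/no syntax shown above; parse_content_signal is a hypothetical helper, not an official parser.

```python
def parse_content_signal(line: str) -> dict[str, bool]:
    """Turn 'Content-Signal: search=yes, ai-train=no' into a dict of booleans."""
    name, _, value = line.partition(":")
    if name.strip().lower() != "content-signal":
        raise ValueError("not a Content-Signal line")

    signals: dict[str, bool] = {}
    for part in value.split(","):
        key, sep, flag = part.partition("=")
        key, flag = key.strip().lower(), flag.strip().lower()
        # Skip malformed pairs instead of guessing their meaning.
        if not sep or not key or flag not in ("yes", "no"):
            continue
        signals[key] = flag == "yes"
    return signals


signals = parse_content_signal("Content-Signal: search=yes, ai-train=no, ai-input=yes")
```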

Don't block useful paths by accident. Closing your API documentation or /assets/ to AI bots is unnecessary, and a stray Disallow: / inside a wildcard User-agent: * block locks out every bot that has no group of its own.
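The wildcard pitfall is easy to reproduce with Python's standard urllib.robotparser. In this sketch the rule set and the /blog/ URL are assumptions chosen for illustration: one wildcard group blocks everything, and only GPTBot has a group of its own.

```python
from urllib.robotparser import RobotFileParser

RISKY_POLICY = """\
User-agent: *
Disallow: /

User-agent: GPTBot
Allow: /
"""

parser = RobotFileParser()
parser.parse(RISKY_POLICY.splitlines())

url = "https://enterno.io/blog/"  # hypothetical test URL

# GPTBot matches its own group, so the wildcard block does not apply to it.
gptbot_allowed = parser.can_fetch("GPTBot", url)

# ClaudeBot and Googlebot have no group, so they fall back to the wildcard.
claudebot_allowed = parser.can_fetch("ClaudeBot", url)
googlebot_allowed = parser.can_fetch("Googlebot", url)
```

Note that the search crawler is locked out together with the unlisted AI bots, which is rarely the intent.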

What to always block

Private areas: /admin/, /dashboard/, login pages.
Utility paths: internal APIs, cart, parameterized search.
Duplicates: pages with UTM tags and session parameters.

These rules are the same for search and AI bots. We cover robots.txt fundamentals in the robots.txt guide.
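To find duplicate URLs worth excluding, a crawl log or URL list can be screened for tracking and session parameters. This is a rough sketch: the SESSION_PARAMS names are hypothetical examples, so adjust them to whatever your platform actually emits.

```python
from urllib.parse import parse_qsl, urlsplit

# Hypothetical session parameter names; replace with the ones your site uses.
SESSION_PARAMS = {"sessionid", "sid", "phpsessid"}


def is_tracking_duplicate(url: str) -> bool:
    """Return True if the URL carries UTM tags or a session parameter."""
    query = urlsplit(url).query
    for name, _value in parse_qsl(query, keep_blank_values=True):
        lowered = name.lower()
        if lowered.startswith("utm_") or lowered in SESSION_PARAMS:
            return True
    return False
```

URLs flagged this way are candidates for a Disallow rule or a canonical tag; the function itself only classifies them.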

How to validate and strengthen

After editing robots.txt, check the syntax and make sure the bots you want aren't blocked by accident. Complement the file with a content map — see the llms.txt guide — and a correct sitemap.xml. A free tool can assess your site's readiness for AI crawlers end to end.
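A basic syntax check can be scripted. The sketch below flags three common mistakes: a missing colon, a misspelled directive, and a rule placed before any User-agent line. The KNOWN_DIRECTIVES set is an assumption that covers the directives used in this article plus Crawl-delay; it is not an exhaustive validator.

```python
# Directives this sketch treats as valid (assumed list, not exhaustive).
KNOWN_DIRECTIVES = {
    "user-agent", "allow", "disallow", "sitemap", "crawl-delay", "content-signal",
}


def lint_robots(text: str) -> list[str]:
    """Return human-readable problems found in robots.txt text."""
    problems: list[str] = []
    seen_agent = False
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        name, sep, _value = line.partition(":")
        name = name.strip().lower()
        if not sep:
            problems.append(f"line {number}: missing ':'")
        elif name not in KNOWN_DIRECTIVES:
            problems.append(f"line {number}: unknown directive '{name}'")
        elif name == "user-agent":
            seen_agent = True
        elif name in ("allow", "disallow") and not seen_agent:
            problems.append(f"line {number}: rule before any User-agent")
    return problems
```

An empty list means none of these three mistakes were found; it does not prove that the intended bots are actually allowed.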

FAQ

Do AI bots obey robots.txt?

The major well-behaved crawlers (GPTBot, ClaudeBot, PerplexityBot, CCBot) do, and Google honors the Google-Extended token. To reliably block bad actors you need server-side rules.

Will Google-Extended block my normal search?

No. Google-Extended only governs use of your content for Gemini and has no effect on Google Search indexing, which is handled by a separate crawler, Googlebot.

Should I block all AI bots?

It depends on your goals. Blocking protects content but costs you citations in AI answers, a growing channel for traffic and brand awareness.

What about CCBot?

CCBot builds Common Crawl, an open dataset many models train on. Whether to allow it depends on your policy toward training data.

Does Content-Signal work today?

It's an evolving initiative; support depends on the bot. Adding the directive is safe — bots that don't support it simply ignore it.
