Short answer. AI crawlers are bots like GPTBot, ClaudeBot and PerplexityBot that fetch pages, extract text, and use it for training and for real-time answers. They handle heavy JavaScript poorly and prefer clean HTML, semantic markup and structured data. To be read correctly you need robots.txt access, server-side rendering of key content, and extractable answer paragraphs.
Who AI crawlers are
They are automated agents that download HTML pages for two purposes: building training data and retrieving fresh information to answer a user's query (retrieval). Their behavior resembles that of search bots, but the priorities differ: they care about how easily facts can be extracted, not about ranking position.
| Crawler | Operator | Purpose |
|---|---|---|
| GPTBot | OpenAI | Training data (ChatGPT search uses the separate OAI-SearchBot) |
| ClaudeBot | Anthropic | Training and Claude answers |
| PerplexityBot | Perplexity | Cited answers |
| Google-Extended | Google | robots.txt token that controls use of content for Gemini (not a separate crawler) |
What blocks AI crawlers
- JavaScript-only content. Many AI bots don't execute JS fully. Client-rendered text may go unseen.
- Blocking in robots.txt. Accidentally disallowing AI bots costs you citations. Manage it deliberately — see robots.txt for AI crawlers.
- Text inside images. Images with text but no alt and no HTML equivalent are invisible to machines.
- Infinite pagination and JS navigation without real links make crawling hard.
- Heavy pages and timeouts — a crawler may abort the fetch.
Simple rule: if the content is visible with JavaScript disabled and reachable via a normal link, almost any AI crawler can read it.
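The "reachable via a normal link" part of the rule can be checked mechanically. The sketch below is a minimal illustration using only the Python standard library; the `LinkCollector` class, the `crawlable_links` function, and the sample markup are hypothetical names invented for this example. It lists the links a client that does not run scripts could follow.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags in raw HTML."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href.strip())


def crawlable_links(html, base_url):
    """Return absolute URLs that a non-JS client could follow from this HTML."""
    collector = LinkCollector()
    collector.feed(html)
    links = []
    for href in collector.hrefs:
        # Fragment-only and javascript: pseudo-links lead nowhere
        # for a client that does not execute scripts.
        if href.startswith("#") or href.lower().startswith("javascript:"):
            continue
        links.append(urljoin(base_url, href))
    return links


# Hypothetical navigation markup: only the first item is a real link.
sample = """
<nav>
  <a href="/pricing">Pricing</a>
  <a href="javascript:void(0)" onclick="go('docs')">Docs</a>
  <a href="#top">Top</a>
  <span onclick="go('blog')">Blog</span>
</nav>
"""
print(crawlable_links(sample, "https://example.com/"))
# -> ['https://example.com/pricing']
```

Sections that never show up in this list are likely reachable only through JS navigation, which is the case described in the list above.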
How to check what the bot sees
Open the page with JavaScript disabled, or inspect the raw HTML (View Source, not the DOM inspector). Whatever is there is what the crawler sees. It's also worth checking the server response and headers.
curl -A "GPTBot" -s https://example.com/ | head -n 40
This requests the page with GPTBot as the user-agent string and prints the first 40 lines of HTML, so you see roughly what the bot receives.
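The same check can be automated for a list of key phrases. The following is a rough sketch, not a full crawler simulation: it assumes the raw HTML is already available as a string (for example, saved from the curl command), and `VisibleText`, `missing_phrases`, and the sample page are hypothetical names for this example. Text inside `<script>`, `<style>`, and `<template>` is ignored, because a client that does not run JavaScript never turns it into page content.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collects text that is present in raw HTML outside script-like tags."""

    SKIP = {"script", "style", "template"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def missing_phrases(html, phrases):
    """Return the phrases that do not appear in the server-delivered text."""
    parser = VisibleText()
    parser.feed(html)
    text = " ".join(parser.chunks).lower()
    return [p for p in phrases if p.lower() not in text]


# Hypothetical page: the heading is in the HTML, the price is injected by JS.
raw_html = """
<html><body>
  <h1>Pricing</h1>
  <div id="app"></div>
  <script>document.getElementById("app").textContent = "Plans start at $10";</script>
</body></html>
"""
print(missing_phrases(raw_html, ["Pricing", "Plans start at $10"]))
# -> ['Plans start at $10']
```

Any phrase the function returns exists only after client-side rendering and is a candidate for moving into server-rendered HTML.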
How to make content extractable
- Server-side rendering (SSR/SSG) of key text is the main requirement.
- Semantic HTML: <h1>–<h3>, <p>, <ul>, and tables instead of a wall of <div>.
- Direct answers at the start of a section: extractable paragraphs.
- Structured data via Schema.org — see structured data.
- A clean sitemap and logical internal linking help discover every page.
- An llms.txt file as a map of priority content — see the llms.txt guide.
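To confirm that structured data is actually present in the server-delivered HTML, a small script can list the Schema.org types declared in JSON-LD blocks. This is a minimal sketch using only the standard library; `JsonLdCollector`, `schema_types`, and the sample page are hypothetical names for this example. It does not unpack `@graph` containers and does not cover microdata or RDFa.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collects the raw contents of <script type="application/ld+json"> tags."""

    def __init__(self):
        super().__init__()
        self.in_jsonld = False
        self.buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self.in_jsonld = True
            self.buffer = []

    def handle_data(self, data):
        if self.in_jsonld:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self.in_jsonld:
            self.in_jsonld = False
            self.blocks.append("".join(self.buffer))


def schema_types(html):
    """Return the @type of each top-level JSON-LD item found in the HTML."""
    collector = JsonLdCollector()
    collector.feed(html)
    types = []
    for block in collector.blocks:
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            # Broken JSON-LD is worth flagging: parsers will likely skip it.
            types.append("<invalid JSON>")
            continue
        items = data if isinstance(data, list) else [data]
        for item in items:
            if isinstance(item, dict):
                types.append(item.get("@type", "<no @type>"))
    return types


# Hypothetical page head with one JSON-LD block.
page = """
<head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "FAQPage", "mainEntity": []}
</script>
</head>
"""
print(schema_types(page))
# -> ['FAQPage']
```

An empty result for a page that is supposed to carry markup usually means the JSON-LD is injected client-side and is not in the raw HTML.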
Controlling access without losing citations
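Sometimes you want to disallow training but allow answers. The baseline below does not do that yet: it opens the whole site to every bot, and restrictions are then added per bot as Disallow rules. A basic robots.txt example: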
User-agent: GPTBot
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: *
Allow: /
Here the main AI bots get access to the whole site. Detailed allow/disallow scenarios are covered in the general robots.txt guide.
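Before deploying a stricter variant, it helps to test how the rules resolve for each bot. The sketch below uses Python's standard `urllib.robotparser`. The rule set is an illustrative assumption, not a recommendation: it treats GPTBot as the training crawler to block and PerplexityBot as the answer crawler to allow. `access_report` is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser

# Assumed policy for illustration: block one bot, leave the rest open.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: PerplexityBot
Allow: /

User-agent: *
Allow: /
"""


def access_report(robots_txt, agents, url):
    """Map each user-agent name to whether it may fetch the given URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


report = access_report(
    ROBOTS_TXT,
    ["GPTBot", "PerplexityBot", "ClaudeBot"],
    "https://example.com/guide",
)
print(report)
# -> {'GPTBot': False, 'PerplexityBot': True, 'ClaudeBot': True}
```

ClaudeBot has no group of its own here, so it falls through to the `*` group. Note that `urllib.robotparser` applies the first matching rule within a group, which can differ from crawlers that pick the most specific path, so treat the report as a sanity check rather than a guarantee of how every bot will behave.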
FAQ
Do AI crawlers run JavaScript?
Partially and unpredictably. Don't rely on it — render key content in HTML server-side.
Should I block AI bots?
It depends on goals. Blocking protects against training use but also removes citations in answers. Decide deliberately per bot.
Does load speed matter?
Yes. Slow pages raise the timeout risk. Basic speed optimization helps both AI bots and users.
How do I know if GPTBot visited?
Check server logs by user-agent (GPTBot, ClaudeBot, PerplexityBot) — this shows real crawler activity.
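A log check of this kind can be sketched in a few lines. The example assumes the common "combined" access log format, where the user-agent is the last double-quoted field; `count_bot_hits` and the sample lines (including their simplified user-agent strings) are hypothetical and only illustrate the matching.

```python
import re
from collections import Counter

AI_BOTS = ("GPTBot", "ClaudeBot", "PerplexityBot")

# In combined log format the user-agent is the last double-quoted field.
UA_FIELD = re.compile(r'"([^"]*)"\s*$')


def count_bot_hits(log_lines):
    """Count requests per AI bot by substring match on the user-agent field."""
    hits = Counter()
    for line in log_lines:
        match = UA_FIELD.search(line)
        if not match:
            continue
        user_agent = match.group(1).lower()
        for bot in AI_BOTS:
            if bot.lower() in user_agent:
                hits[bot] += 1
    return hits


# Hypothetical log lines; real user-agent strings are longer.
sample_log = [
    '203.0.113.5 - - [10/Jan/2025:12:00:01 +0000] "GET /guide HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.5 - - [10/Jan/2025:12:00:09 +0000] "GET /faq HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '198.51.100.7 - - [10/Jan/2025:12:01:30 +0000] "GET / HTTP/1.1" 200 9000 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
    '192.0.2.44 - - [10/Jan/2025:12:02:00 +0000] "GET / HTTP/1.1" 200 9000 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"',
]
hits = count_bot_hits(sample_log)
print(hits["GPTBot"], hits["ClaudeBot"], hits["PerplexityBot"])
# -> 2 1 0
```

User-agent strings can be spoofed, so counts from this kind of match show claimed bot traffic; confirming the source requires comparing request IPs against the ranges the operator publishes.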