AI Bot Crawlers: The Complete Reference for 2026

By Ali Khallad · Updated June 28, 2026 · 7 min read
[Figure: the three jobs an AI crawler can do. Training: GPTBot, ClaudeBot, CCBot. Search index: OAI-SearchBot, PerplexityBot, Googlebot; blocking one removes you from cited AI answers. Live fetch: ChatGPT-User, Claude-User, which often ignore robots.txt.]

Most robots.txt advice treats AI crawlers as one switch: allow the AI bots or block them. That framing is what gets brands quietly dropped from AI answers. An AI bot can be doing one of three different jobs, and the rule you write for one job has nothing to do with the others. Block the wrong one and you stay in model training but vanish from the search index an assistant cites; block a different one and the opposite happens. This is the reference for telling them apart, with the exact robots.txt tokens and what each block actually costs you.

How AI bots differ from classic search bots

A classic search crawler like Googlebot has one job: build the index that ranks your pages. AI changed that into three separate jobs, and the vendors split them across separate user agents on purpose so you can allow one and refuse another. The three jobs are training the model, building the search index the assistant cites at answer time, and fetching a single page live when a user asks about it. A page can be welcome for one and blocked for another, which is the entire point of keeping them apart.

Every token below comes from the vendor’s own documentation: OpenAI, Anthropic, Perplexity, Google, and Apple. We also ran a server-side probe to see what several of these agents actually send when they fetch a page, which is written up in what nginx logs reveal about AI traffic.

The three jobs an AI bot can have

1. Training crawlers

These collect content to train future models. Blocking them keeps your pages out of the next training run. It does not affect whether you get cited in AI search today, because that is a different system. This is the group to think about if your concern is your content being used to train a model, not your visibility in answers.

robots.txt token | Owner | If you block it
--- | --- | ---
GPTBot | OpenAI | Your content is excluded from future model training; AI search citations are unaffected
ClaudeBot | Anthropic | Signals your future content should be excluded from Claude training data
CCBot | Common Crawl | You leave the open crawl corpus that many AI training datasets are built from
Bytespider | ByteDance | Your content is excluded from ByteDance model training
Google-Extended | Google | Control token only: opts your content out of Gemini training and grounding. It does not crawl, and it does not affect Google Search or AI Overviews
Applebot-Extended | Apple | Control token only: opts your content out of Apple foundation-model training. Applebot can still crawl you for Siri and Spotlight

Two of these are worth a closer look. Google-Extended and Applebot-Extended are not crawlers at all. They are control tokens that govern how already-crawled data is used, so disallowing them changes training and grounding behavior without removing you from search. Per Google’s documentation, Google-Extended covers Gemini training and grounding and has no effect on your inclusion in Google Search, which means it is the wrong tool if your goal is to opt out of AI Overviews.
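As a rough illustration of a training-only opt-out, the sketch below builds one robots.txt group from the tokens in the table above and checks it with Python's standard-library parser. The helper names and the example.com URL are hypothetical. Python's `urllib.robotparser` matches user agents by substring, which may not be how each vendor's crawler interprets the file, so treat the result as a sanity check rather than proof.

```python
from urllib.robotparser import RobotFileParser

# Training crawlers and control tokens, taken from the table above.
TRAINING_TOKENS = [
    "GPTBot", "ClaudeBot", "CCBot", "Bytespider",
    "Google-Extended", "Applebot-Extended",
]

def training_opt_out_rules(tokens=TRAINING_TOKENS):
    """Return robots.txt text that disallows only the training tokens."""
    lines = [f"User-agent: {token}" for token in tokens]
    lines.append("Disallow: /")
    # Everyone else, including search-index crawlers, stays allowed.
    lines += ["", "User-agent: *", "Allow: /"]
    return "\n".join(lines) + "\n"

def is_allowed(robots_txt, token, url="https://example.com/pricing"):
    """Ask the stdlib parser whether `token` may fetch `url` (a placeholder URL)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(token, url)

rules = training_opt_out_rules()
for token in ["GPTBot", "Google-Extended", "OAI-SearchBot", "Googlebot"]:
    print(token, "allowed" if is_allowed(rules, token) else "blocked")
```

Under these rules the training tokens are refused while OAI-SearchBot and Googlebot fall through to the catch-all group and remain allowed.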

2. Search-index crawlers

These build the index an assistant pulls from when it cites sources. Blocking one of these is how a brand disappears from cited AI answers without realizing it. If you want to be recommended in AI search, these are the crawlers you must let through.

robots.txt token | Owner | If you block it
--- | --- | ---
OAI-SearchBot | OpenAI | You can drop out of the sources ChatGPT search cites
Claude-SearchBot | Anthropic | Your content is not indexed for Claude’s search results
PerplexityBot | Perplexity | You stop appearing as a linked source in Perplexity answers
Googlebot | Google | You leave Google Search, and with it AI Overviews and AI Mode, which draw on the same index
Applebot | Apple | You disappear from Siri, Spotlight, and Safari search
Bingbot | Microsoft | You leave the Bing index that also feeds ChatGPT search and Copilot answers

The Googlebot row carries the nuance most people miss. Google AI Overviews and AI Mode are built on the main Search index, not on a separate AI crawler, so there is no AI-only token to block or allow for them. If Googlebot can reach you, you are eligible for AI Overviews; if it cannot, you are not, and Google-Extended does nothing to change that either way.

3. On-demand user fetchers

These fetch a single page live, at the moment a user asks the assistant about it or pastes your link. They act on behalf of a person, so several of them ignore robots.txt by design. Blocking them is possible, but it stops the assistant from quoting your page when a real user is asking about you.

robots.txt token | Owner | Behavior
--- | --- | ---
ChatGPT-User | OpenAI | Fetches a page live when a ChatGPT user follows or asks about a link
Claude-User | Anthropic | Retrieves your page in response to a Claude user’s question
Perplexity-User | Perplexity | Fetches your page to answer a user; per Perplexity’s docs it generally ignores robots.txt

Because these are user-directed, robots.txt is not a reliable way to stop them. If you genuinely need to block a user fetcher, you do it at the firewall or by IP, and you accept that you are reducing your own visibility when a user is actively trying to learn about your brand. When these agents do fetch, each one leaves a different fingerprint in your server logs, which we mapped assistant by assistant in our nginx probe. A few more bots are worth knowing by name across these buckets: Amazon’s Amazonbot, Meta’s training and product-indexing crawler Meta-ExternalAgent, and Meta’s user-directed fetcher Meta-ExternalFetcher, which may bypass robots.txt the same way the others here do.
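When reading server logs, it can help to map each request's user-agent string back to one of the three jobs. The sketch below does that with a simple case-insensitive substring match against the tokens from the three tables. The function name is hypothetical and the sample strings are illustrative, not captured traffic. Because a user-agent is only a label, this groups requests by what they claim to be, not by who actually sent them.

```python
# Token-to-job mapping built from the three tables in this article.
JOB_BY_TOKEN = {
    "GPTBot": "training",
    "ClaudeBot": "training",
    "CCBot": "training",
    "Bytespider": "training",
    "OAI-SearchBot": "search-index",
    "Claude-SearchBot": "search-index",
    "PerplexityBot": "search-index",
    "Googlebot": "search-index",
    "Applebot": "search-index",
    "Bingbot": "search-index",
    "ChatGPT-User": "user-fetch",
    "Claude-User": "user-fetch",
    "Perplexity-User": "user-fetch",
}

def classify_user_agent(user_agent):
    """Return (token, job) for the first known token found in the string, else None."""
    lowered = user_agent.lower()
    for token, job in JOB_BY_TOKEN.items():
        if token.lower() in lowered:
            return token, job
    return None

# Illustrative user-agent strings (assumed shapes, not real log lines).
samples = [
    "Mozilla/5.0 (compatible; ChatGPT-User/1.0; +https://openai.com/bot)",
    "Mozilla/5.0 (compatible; PerplexityBot/1.0)",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Firefox/126.0",
]
for sample in samples:
    print(classify_user_agent(sample))
```

Google-Extended and Applebot-Extended are left out on purpose: they are control tokens, so they never appear as a requesting agent.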

The misconfiguration that quietly removes you from AI answers

The common mistake is a single line copied from a “block AI bots” snippet that disallows the search-index crawler along with the training one. A site decides it does not want to feed model training, reaches for a blanket block, and takes OAI-SearchBot or PerplexityBot down with GPTBot. Training and citation are different decisions. You can refuse training and still want every search crawler through, because being in the training set and being a citable source in a live answer are not the same thing and rarely call for the same rule.
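The difference between a blanket block and a per-token rule can be checked mechanically. The sketch below is a minimal audit, assuming Python's standard-library `urllib.robotparser` is a reasonable stand-in for how crawlers read the file (vendors' own parsers may differ). The two robots.txt snippets and the function name are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Search-index tokens from the table in section 2.
SEARCH_INDEX_TOKENS = [
    "OAI-SearchBot", "Claude-SearchBot", "PerplexityBot",
    "Googlebot", "Applebot", "Bingbot",
]

def blocked_search_crawlers(robots_txt, url="https://example.com/"):
    """List the search-index tokens that this robots.txt refuses for `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [t for t in SEARCH_INDEX_TOKENS if not parser.can_fetch(t, url)]

# Hypothetical "block AI bots" snippet that mixes training and search crawlers.
BLANKET_SNIPPET = """\
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: OAI-SearchBot
User-agent: PerplexityBot
Disallow: /
"""

# Hypothetical per-token rules that refuse training only.
PER_TOKEN_RULES = """\
User-agent: GPTBot
User-agent: ClaudeBot
Disallow: /
"""

print(blocked_search_crawlers(BLANKET_SNIPPET))
print(blocked_search_crawlers(PER_TOKEN_RULES))
```

An empty list from the audit means no search-index crawler is refused by robots.txt itself. It says nothing about CDN or firewall rules, which need a separate check.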

The reverse trap is quieter. A site allows everything, assumes it is fine, and never notices that a CDN or firewall rule is returning errors to OAI-SearchBot while GPTBot sails through. The only way to know is to check what each individual token can actually reach, which is what our robots.txt AI bot checker does, and why we argue that crawl access is the first thing to verify before any other AI visibility work.

How to verify a bot is actually who it claims to be

A user-agent string is trivial to spoof, and plenty of scrapers wear a GPTBot or Googlebot label to look legitimate, so Google explicitly warns not to trust the string alone. Every major vendor gives you two stronger checks: reverse DNS on the source IP, and a published list of official IP ranges you can match against.

  • OpenAI publishes its ranges as JSON, including openai.com/searchbot.json for OAI-SearchBot, with separate files for GPTBot and ChatGPT-User.
  • Anthropic lists every bot IP at claude.com/crawling/bots.json.
  • Perplexity publishes perplexity.com/perplexitybot.json and a separate file for Perplexity-User.
  • Google verifies through reverse DNS to googlebot.com or google.com and a published IP range list.
  • Apple uses reverse DNS in the applebot.apple.com domain plus a CIDR list at search.developer.apple.com/applebot.json.

The reliable rule: match the source IP against the vendor’s published list or its reverse DNS, and treat the user-agent string as a label, not proof.
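A minimal sketch of both checks, using only Python's standard library, might look like the following. The function names are hypothetical, and the CIDR ranges are reserved documentation addresses standing in for a vendor's published list, not real crawler ranges. Loading a vendor's JSON file and performing the reverse DNS lookup itself (for example with `socket.gethostbyaddr`, followed by a forward lookup to confirm) are left out because they need network access and each vendor's file layout should be read from its own documentation.

```python
import ipaddress

def ip_in_published_ranges(source_ip, cidr_ranges):
    """Return True if source_ip falls inside any of the given CIDR ranges."""
    ip = ipaddress.ip_address(source_ip)
    for cidr in cidr_ranges:
        network = ipaddress.ip_network(cidr, strict=False)
        # Only compare addresses of the same IP version.
        if ip.version == network.version and ip in network:
            return True
    return False

def hostname_in_domains(hostname, domains):
    """Return True if a reverse-DNS hostname is in, or under, an expected domain."""
    host = hostname.rstrip(".").lower()
    return any(host == d or host.endswith("." + d) for d in domains)

# Placeholder ranges (RFC 5737 / RFC 3849 documentation blocks), NOT vendor data.
EXAMPLE_RANGES = ["192.0.2.0/24", "2001:db8::/32"]

print(ip_in_published_ranges("192.0.2.17", EXAMPLE_RANGES))
print(ip_in_published_ranges("198.51.100.5", EXAMPLE_RANGES))
# Illustrative hostnames; the second shows why a suffix match needs the leading dot.
print(hostname_in_domains("crawl-1-2-3-4.googlebot.com", ["googlebot.com", "google.com"]))
print(hostname_in_domains("googlebot.com.evil.example", ["googlebot.com", "google.com"]))
```

A request would count as verified only when one of these checks passes for the token its user-agent claims; a matching string with a non-matching IP is the spoofing case described above.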

How to decide what to allow

For most brands that want AI visibility, the answer is short. Allow every search-index crawler without exception, because those are the ones that decide whether an assistant can cite you. Allow the on-demand fetchers too, since blocking them only hurts you when a real user is asking about your brand. Training is the one genuine choice: if you do not want your content training models, disallow the training crawlers and the control tokens, and leave the search and fetch agents alone. Make that one decision deliberately, write the rules per token, and then verify them, because a robots.txt that looks right and a robots.txt that an AI crawler can actually follow are not always the same file. The companion move is a clean llms.txt once access is sorted, and our robots.txt checker to confirm each token resolves the way you intended.

Allowing a crawler is only the first half of the picture. Which index each assistant actually reads decides how you then get into it, and that differs by engine: ChatGPT on Bing, Claude on Brave, Perplexity on its own index. We map all of it, with the submission path for each, in which search index each AI assistant reads.

Last reviewed June 2026. AI crawlers change often; the vendor documentation linked above is the source of truth, and we re-check these tokens each quarter.