When someone asks an AI product about a site I run, does the product actually fetch the page, or does it answer from an index it built earlier? I wanted a straight answer, so I set up an nginx probe and prompted the major AI assistants with queries that should force a live fetch. This post is what the server recorded, and what you can safely measure from it.
Two different signals
“AI traffic” usually means one of two things, and nginx logs make the difference obvious.
- Provider-side fetch. The AI product hits the origin itself, usually with a dedicated user-agent and no referrer.
- Real clickthrough visit. A human reads the AI answer, clicks a citation, and arrives as a normal browser with the AI product as the referrer.
Folding both into a single AI-traffic number hides the most useful distinction in the data. One is the model reaching out to read you. The other is a human reading you because the model pointed.
The probe
A custom nginx log format captures the headers the default combined format drops:

```nginx
log_format ai_probe escape=json
  '{'
    '"time":"$time_iso8601",'
    '"ip":"$remote_addr",'
    '"uri":"$request_uri",'
    '"status":$status,'
    '"ua":"$http_user_agent",'
    '"referer":"$http_referer",'
    '"accept":"$http_accept"'
  '}';

# Write probe entries somewhere separate from the main access log
# (adjust the path to taste).
access_log /var/log/nginx/ai_probe.log ai_probe;
```
Each assistant got a prompt pointing at a unique query string (/?ai=chatgpt, /?ai=claude, and so on), so attributing hits was a grep, not a guess. I reran prompts across sessions so a transient cache hit would not hide the retrieval path.
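Because each line is one JSON object and each assistant has its own `?ai=` tag, attribution is a few lines of stdlib Python. A minimal sketch (the function name `attribute_hits` is mine, not part of the probe):

```python
import json
from urllib.parse import parse_qs, urlparse

def attribute_hits(log_lines):
    """Group ai_probe JSON log lines by their ?ai= query-string tag."""
    hits = {}
    for line in log_lines:
        entry = json.loads(line)
        qs = parse_qs(urlparse(entry["uri"]).query)
        tag = qs.get("ai", ["untagged"])[0]
        hits.setdefault(tag, []).append((entry["time"], entry["ip"], entry["ua"]))
    return hits
```

Feed it `open("/var/log/nginx/ai_probe.log")` and every hit lands in the bucket of the assistant that was prompted; anything without a tag falls into `untagged`.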
What each assistant sent
All values below are copied from the probe’s log file, not paraphrased.
| AI product | User-agent sent | Accept header | robots.txt first? | Fetched the page? |
|---|---|---|---|---|
| ChatGPT | ChatGPT-User/1.0 | Chrome-style: text/html,application/xhtml+xml,...*/*;q=0.8 | no | yes |
| Claude | Claude-User/1.0 | */* | yes | yes |
| Perplexity | Perplexity-User/1.0 | (empty, no Accept sent) | yes (via PerplexityBot) | yes |
| Gemini | not observed (see below) | not applicable | not applicable | no |
| Microsoft Copilot | plain Chrome 135 on Linux x86_64, no distinct UA | Chrome-style | no | yes |
| Grok | plain Mac Safari 26 / Chrome 143, no distinct UA | Chrome-style | no | yes |
| Meta AI (Muse Spark) | meta-webindexer/1.1 | */* | no | yes |
| Manus | Manus-User/1.0 suffix on a Chrome UA | Chrome-style | no | yes |
A question that came up after the first version of this post was whether any assistant negotiates for text/markdown. None of them did. ChatGPT sends a full Chrome-style Accept string, Claude sends a wildcard, Perplexity sends no Accept header at all.
ChatGPT: multi-IP bursts
ChatGPT-User hits the origin from multiple source IPs inside the same burst. On a separate site I run (carticy.com), a recent 24-hour window captured ChatGPT-User requests from five distinct Azure ranges: 23.98.x.x, 20.215.x.x, 40.67.x.x, 51.8.x.x, and 51.107.x.x. This matches OpenAI’s own description of the agent in their bots documentation. If you are rate-limiting based on a single source IP, you will under-count.
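The under-counting is easy to check on your own logs: count distinct source IPs per user-agent, and any multi-IP burst shows up as a count above one. A small sketch (helper name `ips_per_agent` is mine; it takes already-parsed log entries):

```python
from collections import defaultdict

def ips_per_agent(entries):
    """Map each user-agent to its number of distinct source IPs."""
    seen = defaultdict(set)
    for e in entries:
        seen[e["ua"]].add(e["ip"])
    return {ua: len(ips) for ua, ips in seen.items()}
```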
Claude: robots.txt first, every time
Claude-User pulled /robots.txt before every page fetch, out of Anthropic-owned IP space in the 216.73.216.0/24 range. Redirects were followed normally. The robots precheck matches Anthropic’s behavior as documented in their crawler docs. If you want Claude to skip your site, a User-agent: Claude-User disallow is the live control. Anthropic also runs two other bots that should not be confused with this one: Claude-SearchBot (their search index) and ClaudeBot (their training crawler). Only Claude-User is the user-initiated retrieval signal.
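Since Claude-User checks robots.txt before every fetch, the disallow really is live. The stanza, blocking the whole site (narrow the `Disallow` path if you only want to fence off part of it):

```
User-agent: Claude-User
Disallow: /
```

Claude-SearchBot and ClaudeBot need their own stanzas if you want to gate indexing and training as well; disallowing one does not touch the others.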
Perplexity: direct fetch, no niceties
Perplexity-User fetched the page directly. No Accept header, no referrer. Separately, PerplexityBot (their search-indexing crawler, not the user-retrieval one) pinged /robots.txt. I captured few Perplexity retrieval runs in total, and Perplexity can answer from its own index without hitting the origin, so the safe wording is that Perplexity can retrieve live; it does not have to. See Perplexity’s bots documentation for their own framing.
Gemini: no hit, not even once
Two separate observations, which the first draft of this post incorrectly ran together.
- Observed. Zero requests arrived from any Google user-agent during the Gemini prompt window. Gemini answered entirely from its own index; it did not perform a live provider-side fetch that reached my origin.
- Structural. Google does not publish a retrieval-specific user-agent for Gemini. Per Google’s own crawler documentation, AI Overviews and AI Mode ground on the same Search index that Googlebot populates. If Gemini ever does live-fetch, it would arrive as Googlebot, indistinguishable from ordinary Search indexing.
The practical consequences worth stating:
- A Googlebot hit cannot be attributed to Gemini vs classic Search from the request alone.
- Blocking Google-Extended does not block Googlebot. It gates whether Googlebot-crawled content may be used for Gemini training and grounding.
- Any AI-traffic dashboard built on server logs is observably asymmetric by vendor. Plan for the asymmetry; do not paper over it.
Copilot and Grok: invisible by default
Microsoft Copilot fetched the page as plain Chrome 135 on Linux x86_64, with a full browser-style Accept header and the usual burst of CSS, JS, and image requests. No distinct Copilot user-agent, no Bingbot activity during the prompt window. Per Microsoft’s guidance for generative-AI and public websites, Copilot grounds on the Bing index populated by Bingbot, but the live fetch we observed was not Bingbot. From the log operator’s side, you cannot positively attribute a Copilot fetch to Copilot by user-agent alone.
Grok fetched the page as plain Mac Safari 26 (and in a second run, plain Mac Chrome 143). No distinct UA, no suffix, no header signal that would let you attribute the hit to xAI from the request alone. Grok documents no retrieval-specific bot. Same observability problem as Copilot, with even less documentation to fall back on.
Between Gemini, Copilot, and Grok, three of the major AI products are either invisible in provider-fetch logs (Gemini, in the run we captured) or indistinguishable from an ordinary human visitor (Copilot and Grok). Any HTTP-based AI-traffic dashboard that ignores this asymmetry is reporting a partial picture.
Meta AI: two documented bots, one observed, no confident mapping
Meta AI, prompted through its Muse Spark surface, triggered a fetch from meta-webindexer/1.1 with Accept: */*. Meta’s own web-crawlers documentation describes a different bot, Meta-ExternalFetcher, as the user-initiated retrieval bot for Facebook, Messenger, Instagram, and WhatsApp AI features, and documents that it may bypass robots.txt on the grounds that a human or agent followed a specific link.
We observed one of these bots in one session. We did not observe both, and the probe cannot isolate which factors determine when each one fires: product surface, first-time vs repeat fetch, prior index state, or something else. Treat meta-webindexer and Meta-ExternalFetcher as both belonging to Meta’s retrieval-class family, and if you need to block either of them, target them explicitly by UA rather than assume a single name covers all Meta AI products.
Manus: the agent that labels itself
Manus fetched as Mozilla/5.0 ... Chrome/132.0 ... ; Manus-User/1.0. The Manus-User/1.0 suffix is the retrieval signal. Unlike the other agents tested, Manus rendered the full page: HTML, every CSS file, every JS file, every image. Of the agentic AI products in this probe, Manus is the one that labels itself clearly in the UA and is easiest to identify in logs.
What a product can track without overclaiming
Two tracking classes hold up against the logs.
Provider fetch
Vendor-documented or probe-observed retrieval user-agents hitting your origin: ChatGPT-User, Claude-User, Perplexity-User, Manus-User, Meta-ExternalFetcher (documented), and meta-webindexer (observed; Meta bot class not fully clear to us).
Real visit
Normal browser user-agent with an AI product as the referrer: chatgpt.com, claude.ai, perplexity.ai, gemini.google.com, copilot.microsoft.com, grok.com, meta.ai, and google.com / bing.com as broader buckets (with no way to isolate AI Mode or Copilot from classic Search using HTTP alone).
Search-indexing bots (OAI-SearchBot, Claude-SearchBot, PerplexityBot, Googlebot, Bingbot) are a separate signal. They are not live retrieval against a specific user query; folding them into the provider-fetch bucket turns the metric into noise. Training bots (GPTBot, ClaudeBot, CCBot) are a third separate signal, and they have no business inside a retrieval count.
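The three-way split above can be sketched as one classifier over the logged user-agent and referrer. A minimal version under stated assumptions: the token lists come from the table below plus the probe observations, the function name `classify` is mine, and substring matching on UAs is deliberately loose (real Bingbot sends a lowercase `bingbot` token):

```python
RETRIEVAL_UAS = ("ChatGPT-User", "Claude-User", "Perplexity-User",
                 "Manus-User", "Meta-ExternalFetcher", "meta-webindexer")
SEARCH_UAS = ("OAI-SearchBot", "Claude-SearchBot", "PerplexityBot",
              "Googlebot", "bingbot")
TRAINING_UAS = ("GPTBot", "ClaudeBot", "CCBot")
AI_REFERRERS = ("chatgpt.com", "claude.ai", "perplexity.ai",
                "gemini.google.com", "copilot.microsoft.com",
                "grok.com", "meta.ai")

def classify(ua, referer):
    """Bucket one request: provider fetch, search indexing, training, real visit, or other."""
    if any(tok in ua for tok in RETRIEVAL_UAS):
        return "provider_fetch"
    if any(tok in ua for tok in SEARCH_UAS):
        return "search_indexing"
    if any(tok in ua for tok in TRAINING_UAS):
        return "training"
    if referer and any(host in referer for host in AI_REFERRERS):
        return "real_visit"
    return "other"
```

Note what "other" absorbs: Copilot and Grok provider fetches, which arrive as plain browsers. No classifier built on HTTP headers can pull those out; that is the asymmetry, made concrete.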
Appendix: vendor-documented bot taxonomy
| Bot | Company | Class | Source |
|---|---|---|---|
| ChatGPT-User | OpenAI | retrieval | platform.openai.com/docs/bots |
| OAI-SearchBot | OpenAI | search_indexing | platform.openai.com/docs/bots |
| GPTBot | OpenAI | training | platform.openai.com/docs/bots |
| Claude-User | Anthropic | retrieval | Anthropic crawler docs |
| Claude-SearchBot | Anthropic | search_indexing | Anthropic crawler docs |
| ClaudeBot | Anthropic | training | Anthropic crawler docs |
| Perplexity-User | Perplexity | retrieval | docs.perplexity.ai/guides/bots |
| PerplexityBot | Perplexity | search_indexing | docs.perplexity.ai/guides/bots |
| Meta-ExternalFetcher | Meta | retrieval (may bypass robots.txt) | Meta web crawlers |
| Meta-ExternalAgent | Meta | training and product indexing | Meta web crawlers |
| meta-webindexer | Meta | observed on Meta AI (Muse Spark) retrieval; class not fully clear to us | Meta crawler docs |
| Manus-User | Manus | retrieval (agentic; full browser-style render) | observed in this probe |
| Googlebot | Google | search_indexing (also grounds AI Overviews and AI Mode) | Google crawlers |
| Google-Extended | Google | usage control, not a crawler; gates Gemini training and grounding | Google crawlers |
| Bingbot | Microsoft | search_indexing (also grounds Microsoft Copilot) | Copilot public websites |
| CCBot | Common Crawl | training (used by many labs) | commoncrawl.org/ccbot |
Microsoft Copilot and Grok are not in this table. Neither vendor documents a retrieval-specific user-agent we can cite; the live fetches we observed from both came in as plain browsers.
Check this on your own site
Our robots.txt checker reads your live file and reports which retrieval, search, and training user-agents it currently allows or blocks. No account needed. That is the fastest way to turn the table above into one concrete answer about your domain.
