What nginx logs prove about AI traffic vs referral traffic

Ali Khallad · Updated April 19, 2026 · 7 min read
[Image: top-down photograph of a desk with a laptop showing an nginx access log containing ChatGPT-User, Claude-User, and Perplexity-User user-agent tokens, next to a printed log excerpt with a highlighted line.]

When someone asks an AI product about a site I run, does the product actually fetch the page, or does it answer from an index it built earlier? I wanted a straight answer, so I set up an nginx probe and prompted the major AI assistants with queries that should force a live fetch. This post is what the server recorded, and what you can safely measure from it.

Two different signals

“AI traffic” usually means one of two things, and nginx logs make the difference obvious.

  • Provider-side fetch. The AI product hits the origin itself, usually with a dedicated user-agent and no referrer.
  • Real clickthrough visit. A human reads the AI answer, clicks a citation, and arrives as a normal browser with the AI product as the referrer.

Folding both into a single AI-traffic number hides the most useful distinction in the data. One is the model reaching out to read you. The other is a human reading you because the model pointed.
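The two signals can be separated mechanically. A minimal sketch, assuming JSON lines in the ai_probe format defined below; the UA tokens and referrer hosts come from the observations later in this post, and the exact matching strategy (substring checks) is my assumption, not a vendor contract:

```python
import json

# Retrieval user-agent tokens observed or vendor-documented in this probe.
PROVIDER_UA_TOKENS = ("ChatGPT-User", "Claude-User", "Perplexity-User",
                      "Manus-User", "Meta-ExternalFetcher", "meta-webindexer")

# Referrer hosts that indicate a human clicked out of an AI answer.
AI_REFERRER_HOSTS = ("chatgpt.com", "claude.ai", "perplexity.ai",
                     "gemini.google.com", "copilot.microsoft.com",
                     "grok.com", "meta.ai")

def classify(line: str) -> str:
    """Classify one JSON log line: provider_fetch, real_visit, or other."""
    rec = json.loads(line)
    ua = rec.get("ua", "")
    ref = rec.get("referer", "")
    if any(tok in ua for tok in PROVIDER_UA_TOKENS):
        return "provider_fetch"
    if any(host in ref for host in AI_REFERRER_HOSTS):
        return "real_visit"
    return "other"
```

Note that "other" swallows Copilot, Grok, and Gemini provider fetches by construction; as the rest of this post shows, those are not attributable from the request alone.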

The probe

A custom nginx log format captures the headers the default combined format omits:

log_format ai_probe escape=json
  '{'
    '"time":"$time_iso8601",'
    '"ip":"$remote_addr",'
    '"uri":"$request_uri",'
    '"status":$status,'
    '"ua":"$http_user_agent",'
    '"referer":"$http_referer",'
    '"accept":"$http_accept"'
  '}';

Each assistant got a prompt pointing at a unique query string (/?ai=chatgpt, /?ai=claude, and so on), so attributing hits was a grep, not a guess. I reran prompts across sessions so a transient cache hit would not hide the retrieval path.
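The "grep, not a guess" attribution can be made reproducible with a small tally over the probe log. A sketch, assuming the JSON-lines output of the ai_probe format above; the function name is mine:

```python
import json
from collections import Counter

def attribute_probe_hits(log_lines):
    """Tally requests per ?ai=<assistant> probe tag and user-agent,
    so each hit maps back to the prompt that should have caused it."""
    counts = Counter()
    for line in log_lines:
        rec = json.loads(line)
        uri = rec.get("uri", "")
        if "?ai=" not in uri:
            continue  # robots.txt pulls and asset fetches fall out here
        tag = uri.split("?ai=", 1)[1].split("&", 1)[0]
        counts[(tag, rec.get("ua", ""))] += 1
    return counts
```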

What each assistant sent

All values below are copied from the probe’s log file, not paraphrased.

| AI product | User-agent sent | Accept header | robots.txt first? | Fetched the page? |
| --- | --- | --- | --- | --- |
| ChatGPT | ChatGPT-User/1.0 | Chrome-style: text/html,application/xhtml+xml,...*/*;q=0.8 | no | yes |
| Claude | Claude-User/1.0 | */* | yes | yes |
| Perplexity | Perplexity-User/1.0 | (none sent) | yes (via PerplexityBot) | yes |
| Gemini | not observed (see below) | not applicable | not applicable | no |
| Microsoft Copilot | plain Chrome 135 on Linux x86_64, no distinct UA | Chrome-style | no | yes |
| Grok | plain Mac Safari 26 / Chrome 143, no distinct UA | Chrome-style | no | yes |
| Meta AI (Muse Spark) | meta-webindexer/1.1 | */* | no | yes |
| Manus | Manus-User/1.0 suffix on a Chrome UA | Chrome-style | no | yes |

A question that came up after the first version of this post was whether any assistant negotiates for text/markdown. None of them did. ChatGPT sends a full Chrome-style Accept string, Claude sends a wildcard, Perplexity sends no Accept header at all.

ChatGPT: multi-IP bursts

ChatGPT-User hits the origin from multiple source IPs inside the same burst. On a separate site I run (carticy.com), a recent 24-hour window captured ChatGPT-User requests from five distinct Azure ranges: 23.98.x.x, 20.215.x.x, 40.67.x.x, 51.8.x.x, and 51.107.x.x. This matches OpenAI’s own description of the agent in their bots documentation. If you are rate-limiting based on a single source IP, you will under-count.
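If you want to see how many source ranges a burst actually spans, grouping by leading octets is enough for a first pass. A sketch over the same JSON-lines probe log; the two-octet grouping is a simplification of mine, not how Azure allocates ranges:

```python
import json

def source_prefixes(log_lines, ua_token="ChatGPT-User", octets=2):
    """Collect distinct leading IPv4 octet prefixes for requests whose
    user-agent contains ua_token; one burst can span several prefixes."""
    prefixes = set()
    for line in log_lines:
        rec = json.loads(line)
        if ua_token in rec.get("ua", ""):
            prefixes.add(".".join(rec.get("ip", "").split(".")[:octets]))
    return prefixes
```

Rate-limit keyed on `len(source_prefixes(...))` rather than a single IP if you want the count to survive multi-range bursts.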

Claude: robots.txt first, every time

Claude-User pulled /robots.txt before every page fetch, out of Anthropic-owned IP space in the 216.73.216.0/24 range. Redirects were followed normally. The robots precheck matches Anthropic’s behavior as documented in their crawler docs. If you want Claude to skip your site, a User-agent: Claude-User disallow is the live control. Anthropic also runs two other bots that should not be confused with this one: Claude-SearchBot (their search index) and ClaudeBot (their training crawler). Only Claude-User is the user-initiated retrieval signal.
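That live control, sketched as a robots.txt fragment (each bot needs its own record; blocking one does not block the others):

```text
# Block user-initiated retrieval only
User-agent: Claude-User
Disallow: /

# Search indexing and training are separate records
User-agent: Claude-SearchBot
Disallow: /

User-agent: ClaudeBot
Disallow: /
```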

Perplexity: direct fetch, no niceties

Perplexity-User fetched the page directly: no Accept header, no referrer. Separately, PerplexityBot (their search-indexing crawler, not the user-retrieval one) pinged /robots.txt. I captured only a few Perplexity retrieval runs in total, and Perplexity can answer from its own index without hitting the origin, so the safe wording is that Perplexity can retrieve live; it does not have to. See Perplexity’s bots documentation for their own framing.

Gemini: no hit, not even once

Two separate observations, which the first draft of this post incorrectly ran together.

  • Observed. Zero requests arrived from any Google user-agent during the Gemini prompt window. Gemini answered entirely from its own index; it did not perform a live provider-side fetch that reached my origin.
  • Structural. Google does not publish a retrieval-specific user-agent for Gemini. Per Google’s own crawler documentation, AI Overviews and AI Mode ground on the same Search index that Googlebot populates. If Gemini ever does live-fetch, it would arrive as Googlebot, indistinguishable from ordinary Search indexing.

The practical consequences worth stating:

  • A Googlebot hit cannot be attributed to Gemini vs classic Search from the request alone.
  • Blocking Google-Extended does not block Googlebot. It gates whether Googlebot-crawled content may be used for Gemini training and grounding.
  • Any AI-traffic dashboard built on server logs is observably asymmetric by vendor. Plan for the asymmetry; do not paper over it.

Copilot and Grok: invisible by default

Microsoft Copilot fetched the page as plain Chrome 135 on Linux x86_64, with a full browser-style Accept header and the usual burst of CSS, JS, and image requests. No distinct Copilot user-agent, no Bingbot activity during the prompt window. Per Microsoft’s guidance for generative-AI and public websites, Copilot grounds on the Bing index populated by Bingbot, but the live fetch we observed was not Bingbot. From the log operator’s side, you cannot positively attribute a Copilot fetch to Copilot by user-agent alone.

Grok fetched the page as plain Mac Safari 26 (and in a second run, plain Mac Chrome 143). No distinct UA, no suffix, no header signal that would let you attribute the hit to xAI from the request alone. Grok documents no retrieval-specific bot. Same observability problem as Copilot, with even less documentation to fall back on.

Between Gemini, Copilot, and Grok, three of the major AI products are either invisible in provider-fetch logs (Gemini, in the run we captured) or indistinguishable from an ordinary human visitor (Copilot and Grok). Any HTTP-based AI-traffic dashboard that ignores this asymmetry is reporting a partial picture.

Meta AI: two documented bots, one observed, no confident mapping

Meta AI, prompted through its Muse Spark surface, triggered a fetch from meta-webindexer/1.1 with Accept: */*. Meta’s own web-crawlers documentation describes a different bot, Meta-ExternalFetcher, as the user-initiated retrieval bot for Facebook, Messenger, Instagram, and WhatsApp AI features, and documents that it may bypass robots.txt on the grounds that a human or agent followed a specific link.

We observed one of these bots in one session. We did not observe both, and the probe cannot isolate which factors determine when each one fires: product surface, first-time vs repeat fetch, prior index state, or something else. Treat meta-webindexer and Meta-ExternalFetcher as both belonging to Meta’s retrieval-class family, and if you need to block either of them, target them explicitly by UA rather than assume a single name covers all Meta AI products.

Manus: the agent that labels itself

Manus fetched as Mozilla/5.0 ... Chrome/132.0 ... ; Manus-User/1.0. The Manus-User/1.0 suffix is the retrieval signal. Unlike the other agents tested, Manus rendered the full page: HTML, every CSS file, every JS file, every image. Of the agentic AI products in this probe, Manus is the one that labels itself clearly in the UA and is easiest to identify in logs.

What a product can track without overclaiming

Two tracking classes hold up against the logs.

Provider fetch

Vendor-documented or probe-observed retrieval user-agents hitting your origin: ChatGPT-User, Claude-User, Perplexity-User, Manus-User, Meta-ExternalFetcher (documented), and meta-webindexer (observed; Meta bot class not fully clear to us).

Real visit

Normal browser user-agent with an AI product as the referrer: chatgpt.com, claude.ai, perplexity.ai, gemini.google.com, copilot.microsoft.com, grok.com, meta.ai, and google.com / bing.com as broader buckets (with no way to isolate AI Mode or Copilot from classic Search using HTTP alone).

Search-indexing bots (OAI-SearchBot, Claude-SearchBot, PerplexityBot, Googlebot, Bingbot) are a separate signal. They are not live retrieval against a specific user query; folding them into the provider-fetch bucket turns the metric into noise. Training bots (GPTBot, ClaudeBot, CCBot) are a third separate signal, and they have no business inside a retrieval count.
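Keeping the three signal classes apart is a lookup, not a judgment call. A sketch built from the vendor-documented tokens in the appendix table below; the substring-matching approach and the "unknown" fallback are my assumptions:

```python
# Vendor-documented bot classes. Folding search or training bots
# into the retrieval count turns the metric into noise.
BOT_CLASSES = {
    "ChatGPT-User": "retrieval",
    "Claude-User": "retrieval",
    "Perplexity-User": "retrieval",
    "Manus-User": "retrieval",
    "Meta-ExternalFetcher": "retrieval",
    "OAI-SearchBot": "search_indexing",
    "Claude-SearchBot": "search_indexing",
    "PerplexityBot": "search_indexing",
    "Googlebot": "search_indexing",
    "bingbot": "search_indexing",
    "GPTBot": "training",
    "ClaudeBot": "training",
    "CCBot": "training",
}

def bot_class(ua: str) -> str:
    """Map a user-agent string to retrieval, search_indexing,
    training, or unknown (which includes every plain browser)."""
    for token, cls in BOT_CLASSES.items():
        if token in ua:
            return cls
    return "unknown"
```

The "unknown" bucket is where Copilot, Grok, and any Gemini fetch land, which is the asymmetry this post keeps pointing at.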

Appendix: vendor-documented bot taxonomy

| Bot | Company | Class | Source |
| --- | --- | --- | --- |
| ChatGPT-User | OpenAI | retrieval | platform.openai.com/docs/bots |
| OAI-SearchBot | OpenAI | search_indexing | platform.openai.com/docs/bots |
| GPTBot | OpenAI | training | platform.openai.com/docs/bots |
| Claude-User | Anthropic | retrieval | Anthropic crawler docs |
| Claude-SearchBot | Anthropic | search_indexing | Anthropic crawler docs |
| ClaudeBot | Anthropic | training | Anthropic crawler docs |
| Perplexity-User | Perplexity | retrieval | docs.perplexity.ai/guides/bots |
| PerplexityBot | Perplexity | search_indexing | docs.perplexity.ai/guides/bots |
| Meta-ExternalFetcher | Meta | retrieval (may bypass robots.txt) | Meta web crawlers |
| Meta-ExternalAgent | Meta | training and product indexing | Meta web crawlers |
| meta-webindexer | Meta | observed on Meta AI (Muse Spark) retrieval; class not fully clear to us | Meta crawler docs |
| Manus-User | Manus | retrieval (agentic; full browser-style render) | observed in this probe |
| Googlebot | Google | search_indexing (also grounds AI Overviews and AI Mode) | Google crawlers |
| Google-Extended | Google | usage control, not a crawler; gates Gemini training and grounding | Google crawlers |
| Bingbot | Microsoft | search_indexing (also grounds Microsoft Copilot) | Copilot public websites |
| CCBot | Common Crawl | training (used by many labs) | commoncrawl.org/ccbot |

Microsoft Copilot and Grok are not in this table. Neither vendor documents a retrieval-specific user-agent we can cite; the live fetches we observed from both came in as plain browsers.

Check this on your own site

Our robots.txt checker reads your live file and reports which retrieval, search, and training user-agents it currently allows or blocks. No account needed. That is the fastest way to turn the table above into one concrete answer about your domain.