Before GEO, Check Whether AI Crawlers Can Reach Your Site

A lot of AI visibility work starts too late.

Teams jump to prompts, schema, llms.txt, citations, reviews, or GEO tactics before checking a simpler problem: can the AI systems they care about reach the pages that explain the brand?

If product pages, docs, pricing pages, or comparison pages are blocked, unstable, hidden behind JavaScript, or failing for crawler-like requests, the site may have an access problem before it has a content problem. A better prompt set will not fix a page that the systems you care about cannot fetch or understand.

This is the basic check to run before the more exciting AI search optimization work.

AI visibility starts before the answer

AI visibility is usually discussed at the answer layer: whether ChatGPT, Perplexity, Gemini, Claude, Google AI Overviews, or Google AI Mode mentions, cites, compares, or recommends a brand.

Before a system can use a page as evidence, something has to access it. That could be a traditional search crawler, an AI crawler, a retrieval system, a browser agent, or a user-triggered assistant that fetches a page on demand.

Search Engine Land’s AI crawler guide calls out practical issues like rendering, internal linking, server errors, and access controls. That framing is useful because it keeps the conversation grounded. AI visibility depends on both what you publish and whether the right systems can fetch and interpret the pages that matter.

This does not mean every AI crawler should be allowed everywhere. It means you should know what your current setup allows, blocks, and breaks.

Do not treat every AI crawler the same

One common mistake is talking about AI crawlers as if they all do the same job.

They do not. Some bots are used for model training. Some fetch content for search or answer generation. Some visit because a user asked an assistant to open or summarize a page. Some agents browse more like a person than a classic crawler.

The right policy may be different for each case.

You may choose to block some training crawlers.
Search, answer, or user-triggered fetchers may be worth allowing on public pages.
Docs and product pages may need to stay reachable while private, paid, duplicate, or low-value areas stay restricted.
Assistant referrals should be logged separately from bot requests when possible.

The goal is deliberate access. The weaker setup is one that inherits accidental rules from a CDN, security plugin, old robots.txt file, or bot-protection setting.
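One way to make a policy deliberate is to test a draft before deploying it. The sketch below runs Python's standard urllib.robotparser against a hypothetical robots.txt. The file contents, paths, and example.com URLs are illustrative assumptions, and the user-agent tokens (GPTBot, OAI-SearchBot, ChatGPT-User) should be checked against each vendor's current documentation before you rely on them.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical draft policy: block one training crawler, allow search and
# user-triggered fetchers, and keep private areas closed to everyone else.
SAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
User-agent: ChatGPT-User
Allow: /

User-agent: *
Disallow: /account/
Disallow: /search
"""


def access_matrix(robots_txt, user_agents, urls):
    """Map each (user agent, URL) pair to True if the policy allows the fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {
        (agent, url): parser.can_fetch(agent, url)
        for agent in user_agents
        for url in urls
    }


def print_matrix(matrix):
    """Print one line per pair so accidental blocks are easy to spot."""
    for (agent, url), allowed in sorted(matrix.items()):
        print(f"{agent:15} {'ALLOW' if allowed else 'BLOCK':5} {url}")
```

Calling print_matrix(access_matrix(SAMPLE_ROBOTS_TXT, ["GPTBot", "OAI-SearchBot"], ["https://example.com/pricing"])) lists which agent may fetch which page. In this sample, OAI-SearchBot is allowed to fetch /account/ paths, because a crawler that matches a named group follows only that group and does not inherit the rules under *. Python's parser also applies rules in file order, so treat the result as an approximation of how any particular crawler reads the file.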

The access problems that actually matter

Most crawler problems are small technical issues that quietly make useful pages harder to reach.

robots.txt rules: important paths may be blocked intentionally or accidentally.
403 responses: security tools may block crawlers or user-triggered fetchers that look unfamiliar.
WAF and CDN rules: bot protection can block useful requests along with abusive traffic.
Server errors: pages that load in a browser may return 5xx errors for crawler-like requests.
JavaScript-only content: the main text may not be present in the initial HTML.
Weak internal links: important pages can stay buried if few pages link to them.
Canonical confusion: systems may treat a different URL as the primary version.
Redirect chains: old URLs, chained redirects, and dead pages can waste fetches and create failures.

None of these issues proves that AI visibility will be weak. They do create blind spots when the affected URLs are the pages that explain your product, pricing, use cases, documentation, or comparisons.
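One item on that list, JavaScript-only content, can be checked without a browser. The sketch below extracts the text a non-rendering fetcher would see in raw HTML and reports which expected phrases are missing. The class and function names are hypothetical, and the phrases you pass in are whatever a page must say to explain itself.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text that sits outside <script>, <style>, and <template>."""

    SKIPPED_TAGS = {"script", "style", "template"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside the skipped tags.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    """Return the text present in the initial HTML, without running scripts."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def missing_phrases(html, expected_phrases):
    """Return the expected phrases that do not appear in the initial HTML."""
    text = visible_text(html).lower()
    return [phrase for phrase in expected_phrases if phrase.lower() not in text]
```

If missing_phrases returns the key sentences from your pricing or product page, the content is probably being injected client-side, and a fetcher that does not execute JavaScript will see an empty shell.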

Start with the pages that explain the business

An audit does not need to start with the whole site. Start with the pages an AI system would need in order to understand and describe the brand well.

Homepage
Product pages
Use-case pages
Pricing page
Documentation
Integration pages
Comparison pages
Customer proof, case studies, or review pages
Support pages that explain setup, limitations, or requirements

Then check whether those pages are reachable and useful when fetched outside a normal browser session.

Do they return 200 status codes?
Are they blocked by robots.txt?
Do security tools block known crawler or AI-related user agents?
Is the main content visible in the HTML?
Are the pages internally linked from other important pages?
Do server logs show useful crawlers or AI-related user agents requesting them?
Do failed requests cluster around important URLs?

You can use the free SurfacedBy robots.txt checker as a quick first pass for robots.txt rules, then follow up with logs, status-code checks, and your CDN or WAF settings.
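For the status-code follow-up, a minimal sketch is to request the same URL with a browser-like user agent and with crawler-like ones, then compare the results. The user-agent strings and function names here are hypothetical simplifications. A spoofed user agent only approximates a real crawler: tools that verify bots by IP address may treat your test differently from the genuine request.

```python
import urllib.error
import urllib.request

# Simplified, hypothetical user-agent strings; real crawlers send longer ones.
USER_AGENTS = {
    "browser": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 Chrome/120.0 Safari/537.36",
    "GPTBot": "Mozilla/5.0 (compatible; GPTBot/1.0)",
    "ClaudeBot": "Mozilla/5.0 (compatible; ClaudeBot/1.0)",
    "PerplexityBot": "Mozilla/5.0 (compatible; PerplexityBot/1.0)",
}


def fetch_status(url, user_agent, timeout=10.0):
    """Return the final HTTP status for url, or None if no response arrived."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        # urlopen follows redirects, so this is the status of the final hop.
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # 4xx and 5xx responses are raised as exceptions
    except OSError:
        return None  # DNS failure, refused connection, or timeout


def check_url(url, user_agents=USER_AGENTS):
    """Fetch url once per user agent and return {label: status}."""
    return {label: fetch_status(url, agent) for label, agent in user_agents.items()}


def flag_differences(statuses, baseline="browser"):
    """Return the labels whose status differs from the baseline's status."""
    expected = statuses[baseline]
    return {
        label: status
        for label, status in statuses.items()
        if label != baseline and status != expected
    }
```

Running flag_differences(check_url(url)) for each priority page gives a short list of agents that get a different answer than a browser does. An empty result means the page responds the same way to every user agent you tried, not that every real crawler can reach it.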

robots.txt is only the first layer

robots.txt is often where teams start, and that makes sense. It is the clearest place to see whether a compliant crawler is being told not to fetch a path.

A page can be allowed in robots.txt and still fail. A WAF rule, CDN bot setting, 403 response, login wall, broken redirect, or JavaScript rendering issue can stop useful content from being fetched or interpreted.

The opposite can also happen. A page may look fine in a browser while being blocked or degraded for crawler-like requests.

The robots.txt file shows the intended access policy for crawlers that honor it. Logs show what actually happened.
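To see what actually happened, you can count crawler requests by path and status in your access logs. This sketch assumes the common "combined" log format used by default in many Apache and nginx setups; the regular expression, function names, and token list are assumptions to adapt to your own server.

```python
import re
from collections import Counter

# Assumes the "combined" access log format; adjust if your server differs.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+ "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def summarize_bot_requests(log_lines, bot_tokens):
    """Count requests per (bot token, path, status) for matching user agents."""
    counts = Counter()
    for line in log_lines:
        match = LOG_PATTERN.match(line)
        if match is None:
            continue  # skip lines that are not in the expected format
        agent = match["agent"].lower()
        for token in bot_tokens:
            if token.lower() in agent:
                counts[(token, match["path"], int(match["status"]))] += 1
                break
    return counts


def failing_paths(counts):
    """Return the sorted paths where any matched bot received a 4xx or 5xx."""
    return sorted({path for (_, path, status) in counts if status >= 400})
```

If failing_paths returns your pricing or documentation URLs, the policy you intended and the access crawlers actually get have drifted apart. User-agent strings can be spoofed, so treat these counts as a lead to investigate, not proof of who made the request.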

Access does not guarantee visibility

Allowing crawlers gives your content a chance to be considered. The page still has to be useful, clear, current, and supported by credible evidence elsewhere. AI systems may still prefer competitor sources, review pages, documentation, forums, videos, or comparison pages.

Crawler access is a prerequisite, not the whole strategy. We covered the broader work in How to Optimize for AI Search Without Falling for GEO Hacks.

Blocking can be the right choice

Blocking is not always a mistake.

You may choose to restrict private content, paid content, duplicate pages, staging URLs, internal search results, user-generated pages with moderation risk, or content you do not want certain systems to use.

The problem is accidental blocking. If important public pages are blocked without anyone noticing, the site may be making a visibility tradeoff without choosing it.

A better policy is deliberate: decide which systems you care about, which pages should be reachable, which areas should stay restricted, and how you will monitor the result.

A practical AI crawler access checklist

List the pages that explain your brand, product, pricing, docs, use cases, comparisons, and proof.
Check robots.txt for those paths.
Check whether those URLs return clean 200 responses.
Review WAF, CDN, firewall, and bot-protection rules.
Confirm meaningful content appears in the initial HTML.
Review internal links to important pages.
Check server logs for known search and AI-related user agents.
Separate human AI referrals from bot requests.
Decide which crawlers you intentionally allow or restrict.
Track whether answer visibility, citations, and competitor patterns change after access fixes.
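The step of separating human AI referrals from bot requests comes down to reading two different headers: the user agent identifies a bot, while the referrer identifies a person who clicked through from an assistant. A minimal sketch follows; the token and host lists are illustrative assumptions, not a complete or authoritative registry.

```python
from urllib.parse import urlparse

# Illustrative lists only; verify and extend them for the systems you track.
AI_BOT_TOKENS = ("GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot", "PerplexityBot")
AI_REFERRER_HOSTS = ("chatgpt.com", "perplexity.ai", "gemini.google.com", "claude.ai")


def classify_request(user_agent, referer):
    """Label a request as 'ai_bot', 'ai_referral', or 'other'."""
    agent = user_agent.lower()
    if any(token.lower() in agent for token in AI_BOT_TOKENS):
        return "ai_bot"
    # Logs often record a missing referrer as "-" or an empty string.
    host = (urlparse(referer).hostname or "").lower()
    if any(host == known or host.endswith("." + known) for known in AI_REFERRER_HOSTS):
        return "ai_referral"
    return "other"
```

Checking the user agent first means a bot that happens to send a referrer is still counted as a bot. Keeping the two labels apart lets you report crawler activity and human visits from assistants as separate numbers.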

Where SurfacedBy fits

SurfacedBy helps teams track how AI systems mention, cite, compare, and recommend their brand. For WordPress sites, SurfacedBy can also help detect confirmed AI bots and AI referrers server-side, so crawler activity and human AI-assisted visits are easier to separate from normal traffic.

You can also use the free robots.txt checker to check whether important paths are being blocked before you invest in deeper GEO work.

The value is seeing access, answers, competitors, citations, traffic, and conversions together instead of treating crawler visits as a standalone win.

The bottom line

Before you worry about GEO tactics, check whether AI systems can reach the pages that explain your brand.

Access will not guarantee citations, rankings, or recommendations. But if important pages are blocked, broken, hidden, or invisible to the systems you care about, the rest of your AI visibility work starts on weak ground.