The AI bots to know
Each engine family uses one or more User-Agents. Some serve training crawls, others serve real-time search crawls.
| User-Agent | Publisher | Main use |
|---|---|---|
| GPTBot | OpenAI | Crawl for training and product improvement. |
| OAI-SearchBot | OpenAI | Crawl for ChatGPT Search. |
| ChatGPT-User | OpenAI | Fetch at the moment of a user query. |
| PerplexityBot | Perplexity | Main Perplexity index. |
| Perplexity-User | Perplexity | User-triggered fetch. |
| ClaudeBot / anthropic-ai / Claude-Web | Anthropic | Crawl for training and Claude search. |
| Google-Extended | Google | Robots directive for Gemini / Vertex AI training (separate from Googlebot). |
| Applebot-Extended | Apple | Robots directive for Apple Intelligence training. |
| Bytespider | ByteDance | Aggressive crawl, often blocked by default. |
| CCBot | Common Crawl | Corpus used by many open-source models. |
| meta-externalagent | Meta | Crawl for Meta AI training. |
| cohere-ai | Cohere | Crawl for Cohere training. |
robots.txt — the right default
A neutral, visibility-oriented configuration lets all major AI bots through. That's the stance adopted by llmoptimisation.fr — documented and owned.
```txt
User-agent: *
Allow: /

User-agent: GPTBot
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: Google-Extended
Allow: /

Sitemap: https://yoursite.com/sitemap-index.xml
```

Block or allow: a business call
Three coherent postures depending on your model:
| Profile | Recommended posture | Reason |
|---|---|---|
| Marketing / SaaS / B2B services | Let all major AI bots through | Maximise visibility and citations. |
| E-commerce | Allow through, but protect sensitive product data (dynamic pricing, stock) | Product pages carry marketing value for AI; real-time feeds shouldn't be crawled. |
| Paywalled or paid media | Block, or negotiate commercial licensing | Preserve content value. OpenAI, Google and Perplexity sign licensing deals with major publishers. |
| Premium, non-indexable content | Explicit block | Editorial and legal consistency. |
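As an illustration of the "block training, keep search" posture a paywalled publisher might adopt, a robots.txt along these lines works (the user-agent names are the real ones from the table above; the exact split between allowed and blocked agents is a business decision, not a technical one):

```txt
# Block training crawlers
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

# Keep real-time search fetchers: their citations can drive traffic
User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /
```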
Rendering and crawl: the JS trap
Not all AI engines execute JavaScript. Perplexity, ChatGPT Search and many crawlers rely on static HTML or a fast render. Sites that render content client-side (non-SSR React/Vue SPAs) can be partially or entirely invisible to AI.
- Prefer SSR (Server-Side Rendering) or SSG (Static Site Generation).
- For existing SPAs, set up pre-rendering for bots (Prerender.io, Rendertron).
- Check that critical content is in the initial HTML, not injected after load.
- Avoid modals loading content on demand as the only information surface.
- Test with curl without JS: `curl -A "PerplexityBot" https://yoursite.com/page`
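The same check can be automated. A minimal sketch (the function name and the sample HTML strings are illustrative, not from any real site): compare the raw server response, which is all a no-JS crawler sees, against the critical phrases you expect on the page.

```python
def critical_content_visible(initial_html: str, phrases: list[str]) -> list[str]:
    """Return the phrases missing from the initial (pre-JavaScript) HTML."""
    return [p for p in phrases if p not in initial_html]

# SSR page: the content is present in the server response itself.
ssr_html = "<html><body><h1>Pricing</h1><p>Plans from 29 EUR/month</p></body></html>"

# SPA shell: the content only arrives later, injected by JavaScript.
spa_html = '<html><body><div id="root"></div><script src="/app.js"></script></body></html>'

phrases = ["Pricing", "29 EUR/month"]
print(critical_content_visible(ssr_html, phrases))  # [] -> visible to AI bots
print(critical_content_visible(spa_html, phrases))  # both phrases missing
```

In practice you would feed `initial_html` from a plain HTTP fetch (curl or `urllib`) rather than a hard-coded string; the point is that the comparison runs on the response body, never on the rendered DOM.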
Schema.org: what actually helps
Schema.org doesn't guarantee AI citations, but it improves disambiguation and helps Google surfaces (AI Overviews, panels). Priorities:
- `Organization` — brand identity, `sameAs` to official profiles, logo.
- `WebSite` + `SearchAction` — on the home page.
- `Article` / `TechArticle` — on pillar pages.
- `BreadcrumbList` — everywhere.
- `FAQPage` — on FAQ pages, not every page (otherwise dilution).
- `HowTo` — on method pages structured as steps.
- `Product`, `Review`, `AggregateRating` — e-commerce.
- `DefinedTerm`, `DefinedTermSet` — glossaries.
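For the top-priority `Organization` type, a minimal JSON-LD block looks like this (all names and URLs below are placeholders to adapt to your own brand):

```json
{
  "@context": "https://schema.org",
  "@type": "Organization",
  "name": "YourBrand",
  "url": "https://yoursite.com",
  "logo": "https://yoursite.com/logo.png",
  "sameAs": [
    "https://www.linkedin.com/company/yourbrand",
    "https://x.com/yourbrand"
  ]
}
```

It goes in a `<script type="application/ld+json">` tag, ideally on every page, and can be validated with Google's Rich Results Test.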
Performance: still and always
AI bots set execution budgets. A slow site caps the pages crawled per session. Minimum rules:
- LCP < 2.5 s, CLS < 0.1, INP < 200 ms.
- Compressed HTML (gzip / brotli).
- HTTP/2 or HTTP/3.
- Modern images (AVIF / WebP), `loading="lazy"`.
- No unnecessary render-blocking CSS.
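Compression is the cheapest win on that list. A quick sketch with the standard-library `gzip` module shows the order of magnitude on repetitive HTML (real pages vary; brotli typically compresses a few percent better but needs a third-party module):

```python
import gzip

# Synthetic HTML payload: markup is highly repetitive, so it compresses well.
html = ("<html><body>"
        + "<p>Repeated paragraph of page content.</p>" * 200
        + "</body></html>").encode()

compressed = gzip.compress(html)
ratio = len(compressed) / len(html)
print(f"{len(html)} bytes -> {len(compressed)} bytes ({ratio:.0%} of original)")
```

On a real server this is a one-line config change (`gzip on;` in nginx, `mod_deflate` in Apache), not application code.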
The llms.txt file
A Markdown file served at the root. It offers LLMs a curated table of contents for the site. Adoption is still limited. Real usefulness: moderate. Cost: negligible. Recommendation: publish it, don't make it a priority.
For the specification and detailed best practices, see the dedicated external resource (the sibling site llmtxt.info covers this standard in depth).
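To make "cost: negligible" concrete, a minimal llms.txt is just a short Markdown outline served at `/llms.txt` (the titles and paths below are placeholders; see the spec for the full format):

```markdown
# YourBrand

> One-line description of what the site covers and who it serves.

## Key pages

- [Getting started](https://yoursite.com/getting-started): setup guide
- [Pricing](https://yoursite.com/pricing): plans and terms

## Optional

- [Changelog](https://yoursite.com/changelog)
```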
Express technical checklist
- Explicit robots.txt with major AI bots declared.
- WAF / Cloudflare verified: AI bots not blocked by default WAF rule.
- SSR or SSG; critical HTML present without JS execution.
- Clean XML sitemap, absolute canonical on every page.
- Validated schema.org (rich results test).
- Core Web Vitals in the green.
- llms.txt and llms-full.txt at the root, consistent with site structure.
- Logs monitored for GPTBot, PerplexityBot, ClaudeBot.
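The last checklist item can start as a few lines of Python run over your access logs. A minimal sketch (the log lines below are fabricated samples in combined log format; point the loop at your real log file instead):

```python
from collections import Counter

AI_BOTS = ["GPTBot", "OAI-SearchBot", "PerplexityBot", "ClaudeBot",
           "Google-Extended", "CCBot", "Bytespider"]

# Fabricated sample lines; in production, iterate over open("access.log").
log_lines = [
    '1.2.3.4 - - [10/May/2025:10:00:00 +0000] "GET /guide HTTP/1.1" 200 5120 '
    '"-" "Mozilla/5.0; compatible; GPTBot/1.2; +https://openai.com/gptbot"',
    '5.6.7.8 - - [10/May/2025:10:01:00 +0000] "GET /pricing HTTP/1.1" 200 2048 '
    '"-" "Mozilla/5.0 (compatible; PerplexityBot/1.0; +https://perplexity.ai/perplexitybot)"',
    '9.9.9.9 - - [10/May/2025:10:02:00 +0000] "GET / HTTP/1.1" 200 1024 '
    '"-" "Mozilla/5.0 (Windows NT 10.0) Chrome/124.0"',
]

# Count hits per AI user-agent by simple substring match on each line.
hits = Counter()
for line in log_lines:
    for bot in AI_BOTS:
        if bot in line:
            hits[bot] += 1

print(dict(hits))
```

A sudden drop to zero for a bot you allow in robots.txt usually means a WAF or CDN rule is blocking it upstream, which is exactly what the checklist's WAF item is meant to catch.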