Technical optimisation for AI visibility

What AI bots see, what they don't, and how to make a site addressable. Crawl, rendering, schema.org, llms.txt, granular AI User-Agent management.

Updated: 14 April 2026 · 14 min read

The AI bots to know

Each engine family uses one or more User-Agents. Some serve training crawls, others serve real-time search crawls.

User-Agent | Publisher | Main use
GPTBot | OpenAI | Crawl for training and product improvement.
OAI-SearchBot | OpenAI | Crawl for ChatGPT Search.
ChatGPT-User | OpenAI | Fetch at the moment of a user query.
PerplexityBot | Perplexity | Main Perplexity index.
Perplexity-User | Perplexity | User-triggered fetch.
ClaudeBot / anthropic-ai / Claude-Web | Anthropic | Crawl for training and Claude search.
Google-Extended | Google | Robots directive for Gemini / Vertex training (separate from Googlebot).
Applebot-Extended | Apple | Directive for Apple Intelligence training.
Bytespider | ByteDance | Aggressive crawl, often blocked by default.
CCBot | Common Crawl | Corpus used by many open-source models.
meta-externalagent | Meta | Crawl for Meta AI training.
cohere-ai | Cohere | Crawl for Cohere training.

robots.txt — the right default

A neutral, visibility-oriented configuration lets all major AI bots through. That's the stance adopted by llmoptimisation.fr — documented and owned.

User-agent: *
Allow: /

User-agent: GPTBot
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: Google-Extended
Allow: /

Sitemap: https://yoursite.com/sitemap-index.xml
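A policy like this can be sanity-checked before deployment with Python's standard `urllib.robotparser`. A minimal sketch, with the file inlined as a string and a Bytespider block added for contrast (that block is an assumption for illustration, not part of the config above; in production, point the parser at your live `/robots.txt`):

```python
from urllib.robotparser import RobotFileParser

# Inlined robots.txt for testing; GPTBot open, Bytespider blocked.
robots_txt = """\
User-agent: *
Allow: /

User-agent: GPTBot
Allow: /

User-agent: Bytespider
Disallow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("GPTBot", "/pricing"))      # True: GPTBot is allowed
print(rp.can_fetch("Bytespider", "/pricing"))  # False: Bytespider is blocked
```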

Block or allow: a business call

Three coherent postures depending on your model:

Profile | Recommended posture | Reason
Marketing / SaaS / B2B services | Let all major AI bots through | Maximise visibility and citations.
E-commerce | Allow through, but protect sensitive product data (dynamic pricing, stock) | Product pages carry marketing value for AI; real-time feeds shouldn't be crawled.
Paywalled or paid media | Block, or negotiate commercial licensing | Preserve content value. OpenAI, Google and Perplexity sign licences with major publishers.
Premium, non-indexable content | Explicit block | Editorial and legal consistency.
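For the e-commerce posture, the split can look like this in robots.txt (the paths are illustrative placeholders; adapt them to your URL structure):

```
# Product pages stay open to AI bots; real-time endpoints do not.
User-agent: GPTBot
Allow: /products/
Disallow: /cart/
Disallow: /api/
Disallow: /stock/

User-agent: PerplexityBot
Allow: /products/
Disallow: /cart/
Disallow: /api/
Disallow: /stock/
```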

Rendering and crawl: the JS trap

Not all AI engines execute JavaScript. Perplexity, ChatGPT Search and many crawlers rely on static HTML or a fast render. Sites that render content client-side (non-SSR React/Vue SPAs) can be partially or entirely invisible to AI.

  • Prefer SSR (Server-Side Rendering) or SSG (Static Site Generation).
  • For existing SPAs, set up pre-rendering for bots (Prerender.io, Rendertron).
  • Check that critical content is in the initial HTML, not injected after load.
  • Avoid modals loading content on demand as the only information surface.
  • Test with curl without JS: curl -A "PerplexityBot" https://yoursite.com/page.
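The curl test above can be turned into an automated check: fetch the raw HTML (no JavaScript executed) and verify that a critical marker string is present. A minimal sketch, run here on synthetic HTML samples (the marker and URLs are placeholders; in practice, feed it the output of `curl -sA "PerplexityBot" https://yoursite.com/page`):

```python
# Is the critical content served in the initial HTML,
# i.e. visible to crawlers that do not execute JavaScript?
def visible_without_js(html: str, critical_marker: str) -> bool:
    """True if the marker text appears in the raw HTML payload."""
    return critical_marker in html

# SSR page: content is in the initial payload.
ssr_html = "<html><body><h1>Pricing</h1><p>From 29€/month</p></body></html>"
# CSR SPA: an empty root div, content injected later by app.js.
spa_html = '<html><body><div id="root"></div><script src="/app.js"></script></body></html>'

print(visible_without_js(ssr_html, "29€/month"))  # True
print(visible_without_js(spa_html, "29€/month"))  # False: invisible to non-JS crawlers
```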

Schema.org: what actually helps

Schema.org doesn't guarantee AI citations, but it improves disambiguation and helps Google surfaces (AI Overviews, knowledge panels). Prioritise the types that map to your content: Organization, WebSite, Article or BlogPosting, FAQPage, and Product where relevant.
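A minimal JSON-LD block for a page like this one, as a sketch (the author/publisher details are assumptions; the headline and date come from this article):

```
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Article",
  "headline": "Technical optimisation for AI visibility",
  "dateModified": "2026-04-14",
  "author": { "@type": "Organization", "name": "llmoptimisation.fr" },
  "publisher": { "@type": "Organization", "name": "llmoptimisation.fr" }
}
</script>
```

Validate the result with Google's rich results test before shipping.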

Performance: as important as ever

AI bots set execution budgets: a slow site caps the pages crawled per session. Minimum rules: fast server response (TTFB), compressed and cached assets, and Core Web Vitals in the green.

The llms.txt file

A Markdown file served at the root. It offers LLMs a curated table of contents for the site. Adoption is still limited. Real usefulness: moderate. Cost: negligible. Recommendation: publish it, don't make it a priority.
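A minimal llms.txt following the proposed format, as a sketch (the section and URL are illustrative placeholders):

```
# llmoptimisation.fr

> Technical optimisation for AI visibility: crawl, rendering,
> schema.org and AI User-Agent management.

## Guides

- [Technical optimisation](https://llmoptimisation.fr/technical): crawl, rendering, robots.txt
```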

For the specification and detailed best practices, see the dedicated external resource (the sibling site llmtxt.info covers this standard in depth).

Express technical checklist

  • Explicit robots.txt with major AI bots declared.
  • WAF / Cloudflare verified: AI bots not blocked by a default WAF rule.
  • SSR or SSG; critical HTML present without JS execution.
  • Clean XML sitemap, absolute canonical on every page.
  • Validated schema.org (rich results test).
  • Core Web Vitals in the green.
  • llms.txt and llms-full.txt at the root, consistent with site structure.
  • Logs monitored for GPTBot, PerplexityBot, ClaudeBot.
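The last checklist item can start as a simple tally over your access logs. A minimal sketch over combined-log-format lines (the sample lines are synthetic; in practice, read from your server's access log):

```python
from collections import Counter

# Tokens for the AI bots worth monitoring, per the checklist above.
AI_BOT_TOKENS = ("GPTBot", "OAI-SearchBot", "PerplexityBot", "ClaudeBot")

def count_ai_bot_hits(log_lines):
    """Tally hits per AI bot from access-log lines."""
    hits = Counter()
    for line in log_lines:
        for token in AI_BOT_TOKENS:
            if token in line:
                hits[token] += 1
    return hits

# Synthetic sample lines for illustration.
sample = [
    '1.2.3.4 - - [14/Apr/2026] "GET /pricing HTTP/1.1" 200 "-" "Mozilla/5.0 (compatible; GPTBot/1.1)"',
    '5.6.7.8 - - [14/Apr/2026] "GET / HTTP/1.1" 200 "-" "Mozilla/5.0 (compatible; PerplexityBot/1.0)"',
    '9.9.9.9 - - [14/Apr/2026] "GET / HTTP/1.1" 200 "-" "Mozilla/5.0 Chrome/120"',
]
print(count_ai_bot_hits(sample))
```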
