The AI bots to know
Each engine family uses one or more User-Agents. Some serve training crawls, others serve real-time search crawls.
| User-Agent | Publisher | Main use |
|---|---|---|
| GPTBot | OpenAI | Crawl for training and product improvement. |
| OAI-SearchBot | OpenAI | Crawl for ChatGPT Search. |
| ChatGPT-User | OpenAI | Fetch at the moment of a user query. |
| PerplexityBot | Perplexity | Main Perplexity index. |
| Perplexity-User | Perplexity | User-triggered fetch. |
| ClaudeBot / anthropic-ai / Claude-Web | Anthropic | Crawl for training and Claude search. |
| Google-Extended | Google | Robots directive for Gemini / Vertex AI training (separate from Googlebot). |
| Applebot-Extended | Apple | Robots directive for Apple Intelligence training. |
| Bytespider | ByteDance | Aggressive crawl, often blocked by default. |
| CCBot | Common Crawl | Corpus used by many open-source models. |
| meta-externalagent | Meta | Crawl for Meta AI training. |
| cohere-ai | Cohere | Crawl for Cohere training. |
robots.txt — the right default
A neutral, visibility-oriented configuration lets all major AI bots through. That's the stance adopted by llmoptimisation.fr — documented and owned.
```txt
User-agent: *
Allow: /

User-agent: GPTBot
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: Google-Extended
Allow: /

Sitemap: https://yoursite.com/sitemap-index.xml
```

Block or allow: a business call
Three coherent postures depending on your model:
| Profile | Recommended posture | Reason |
|---|---|---|
| Marketing / SaaS / B2B services | Let all major AI bots through | Maximise visibility and citations. |
| E-commerce | Allow through, but protect sensitive product data (dynamic pricing, stock) | Product pages carry marketing value for AI; real-time feeds shouldn't be crawled. |
| Paywalled or paid media | Block, or negotiate commercial licensing | Preserve content value. OpenAI, Google and Perplexity sign licensing deals with major publishers. |
| Premium, non-indexable content | Explicit block | Editorial and legal consistency. |
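As an illustration of the "block training, keep search" posture a paywalled publisher might adopt, a robots.txt along these lines works (the user-agent names are the real ones from the table above; the exact split between allowed and blocked agents is a business decision, not a technical one):

```txt
# Block training crawlers
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

# Keep real-time search fetchers: their citations can drive traffic
User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /
```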
Rendering and crawl: the JS trap
Not all AI engines execute JavaScript. Perplexity, ChatGPT Search and many crawlers rely on static HTML or a fast render. Sites that render content client-side (non-SSR React/Vue SPAs) can be partially or entirely invisible to AI.
- Prefer SSR (Server-Side Rendering) or SSG (Static Site Generation).
- For existing SPAs, set up pre-rendering for bots (Prerender.io, Rendertron).
- Check that critical content is in the initial HTML, not injected after load.
- Avoid modals loading content on demand as the only information surface.
- Test with curl without JS: `curl -A "PerplexityBot" https://yoursite.com/page`
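The same check can be automated. A minimal sketch (the function name and the sample HTML strings are illustrative, not from any real site): compare the raw server response, which is all a no-JS crawler sees, against the critical phrases you expect on the page.

```python
def critical_content_visible(initial_html: str, phrases: list[str]) -> list[str]:
    """Return the phrases missing from the initial (pre-JavaScript) HTML."""
    return [p for p in phrases if p not in initial_html]

# SSR page: the content is present in the server response itself.
ssr_html = "<html><body><h1>Pricing</h1><p>Plans from 29 EUR/month</p></body></html>"

# SPA shell: the content only arrives later, injected by JavaScript.
spa_html = '<html><body><div id="root"></div><script src="/app.js"></script></body></html>'

phrases = ["Pricing", "29 EUR/month"]
print(critical_content_visible(ssr_html, phrases))  # [] -> visible to AI bots
print(critical_content_visible(spa_html, phrases))  # both phrases missing
```

In practice you would feed `initial_html` from a plain HTTP fetch (curl or `urllib`) rather than a hard-coded string; the point is that the comparison runs on the response body, never on the rendered DOM.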
Schema.org: what actually helps
Schema.org doesn't guarantee AI citations, but it improves disambiguation and helps Google surfaces (AI Overviews, panels). Priorities:
- `Organization` — brand identity, `sameAs` to official profiles, logo.
- `WebSite` + `SearchAction` — on the home page.
- `Article` / `TechArticle` — on pillar pages.
- `BreadcrumbList` — everywhere.
- `FAQPage` — on FAQ pages, not every page (otherwise dilution).
- `HowTo` — on method pages structured as steps.
- `Product`, `Review`, `AggregateRating` — e-commerce.
- `DefinedTerm`, `DefinedTermSet` — glossaries.
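For the top-priority `Organization` type, a minimal JSON-LD block looks like this (all names and URLs below are placeholders to adapt to your own brand):

```json
{
  "@context": "https://schema.org",
  "@type": "Organization",
  "name": "YourBrand",
  "url": "https://yoursite.com",
  "logo": "https://yoursite.com/logo.png",
  "sameAs": [
    "https://www.linkedin.com/company/yourbrand",
    "https://x.com/yourbrand"
  ]
}
```

It goes in a `<script type="application/ld+json">` tag, ideally on every page, and can be validated with Google's Rich Results Test.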
Performance: still and always
AI bots set execution budgets. A slow site caps the pages crawled per session. Minimum rules:
- LCP < 2.5 s, CLS < 0.1, INP < 200 ms.
- Compressed HTML (gzip / brotli).
- HTTP/2 or HTTP/3.
- Modern images (AVIF / WebP), `loading="lazy"`.
- No unnecessary render-blocking CSS.
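Compression is the cheapest win on that list. A quick sketch with the standard-library `gzip` module shows the order of magnitude on repetitive HTML (real pages vary; brotli typically compresses a few percent better but needs a third-party module):

```python
import gzip

# Synthetic HTML payload: markup is highly repetitive, so it compresses well.
html = ("<html><body>"
        + "<p>Repeated paragraph of page content.</p>" * 200
        + "</body></html>").encode()

compressed = gzip.compress(html)
ratio = len(compressed) / len(html)
print(f"{len(html)} bytes -> {len(compressed)} bytes ({ratio:.0%} of original)")
```

On a real server this is a one-line config change (`gzip on;` in nginx, `mod_deflate` in Apache), not application code.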
The llms.txt file
A Markdown file served at the root. It offers LLMs a curated table of contents for the site. Adoption is still limited. Real usefulness: moderate. Cost: negligible. Recommendation: publish it, don't make it a priority.
For the specification and detailed best practices, see the dedicated external resource (the sibling site llmtxt.info covers this standard in depth).
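To make "cost: negligible" concrete, a minimal llms.txt is just a short Markdown outline served at `/llms.txt` (the titles and paths below are placeholders; see the spec for the full format):

```markdown
# YourBrand

> One-line description of what the site covers and who it serves.

## Key pages

- [Getting started](https://yoursite.com/getting-started): setup guide
- [Pricing](https://yoursite.com/pricing): plans and terms

## Optional

- [Changelog](https://yoursite.com/changelog)
```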
Express technical checklist
- Explicit robots.txt with major AI bots declared.
- WAF / Cloudflare verified: AI bots not blocked by default WAF rule.
- SSR or SSG; critical HTML present without JS execution.
- Clean XML sitemap, absolute canonical on every page.
- Validated schema.org (rich results test).
- Core Web Vitals in the green.
- llms.txt and llms-full.txt at the root, consistent with site structure.
- Logs monitored for GPTBot, PerplexityBot, ClaudeBot.
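The last checklist item can start as a few lines of Python run over your access logs. A minimal sketch (the log lines below are fabricated samples in combined log format; point the loop at your real log file instead):

```python
from collections import Counter

AI_BOTS = ["GPTBot", "OAI-SearchBot", "PerplexityBot", "ClaudeBot",
           "Google-Extended", "CCBot", "Bytespider"]

# Fabricated sample lines; in production, iterate over open("access.log").
log_lines = [
    '1.2.3.4 - - [10/May/2025:10:00:00 +0000] "GET /guide HTTP/1.1" 200 5120 '
    '"-" "Mozilla/5.0; compatible; GPTBot/1.2; +https://openai.com/gptbot"',
    '5.6.7.8 - - [10/May/2025:10:01:00 +0000] "GET /pricing HTTP/1.1" 200 2048 '
    '"-" "Mozilla/5.0 (compatible; PerplexityBot/1.0; +https://perplexity.ai/perplexitybot)"',
    '9.9.9.9 - - [10/May/2025:10:02:00 +0000] "GET / HTTP/1.1" 200 1024 '
    '"-" "Mozilla/5.0 (Windows NT 10.0) Chrome/124.0"',
]

# Count hits per AI user-agent by simple substring match on each line.
hits = Counter()
for line in log_lines:
    for bot in AI_BOTS:
        if bot in line:
            hits[bot] += 1

print(dict(hits))
```

A sudden drop to zero for a bot you allow in robots.txt usually means a WAF or CDN rule is blocking it upstream, which is exactly what the checklist's WAF item is meant to catch.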