Core principle: write for retrieval
AI engines in search mode (ChatGPT Search, Perplexity, AI Overviews) work in two steps: a retrieval stage that pulls relevant passages from a corpus, then a generation stage that synthesises a response while citing those passages. Optimising for retrieval means making each of your paragraphs readable out of context.
Chunking: the granularity that matters
Retrieval systems split documents into chunks of a few hundred to a few thousand characters. Chunk boundaries often follow HTML structure (headings, paragraphs).
| HTML component | Role in chunking | Best practice |
|---|---|---|
| H2 | Hard boundary | One H2 = one distinct intent, with its implicit long-tail query. |
| H3 | Secondary boundary | Sub-question or sub-aspect, never decorative. |
| Paragraph | Typical chunk unit | 3 to 6 lines. One idea per paragraph. |
| List | Near-extractable as-is | Standalone items, no "see above" references. |
| Table | Extracts very well | Clear headers, short cells, avoid merged cells. |
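To see roughly what a retrieval system might do with a page, here is a minimal sketch of heading-bounded chunking in Python. It assumes one chunk per paragraph, labelled with the H2 and H3 above it. Real engines do not publish their chunking rules, so the class name `HeadingChunker` and the one-paragraph-per-chunk policy are illustrative assumptions, not a description of any specific engine.

```python
from html.parser import HTMLParser


class HeadingChunker(HTMLParser):
    """Split a page into paragraph chunks, each labelled with its H2/H3 (illustrative policy)."""

    def __init__(self):
        super().__init__()
        self.h2 = ""
        self.h3 = ""
        self.current_tag = None  # the h2/h3/p element currently being read, if any
        self.buffer = []
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("h2", "h3", "p"):
            self.current_tag = tag
            self.buffer = []

    def handle_data(self, data):
        if self.current_tag:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self.current_tag:
            return  # closing an inline tag such as <b>: keep collecting text
        text = " ".join("".join(self.buffer).split())
        if tag == "h2":
            self.h2, self.h3 = text, ""  # an H2 is a hard boundary: reset the H3
        elif tag == "h3":
            self.h3 = text
        elif text:
            self.chunks.append({"h2": self.h2, "h3": self.h3, "text": text})
        self.current_tag = None


def chunk_html(html):
    parser = HeadingChunker()
    parser.feed(html)
    parser.close()
    return parser.chunks


if __name__ == "__main__":
    page = "<h2>Pricing</h2><p>Plan A costs 10 euros.</p><h2>Support</h2><p>Email only.</p>"
    for chunk in chunk_html(page):
        print(chunk)
```

Reading the resulting chunks one by one is a quick way to spot paragraphs that only make sense with their neighbours.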
Standalone passages: test each one
Simple test: copy any paragraph of your page and paste it into an empty message to a colleague. If the paragraph stays understandable, it's standalone.
- Avoid pronouns without an antecedent ("it enables..." mid-page).
- Re-name the main entities at the start of each section.
- Define acronyms at their first local occurrence, not only at page top.
- Date time-bound statements ("in 2026", not "this year").
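Three of the four rules above (opening pronouns, undefined acronyms, relative dates) can be approximated with a rough lint pass. The sketch below is a heuristic, not a substitute for the colleague test: the word lists and the function name `lint_paragraph` are assumptions to tune for your own content, and "defined" is simplified to "appears in parentheses in the same paragraph".

```python
import re

# Hypothetical word lists; adjust them for your own content.
OPENING_PRONOUNS = {"it", "this", "that", "they", "these", "those", "he", "she"}
RELATIVE_DATES = re.compile(
    r"\b(this year|last year|next year|this month|recently|currently|today)\b",
    re.IGNORECASE,
)
ACRONYM = re.compile(r"\b[A-Z]{2,6}\b")
DEFINED_ACRONYM = re.compile(r"\(([A-Z]{2,6})\)")  # e.g. "(GEO)" after the long form


def lint_paragraph(paragraph):
    """Return a list of warnings for one paragraph read in isolation."""
    warnings = []
    words = re.findall(r"[A-Za-z']+", paragraph)
    if words and words[0].lower() in OPENING_PRONOUNS:
        warnings.append(f"opens with pronoun '{words[0]}'")
    for match in RELATIVE_DATES.finditer(paragraph):
        warnings.append(f"relative date '{match.group(0)}'")
    defined = set(DEFINED_ACRONYM.findall(paragraph))
    for acronym in sorted(set(ACRONYM.findall(paragraph)) - defined):
        warnings.append(f"acronym '{acronym}' not defined in this paragraph")
    return warnings


if __name__ == "__main__":
    for warning in lint_paragraph("It enables faster indexing this year."):
        print(warning)
```

The check is deliberately strict: it will flag well-known acronyms too, which is usually easier to review by hand than to whitelist up front.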
Citation-friendly content
A cited passage is one the model can display with confidence. It has three traits:
- A sharp claim — "Google AI Overviews rolled out broadly in 2025" is citable. "AI is changing SEO" isn't.
- Minimum context — who, what, when. No ambiguity on the subject.
- Verifiability — an external source, a published datum, an author.
Entities and disambiguation
LLMs (large language models) bind your content to entities. If your brand shares its name with something else (a plant, a person, another company), disambiguation is priority one. Techniques:
- Systematic co-occurrence with domain markers: sector, product, customer segment.
- Foundational links to Wikipedia, Wikidata, the official LinkedIn, the canonical site, via `sameAs` on `Organization` schema.
- Factual biography on an About page with dates, places, activities, sources.
- Editorial consistency: same tone, same terminology across site and adjacent channels (LinkedIn, press, podcasts).
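As an illustration of the `sameAs` technique, the sketch below builds a schema.org `Organization` object in Python and wraps it in a JSON-LD script tag. The company name, description and all URLs are placeholders invented for the example; replace them with your real profiles.

```python
import json


def organization_jsonld(name, url, description, same_as):
    """Build a schema.org Organization object with sameAs links for disambiguation."""
    return {
        "@context": "https://schema.org",
        "@type": "Organization",
        "name": name,
        "url": url,
        "description": description,
        "sameAs": list(same_as),
    }


# All values below are placeholders for a hypothetical company.
org = organization_jsonld(
    name="Example Co",
    url="https://www.example.com/",
    description="Example Co is a hypothetical B2B accounting software vendor.",
    same_as=[
        "https://www.wikidata.org/wiki/Q0000000",  # placeholder Wikidata item
        "https://www.linkedin.com/company/example-co",  # placeholder LinkedIn page
    ],
)

script_tag = (
    '<script type="application/ld+json">\n' + json.dumps(org, indent=2) + "\n</script>"
)
print(script_tag)
```

Note that the description carries the domain markers (sector, product) mentioned in the first technique, so the structured data and the visible copy reinforce each other.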
Anatomy of a GEO page
- H1 — primary query, 6 to 12 words, no superlatives.
- Lede — 2 to 4 sentences that already answer the question. First sentence standalone.
- Dates — publication + last update, visible.
- H2 "In brief" — 3 to 5 bullets, each citable as-is.
- Body — 5 to 8 H2 sections covering sub-intents.
- Table or checklist — at least one dense, extractable element.
- Contextual FAQ — 3 to 6 local (not generic) questions.
- Linking: 3 to 6 contextual internal links, 1 to 3 links to external sources.
- Author and organisation: schema.org `Article` + `Organization`.
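For the last item of the anatomy, a minimal sketch of the `Article` markup could look like the following. It uses the schema.org properties `headline`, `datePublished`, `dateModified`, `author` and `publisher`; the helper name `article_jsonld`, the headline, the names and the dates are hypothetical. The two dates should be the same ones the page shows to readers.

```python
import json
from datetime import date


def article_jsonld(headline, author_name, publisher_name, published, modified):
    """Build a schema.org Article object; dates should match those visible on the page."""
    if modified < published:
        raise ValueError("dateModified cannot be earlier than datePublished")
    return {
        "@context": "https://schema.org",
        "@type": "Article",
        "headline": headline,
        "datePublished": published.isoformat(),
        "dateModified": modified.isoformat(),
        "author": {"@type": "Person", "name": author_name},
        "publisher": {"@type": "Organization", "name": publisher_name},
    }


# Placeholder values for illustration only.
article = article_jsonld(
    headline="How retrieval systems split a web page into chunks",
    author_name="Jane Doe",
    publisher_name="Example Co",
    published=date(2026, 1, 12),
    modified=date(2026, 3, 2),
)
print(json.dumps(article, indent=2))
```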
Length, format, density
There's no magic length. A page must cover its subject, not hit a word quota. Benchmarks:
- Pillar: 2,000 to 4,000 words, 6 to 10 H2s.
- Satellite: 800 to 1,500 words, 3 to 5 H2s.
- FAQ / definition: 400 to 800 words, standalone answers.
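If you track these benchmarks across many pages, a small helper can flag outliers. The sketch below encodes the three ranges above as data; the names `BENCHMARKS` and `check_benchmark` are illustrative, and a flagged page is a prompt to review coverage, not proof that the page is too long or too short.

```python
# Ranges copied from the benchmarks above; "h2" is None where no H2 range is given.
BENCHMARKS = {
    "pillar": {"words": (2000, 4000), "h2": (6, 10)},
    "satellite": {"words": (800, 1500), "h2": (3, 5)},
    "faq": {"words": (400, 800), "h2": None},
}


def check_benchmark(page_type, word_count, h2_count):
    """Return notes for every benchmark the page falls outside of."""
    bench = BENCHMARKS[page_type]
    notes = []
    low, high = bench["words"]
    if not low <= word_count <= high:
        notes.append(f"{word_count} words is outside {low}-{high}")
    if bench["h2"] is not None:
        low, high = bench["h2"]
        if not low <= h2_count <= high:
            notes.append(f"{h2_count} H2s is outside {low}-{high}")
    return notes


if __name__ == "__main__":
    print(check_benchmark("satellite", 600, 6))
```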
Common mistakes observed
- Walls of text — 15-line paragraphs, invisible in retrieval.
- Decorative H2s: "Conclusion", "Introduction", "Learn more" — carry zero query.
- JSON-LD schemas inconsistent with visible content (missing author, fake date, wrong type).
- Unreviewed AI-generated content that stacks empty phrasing.
- Cross-page duplications that dilute authority.
- Long conditional sentences saying nothing citable.
Express checklist
- Each H2 carries a clear intent and reformulates a query.
- Each paragraph can be read in isolation.
- Every numerical claim is dated and sourced.
- Every acronym is defined at first occurrence.
- The page contains at least one table or checklist.
- The page carries a visible update date.
- The page links to at least 3 other pages on the site.
- schema.org structured data is validated.
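Some of these checks can be partly automated. The sketch below covers three of them with proxies that are assumptions, not definitions: any `<table>`, `<ul>` or `<ol>` counts as "a table or checklist", a `<time datetime="...">` element stands in for "a visible update date", and internal links are distinct same-host URLs other than the page itself. The remaining items (intent per H2, standalone paragraphs, sourcing) still need a human read.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class ChecklistAudit(HTMLParser):
    """Collect a few mechanical signals from a page (proxy checks only)."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.host = urlparse(page_url).netloc
        self.internal_links = set()
        self.has_table_or_list = False
        self.has_time_element = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("table", "ul", "ol"):
            self.has_table_or_list = True
        elif tag == "time" and attrs.get("datetime"):
            self.has_time_element = True
        elif tag == "a" and attrs.get("href"):
            # Resolve relative links, then drop the fragment so "#top" is not a new page.
            parsed = urlparse(urljoin(self.page_url, attrs["href"]))
            target = parsed._replace(fragment="").geturl()
            if parsed.netloc == self.host and target != self.page_url:
                self.internal_links.add(target)


def audit(html, page_url):
    parser = ChecklistAudit(page_url)
    parser.feed(html)
    parser.close()
    return {
        "table_or_checklist": parser.has_table_or_list,
        "visible_update_date": parser.has_time_element,
        "internal_links_ok": len(parser.internal_links) >= 3,
    }


if __name__ == "__main__":
    print(audit("<p>No links here.</p>", "https://www.example.com/guide"))
```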