Mechanism

How AI engines choose citations

AI engines pick citations using a stack of five signals: index reach (can the bot fetch the page), recency, schema completeness, answer-shaped chunks (sentences and lists that can be lifted verbatim), and third-party authority (who else links to or references the page). Each engine weights those signals differently, but the order of operations is consistent: a page that fails the first two signals never gets a chance to win on the rest.

Key facts

  • Citation decisions happen in two stages: retrieval (candidate set) then re-ranking (citation selection).
  • Stage 1 — retrieval — is gated by crawl reach + index recency. Most non-cited pages fail here.
  • Stage 2 — re-ranking — rewards FAQ schema, named entities, answer-first sentences, and inbound links.
  • ChatGPT uses Bing's index for live browse; Perplexity browses live; Gemini leans on Google's index.
  • Claude and Copilot sit between those approaches: both browse live and weight schema and entity attribution heavily.
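The two-stage flow above can be sketched as a toy pipeline. This is a minimal illustration, not any engine's actual ranker: the `Page` fields, the scoring weights, and the 30-day freshness window are all assumptions chosen for the example. What it shows is the order of operations: a page that fails the stage 1 gate is never scored at all.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Page:
    url: str
    crawlable: bool            # stage 1 gate: can the engine's bot fetch it?
    last_crawled: date         # stage 1 gate: index recency
    schema_types: int = 0      # stage 2 features (all hypothetical)
    answer_first: bool = False
    named_entities: int = 0
    inbound_links: int = 0


def retrieve(pages, today, max_age_days=30):
    """Stage 1: drop pages that are unreachable or stale."""
    cutoff = today - timedelta(days=max_age_days)
    return [p for p in pages if p.crawlable and p.last_crawled >= cutoff]


def rerank(candidates, slots=3):
    """Stage 2: score the survivors. The weights are made up for illustration."""
    def score(p):
        return (p.schema_types
                + (2 if p.answer_first else 0)
                + min(p.named_entities, 3)
                + min(p.inbound_links, 10) / 5)
    return sorted(candidates, key=score, reverse=True)[:slots]


def cite(pages, today, slots=3):
    """Run both stages and return the URLs that would fill the citation slots."""
    return [p.url for p in rerank(retrieve(pages, today), slots)]


today = date(2026, 6, 5)
pages = [
    Page("https://example.com/stale", True, date(2026, 1, 1), 4, True, 3, 50),
    Page("https://example.com/blocked", False, date(2026, 6, 1), 4, True, 3, 50),
    Page("https://example.com/fresh-thin", True, date(2026, 6, 1), 1),
    Page("https://example.com/fresh-rich", True, date(2026, 5, 20), 4, True, 2, 5),
]
print(cite(pages, today))
```

In this sample the stale and blocked pages have the strongest stage 2 features, yet neither is cited, because neither reaches the re-ranker.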

What happens in stage 1 — retrieval?

When you ask Perplexity or ChatGPT a question, the engine first builds a candidate set of 10–50 URLs from an underlying web index plus, often, a live browse. The candidate set is gated by two simple things: is the page in the engine's upstream index (Bing for ChatGPT, Google for Gemini, a mix for Perplexity and Claude), and was it crawled recently enough to be considered current?

Most pages that never get cited fail here. They aren't excluded by a quality judgement; they're absent from the candidate set entirely. Crawler reach (GPTBot, ClaudeBot, PerplexityBot, Bingbot, Googlebot) is the floor of GEO visibility. Google-Extended matters too, but as a robots.txt permission token rather than a crawler that fetches pages. See /wire for the distribution stack PressGEO uses to push new content into every relevant index within hours.
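A quick way to rule out the most common reach failure is to test your robots.txt against each bot token. The sketch below uses Python's standard `urllib.robotparser`; the `blocked_bots` helper and the token list are assumptions for illustration, and Python's parser is simpler than each crawler's own (it does not handle wildcard paths, for example), so treat the result as a first pass rather than a verdict.

```python
from urllib.robotparser import RobotFileParser

# Tokens to test. Google-Extended is included because robots.txt rules
# apply to it, even though it never appears as a user agent in logs.
AI_BOTS = ["GPTBot", "OAI-SearchBot", "ClaudeBot", "PerplexityBot",
           "Bingbot", "Googlebot", "Google-Extended"]


def blocked_bots(robots_txt, url, bots=AI_BOTS):
    """Return the bot tokens that robots_txt disallows from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse a string, no network needed
    return [bot for bot in bots if not parser.can_fetch(bot, url)]


sample = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""
print(blocked_bots(sample, "https://example.com/guide"))
```

To check a live site, fetch `/robots.txt` yourself and pass its text in; a non-empty result means at least one engine is locked out before retrieval even starts.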

What happens in stage 2 — re-ranking?

Once the candidate set exists, the engine picks the 3–5 URLs whose content can be most cleanly stitched into an answer. Five features dominate that selection:

  1. Schema completeness — FAQPage, Article, Speakable, BreadcrumbList. Pages with all four are cited more often than pages with one.
  2. Answer-shaped chunks — first sentence is the literal answer; lists and tables are present; paragraphs make one claim each.
  3. Named entities with credentials — quoted speech from named people with verifiable expertise lifts cite rates.
  4. Numeric specificity with sources — every stat carries (provider, date). Sourceless numbers are often dropped.
  5. Third-party signal — inbound links, mentions in other indexed pages. This is why backlinks still matter for GEO, just for a different reason than SEO.
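Of the schema types in item 1, FAQPage is the most mechanical to produce. The sketch below builds a schema.org FAQPage object from question and answer pairs and wraps it in a JSON-LD script tag. The helper names `faq_jsonld` and `as_script_tag` are hypothetical, and the sample pair is taken from this page; it covers only FAQPage, not Article, Speakable, or BreadcrumbList.

```python
import json


def faq_jsonld(pairs):
    """Build a schema.org FAQPage object from (question, answer) pairs."""
    return {
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": question,
                "acceptedAnswer": {"@type": "Answer", "text": answer},
            }
            for question, answer in pairs
        ],
    }


def as_script_tag(data):
    """Serialize to a JSON-LD <script> tag for the page's HTML."""
    # Escape "<" so an answer containing "</script>" cannot close the tag early.
    payload = json.dumps(data, indent=2).replace("<", "\\u003c")
    return f'<script type="application/ld+json">\n{payload}\n</script>'


faq = faq_jsonld([
    ("How does Perplexity choose citations?",
     "Perplexity does a live browse on most prompts."),
])
print(as_script_tag(faq))
```

Keep the question and answer text identical to what is visible on the page, so the markup describes content a reader can actually see.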

How do the engines differ?

ChatGPT (live browse via OAI-SearchBot) leans on Bing's index plus its own re-ranker. Pages indexed by Bingbot within 30 days, with clean schema and a clear answer-first lede, dominate citation slots.

Perplexity browses live on most prompts and favors FAQ schema, dates, and inbound links. It also disproportionately cites pages mirrored to durable URLs (Wayback Machine, GitHub Releases).

Gemini uses Google's index, so Search Console health directly affects citation eligibility. Speakable and E-E-A-T signals (author credentials, peer-reviewed sources) matter more here than on other engines.

Claude (ClaudeBot) crawls aggressively and weights named-entity attribution heavily. Pages without quoted speakers are routinely passed over.

Copilot mirrors ChatGPT's behavior closely because both lean on Bing. Grok over-indexes on recency and social signal — fresh dates and inbound mentions move it most.

How do I diagnose why I'm not being cited?

Three checks, in order:

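  1. Are GPTBot, ClaudeBot, PerplexityBot, Bingbot, and Googlebot fetching your pages? Check your server or CDN logs. If counts are zero, fix crawl access and indexation first. Google-Extended will not show up in logs because it is a robots.txt token, not a user agent; check your robots.txt for it instead.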
  2. Do your top pages score above 70 on the free GEO Score tool? If not, fix the highest-impact failing rules.
  3. Are you monitoring the right prompts? PressGEO's /proof page and weekly research publish the prompt set we track.

"Almost every team we audit thinks they have a writing problem. Most of them have a crawl problem first. If GPTBot hasn't fetched the page in 30 days, the lede doesn't matter."
PressGEO Research, Editorial team. From 200+ customer GEO audits, Q2 2026 (PressGEO internal data)
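The first check can be run against raw access logs with a few lines of Python. This sketch assumes combined log format, where the user agent is the last double-quoted field on each line; the `count_bot_fetches` helper, the token list, and the sample log lines are illustrative. It looks for Googlebot rather than Google-Extended, since the latter is a robots.txt token and has no user agent string of its own.

```python
import re

BOT_TOKENS = ["GPTBot", "OAI-SearchBot", "ClaudeBot", "PerplexityBot",
              "bingbot", "Googlebot"]

# Last double-quoted field on the line (the user agent in combined log format).
UA_FIELD = re.compile(r'"([^"]*)"\s*$')


def count_bot_fetches(log_lines, tokens=BOT_TOKENS):
    """Count requests per bot token, matching case-insensitively on user agent."""
    counts = {token: 0 for token in tokens}
    for line in log_lines:
        match = UA_FIELD.search(line)
        if not match:
            continue  # line is not in the assumed format
        user_agent = match.group(1).lower()
        for token in tokens:
            if token.lower() in user_agent:
                counts[token] += 1
    return counts


# Made-up sample lines for illustration.
sample_log = [
    '203.0.113.5 - - [01/Jun/2026:10:00:00 +0000] "GET /guide HTTP/1.1" 200 5123 "-" "Mozilla/5.0 (compatible; GPTBot/1.1; +https://openai.com/gptbot)"',
    '203.0.113.9 - - [01/Jun/2026:10:05:00 +0000] "GET /guide HTTP/1.1" 200 5123 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"',
    '198.51.100.7 - - [01/Jun/2026:10:06:00 +0000] "GET / HTTP/1.1" 200 812 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"',
]
print(count_bot_fetches(sample_log))
```

User agent strings can be spoofed, so a non-zero count is a hint rather than proof. A zero across 30 days of logs, though, is a strong sign the problem is reach and not writing.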

Frequently asked questions

How does ChatGPT decide which sources to cite?
ChatGPT (via OAI-SearchBot) leans heavily on Bing's index for live browse, then re-ranks using its own relevance and trust signals. Pages indexed by Bing within the last 30 days with clean schema, named-entity attribution, and an answer-first opening sentence are disproportionately likely to be cited.
How does Perplexity choose citations?
Perplexity does a live browse on most prompts and favors pages with FAQPage and Article JSON-LD, recent dates, and external inbound links. It also favors pages with structured comparison tables and numbered lists because they're trivially chunked into answer paragraphs.
Does Google Gemini use Search Console signals?
Yes. Pages indexed and ranking well in Google Search are the candidate set for Gemini citations, with additional weighting for Speakable schema, E-E-A-T signals (author credentials, references to peer-reviewed sources), and recency. Google-Extended is a robots.txt token, not a separate crawler: it controls whether content Google has already crawled may be used for Gemini.
What's the single highest-leverage change to earn more AI citations?
Add an answer-first lede plus a FAQPage JSON-LD block to your top 5 pages. In PressGEO's internal data, this single pairing moves citation rate more than any other change a team can make in one afternoon.
How do I know if AI crawlers are actually fetching my site?
Check your server logs (or your CDN's bot analytics) for GPTBot, ClaudeBot, PerplexityBot, OAI-SearchBot, Googlebot, and Bingbot. Google-Extended will not appear there because it is a robots.txt token, not a user agent. If those bots show zero fetches over the last 30 days, no amount of content investment will earn citations, because the engines cannot see your pages.

Last updated: June 5, 2026 · By PressGEO Research