Education · March 25, 2026 · 11 min read

Where Do AI Models Get Their Information? Understanding AI Citation Sources

Brands trying to improve their AI visibility often ask the wrong question: “How do I get ChatGPT to mention me?” The better question is: “Where does ChatGPT get its information, and is my content there?”

Two Source Types: Training Data vs. RAG

Every AI model uses some combination of two mechanisms: foundation knowledge (what's baked into model weights during training) and retrieval-augmented generation, or RAG (real-time search, with the retrieved content injected into the model's context). Understanding which mechanism a platform uses for a given query determines what you need to optimize.

Foundation Model Knowledge

Baked into model weights during training. Can't be changed after the cutoff date without retraining.

Sources: web crawls, Wikipedia, books, news archives, review sites, GitHub, Reddit

RAG Retrieval

Real-time search at inference time. Augments the model's response with freshly retrieved content.

Sources: live web search, user-uploaded documents, connected databases, browsing

The practical implication: optimizing for foundation model knowledge requires getting your content into sources the model was trained on (Wikipedia, major publications, G2/Capterra, GitHub). Optimizing for RAG means traditional web quality signals — freshness, authority, structured content, strong backlinks.
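
To make the distinction concrete, here is a minimal sketch of the RAG pattern: purely illustrative, with stub helpers standing in for a real search API and a real model call.

```python
# Illustrative sketch of retrieval-augmented generation (RAG), not any vendor's
# actual pipeline. search_web and chat_completion are stand-in stubs.

def search_web(query: str) -> list[dict]:
    # Stub: a real implementation would call a live search API.
    return [{"title": "Example page", "url": "https://example.com", "snippet": "..."}]

def chat_completion(prompt: str) -> str:
    # Stub: a real implementation would call an LLM API with this prompt.
    return "...model output..."

def answer_with_rag(question: str, top_k: int = 5) -> str:
    # 1. Retrieve fresh documents at inference time (the RAG path).
    results = search_web(question)[:top_k]

    # 2. Inject the retrieved pages into the prompt as numbered context.
    context = "\n\n".join(
        f"[{i + 1}] {doc['title']} ({doc['url']})\n{doc['snippet']}"
        for i, doc in enumerate(results)
    )
    prompt = (
        "Answer using only the sources below, citing them by [number].\n\n"
        f"{context}\n\nQuestion: {question}"
    )

    # 3. The model's weights still supply foundation knowledge; the injected
    #    context supplies anything newer than the training cutoff.
    return chat_completion(prompt)
```

Foundation-knowledge answers skip steps 1 and 2 entirely, which is why content published after a model's cutoff can't surface without retrieval.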

ChatGPT: Training Data + Optional Browsing

ChatGPT's base knowledge comes from its training data, which ends at the model's cutoff date (April 2024 for GPT-4o). For queries that require current information or web access, it can invoke a browsing/search tool when one is available and triggered for that conversation; most chatbot queries are still answered from foundation knowledge alone.

What ChatGPT was trained on

OpenAI has never published a complete data composition report, but based on technical papers and audits, GPT-4's training data included:

  • Common Crawl — filtered versions of a massive public web crawl, contributing hundreds of billions of tokens of text
  • WebText2 — Reddit outbound links with 3+ upvotes, covering ~40GB of web content
  • Books1 and Books2 — digitized book collections
  • Wikipedia — English and multilingual Wikipedia, a high-weight source
  • GitHub — code repositories (significant weight for technical topics)
  • Academic papers via Semantic Scholar and arXiv
  • News archives, review sites, and professional databases

ChatGPT browsing (when enabled)

When ChatGPT uses its browsing tool, it executes Bing searches and visits retrieved pages. For browsing queries, the same signals that matter in Bing SEO — domain authority, page speed, structured content, backlink profile — influence what gets cited. The difference from Perplexity: ChatGPT's browsing is less transparent; it doesn't always show citations even when it uses them.

Brand implication: Getting mentioned frequently in high-authority web content before a training cutoff is foundational. Wikipedia inclusion, G2 reviews, and media coverage all contribute to foundation knowledge. For ChatGPT web-search queries, modern SEO signals carry over.

Perplexity: Real-Time Search with Citations

Perplexity is the most transparent of the major AI platforms — it shows exactly which URLs it cites. This makes it uniquely useful for understanding AI citation patterns. Every response includes numbered citations linking to source pages.

Perplexity uses multiple search engines simultaneously (Google, Bing, Brave Search) and retrieves the top results. It then synthesizes content from those pages into a response. The citation selection process favors:

  • Pages ranking in the top 5 for the search query on Google/Bing
  • Content that directly answers the query — FAQ-structured pages extract more cleanly
  • High domain authority sources (Perplexity weights G2, TechCrunch, Forbes-tier sites heavily)
  • Pages with structured data that help Perplexity understand content context
  • Fresh content — Perplexity prefers pages updated within the last 12 months for most topics

For Perplexity specifically, AEO and SEO overlap the most. A page ranking #3 on Google for a commercial query will typically get cited by Perplexity. Use Surfaced to monitor which of your pages Perplexity actually links to — this reveals exactly which content is getting extracted and synthesized.
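
To make the interplay of these signals concrete, here is a toy scoring sketch. The weights are invented for illustration; this is not Perplexity's actual selection algorithm.

```python
# Toy illustration only: a made-up score combining the signals discussed above
# (search rank, direct answers, domain authority, schema, freshness).
# It is NOT Perplexity's actual selection logic.
from dataclasses import dataclass

@dataclass
class Page:
    search_rank: int              # 1 = top organic result
    answers_query_directly: bool  # FAQ-style, answer-first structure
    domain_authority: int         # 0-100
    has_structured_data: bool
    months_since_update: int

def citation_score(page: Page) -> float:
    score = 0.0
    score += max(0, 6 - page.search_rank)                      # top-5 rankings dominate
    score += 2.0 if page.answers_query_directly else 0.0
    score += page.domain_authority / 25                        # up to 4 points
    score += 1.0 if page.has_structured_data else 0.0
    score += 1.0 if page.months_since_update <= 12 else -1.0   # freshness preference
    return score

# Example: a #3-ranked, FAQ-structured page on a DA-70 domain, updated 4 months ago.
print(citation_score(Page(3, True, 70, True, 4)))  # 9.8
```

The exact numbers don't matter; the point is that rank, answerability, authority, structure, and freshness all contribute, so improving any one of them raises the odds of citation.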

Gemini: Google Index Integration

Gemini 2.0 has deep integration with Google Search. For commercial and informational queries, it retrieves content from the Google index in real-time. Google AI Overviews (the AI summary at the top of search results) uses the same underlying mechanism.

Gemini source selection reflects Google's ranking signals, with some differences:

  • Google Search rankings: the most direct influence — pages ranking well on Google are the candidate pool for Gemini synthesis.
  • E-E-A-T: Experience, Expertise, Authoritativeness, Trustworthiness — Gemini weights these quality signals heavily for health, finance, and professional topics.
  • Structured data: schema markup helps Gemini parse page content accurately. FAQ, HowTo, and Product schema improve extraction.
  • Google Business Profile: for local queries, Gemini pulls directly from GBP data — reviews, hours, categories, attributes.
  • Featured snippet eligibility: pages winning featured snippets are frequently the same pages Gemini synthesizes. Optimize for both simultaneously.

Claude: Training Data Focused

Claude 3.7 (Anthropic's current model) is primarily foundation-model based — it doesn't have persistent web search in its default configuration. Responses draw from training data, making the training data composition and cutoff the primary levers for brand visibility.

Anthropic trains Claude on a curated web dataset with strong emphasis on factual accuracy, harmlessness, and helpfulness. Key characteristics:

  • Curated quality over quantity — Claude's training prioritizes authoritative, accurate sources over sheer volume
  • Strong academic and long-form content weighting — detailed articles and research get higher relative weight
  • Technical documentation scores well — Claude is particularly strong at developer-focused content
  • Recent training cutoff (early 2025) — more current knowledge than GPT-4o's April 2024 cutoff
  • Constitutional AI filtering — content associated with manipulation, misinformation, or harmful patterns gets downweighted

For brands targeting developer and technical audiences, Claude citations are particularly valuable. Claude users are typically more sophisticated buyers making high-value decisions.

What Content Types Get Cited Most

Based on citation pattern analysis across Surfaced's monitored brand set, these content types have the highest AI citation rates — ranked from most to least cited.

  1. Review Aggregator Listings (all models): G2, Capterra, TrustRadius — cited in nearly every software recommendation query
  2. Wikipedia Pages (ChatGPT, Claude, Gemini): highest single-source weight in training data; cited for factual/background queries
  3. FAQ-Structured Pages (all models, especially Perplexity): easy to extract; directly matches query intent; FAQ schema amplifies this
  4. Comparison/Alternative Pages (all models): high intent match for evaluation-stage queries; frequently retrieved verbatim
  5. Official Documentation (Perplexity, Claude, Gemini): technical docs, API references, integration guides — trusted for specifics
  6. Data Studies & Surveys (all models): original statistics get cited as facts; high citation longevity
  7. Long-Form Guides, 2,000+ Words (ChatGPT, Claude): comprehensive coverage signals authority; multiple sections get extracted independently
  8. Press & News Coverage (ChatGPT, Perplexity): TechCrunch, Forbes, Wired, industry trades — credibility signals

Why Freshness Matters

For RAG-based systems, freshness directly affects citation probability. Perplexity's retrieval system applies a freshness preference — pages updated recently get a boost in selection. Google AI Overviews follows Google's freshness algorithm, which applies a stronger freshness signal to certain query categories.

Query categories where freshness is critical:

  • Pricing queries — outdated pricing creates AI hallucinations and user confusion
  • Feature comparison queries — last-updated timestamp signals whether information is current
  • News and announcements — time-sensitive by nature
  • Best-of lists and rankings — annual updates prevent stale recommendations
  • Tutorial and how-to content — product UIs change; outdated screenshots hurt credibility

Practical rule: any content targeting high-competition commercial queries should be reviewed and updated at minimum every 6 months. Add a visible “Last updated” date to pages — this signals freshness to both users and AI retrievers.
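
One way to expose that date to both readers and parsers is a visible timestamp paired with dateModified in schema.org Article markup. A minimal snippet, with placeholder values:

```html
<!-- Visible "Last updated" date for readers -->
<p>Last updated: March 25, 2026</p>

<!-- Machine-readable equivalent via schema.org Article markup (placeholder values) -->
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Article",
  "headline": "Example article title",
  "datePublished": "2025-09-01",
  "dateModified": "2026-03-25"
}
</script>
```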

Why Authoritative Sources Win

AI models are trained to produce accurate, trustworthy responses. To do that, they weight authoritative sources more heavily. Authority isn't just domain rating — it's a combination of signals that correlate with information quality.

Authority signals vary by platform. For training-data-based models (ChatGPT, Claude), editorial authority matters most — being cited by other authoritative sources. For RAG-based systems (Perplexity, Gemini), quantifiable ranking signals matter more — domain rating, backlink profile, click-through data.

Authority type and how to build it:

  • Editorial authority: get cited by Wikipedia, major publications, and academic sources
  • Review authority: 200+ G2/Capterra reviews with high ratings
  • Domain authority: earn backlinks from DA 50+ sites in your niche
  • Entity authority: consistent, complete presence across all authoritative directories
  • Content authority: comprehensive, accurate, frequently updated content

How to Position Content for AI Citation

Content that gets cited by AI models shares a common structure: it's directly quotable, clearly sourced, and answers a specific question. The goal is to make your content the easiest answer for the AI to extract and cite.

  • Lead with the answer: put the core answer in the first two sentences. AI models extract the most informative opening paragraph; don't bury the answer below 300 words of context.
  • Use exact question phrasing as headers: H2s and H3s formatted as questions ("How does X work?", "What is the cost of Y?") map directly to query intent and improve extraction accuracy.
  • Include specific numbers: vague claims get paraphrased; specific statistics get quoted verbatim. "67% of users reported..." is more citable than "many users said...".
  • Add visible attribution: name your sources ("According to Gartner's 2025 survey...", "Per OpenAI's technical report..."). Attributed claims are more trusted by AI models.
  • Implement schema markup: FAQ schema, HowTo schema, and Article schema help AI parsers understand your content structure. Schema-marked content gets extracted at higher rates in RAG retrieval.
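
As an example of that last point, here is a minimal FAQPage snippet with placeholder question-and-answer text; the same pattern extends to additional questions:

```html
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [
    {
      "@type": "Question",
      "name": "How does X work?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Lead with the answer in one or two sentences, then add supporting detail."
      }
    }
  ]
}
</script>
```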

Use Surfaced to test which of your pages are being cited across different AI platforms. The citation patterns reveal exactly which content formats and topics are resonating with each model's retrieval logic — and which pages need restructuring to improve citation rates.

Frequently Asked Questions

Can I submit my content directly to AI models for training?

Not directly. OpenAI, Anthropic, and Google don't accept direct training data submissions from brands. The path to training data inclusion is through the sources these models crawl: Wikipedia, major publications, G2/Capterra, GitHub, and high-authority web pages. Focus on getting your content into those sources.

Does robots.txt affect AI model crawling?

Yes. OpenAI, Anthropic, and Google publish user-agent tokens for their crawlers and training opt-outs (GPTBot, ClaudeBot, Google-Extended). You can disallow these in robots.txt, but doing so reduces your chances of appearing in training data and of being retrieved at answer time. Most brands should allow AI crawlers. Check your robots.txt for GPTBot and ClaudeBot disallow rules — many were added during the 2023 AI opt-out wave and never revisited.
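
If you want to allow them explicitly, the robots.txt directives look like this (check each vendor's documentation for the current user-agent tokens; the three below are the published ones at the time of writing):

```
# robots.txt: explicitly allow the major AI crawler and opt-out tokens
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: Google-Extended
Allow: /
```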

Why does Perplexity cite different sources than ChatGPT for the same query?

Different retrieval mechanisms. Perplexity does real-time search and cites what it finds today. ChatGPT (without browsing) uses training data from its cutoff. The same brand might appear in Perplexity but not ChatGPT if its content is fresh but wasn't prominent during ChatGPT's training period.

How do I know which AI models are citing my brand?

Manual spot-checking across platforms is a start, but it doesn't scale past a handful of queries. Surfaced automates this — tracking brand mentions and citations across 13 AI platforms on a scheduled query set, so you see exactly which models cite you and which don't.

See which AI models cite your brand

Surfaced tracks citations across ChatGPT, Perplexity, Gemini, Claude, and 9 more — so you know exactly where you're being sourced and where you're missing.

Get Started →