Where Do AI Models Get Their Information? Understanding AI Citation Sources
Brands trying to improve their AI visibility often ask the wrong question: “How do I get ChatGPT to mention me?” The better question is: “Where does ChatGPT get its information, and is my content there?”
Two Source Types: Training Data vs. RAG
Every AI model uses some combination of two mechanisms: foundation knowledge (what's baked into model weights during training) and RAG retrieval (real-time search and context injection). Understanding which mechanism a platform uses for a given query determines what you need to optimize.
Foundation Model Knowledge
Baked into model weights during training. Can't be changed after cutoff date without retraining.
RAG Retrieval
Real-time search at inference time. Augments model response with fresh retrieved content.
The practical implication: optimizing for foundation model knowledge requires getting your content into sources the model was trained on (Wikipedia, major publications, G2/Capterra, GitHub). Optimizing for RAG means traditional web quality signals — freshness, authority, structured content, strong backlinks.
ChatGPT: Training Data + Optional Browsing
ChatGPT's base knowledge comes from its training data cutoff (April 2024 for GPT-4o). For queries that require current information or web access, it can invoke a browsing tool — but users must be on ChatGPT Plus/Pro and browsing must be enabled. Most chatbot queries use foundation knowledge only.
What ChatGPT was trained on
OpenAI has never published a complete data composition report, but based on technical papers and audits, GPT-4's training data included:
- Common Crawl — a massive filtered web crawl spanning billions of pages
- WebText2 — Reddit outbound links with 3+ upvotes, roughly 40GB of web text
- Books1 and Books2 — digitized book collections
- Wikipedia — English and multilingual Wikipedia, a high-weight source
- GitHub — code repositories (significant weight for technical topics)
- Academic papers via Semantic Scholar and arXiv
- News archives, review sites, and professional databases
ChatGPT browsing (when enabled)
When ChatGPT uses its browsing tool, it executes Bing searches and visits retrieved pages. For browsing queries, the same signals that matter in Bing SEO — domain authority, page speed, structured content, backlink profile — influence what gets cited. The difference from Perplexity: ChatGPT's browsing is less transparent; it doesn't always show citations even when it uses them.
Brand implication: Getting mentioned frequently in high-authority web content before a training cutoff is foundational. Wikipedia inclusion, G2 reviews, and media coverage all contribute to foundation knowledge. For ChatGPT web-search queries, modern SEO signals carry over.
Perplexity: Real-Time Search with Citations
Perplexity is the most transparent of the major AI platforms — it shows exactly which URLs it cites. This makes it uniquely useful for understanding AI citation patterns. Every response includes numbered citations linking to source pages.
Perplexity uses multiple search engines simultaneously (Google, Bing, Brave Search) and retrieves the top results. It then synthesizes content from those pages into a response. The citation selection process favors:
- Pages ranking in the top 5 for the search query on Google/Bing
- Content that directly answers the query — FAQ-structured pages extract more cleanly
- High domain authority sources (Perplexity weights G2, TechCrunch, Forbes-tier sites heavily)
- Pages with structured data that help Perplexity understand content context
- Fresh content — Perplexity prefers pages updated within the last 12 months for most topics
For Perplexity specifically, AEO and SEO overlap the most. A page ranking #3 on Google for a commercial query will typically get cited by Perplexity. Use Surfaced to monitor which of your pages Perplexity actually links to — this reveals exactly which content is getting extracted and synthesized.
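The structured-data point above is easiest to see in markup. A minimal sketch of an FAQPage block using the schema.org vocabulary is shown below; the question and answer text are placeholders, not real pricing:

```json
{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [
    {
      "@type": "Question",
      "name": "How much does the product cost?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Plans start at $29/month, billed annually. A free tier is available."
      }
    }
  ]
}
```

Embedding a block like this in a `<script type="application/ld+json">` tag gives retrieval systems a clean question-and-answer pair to extract, mirroring the FAQ structure that already works well in the visible page copy.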
Gemini: Google Index Integration
Gemini 2.0 has deep integration with Google Search. For commercial and informational queries, it retrieves content from the Google index in real-time. Google AI Overviews (the AI summary at the top of search results) uses the same underlying mechanism.
Gemini source selection largely reflects Google's ranking signals, with some differences in how sources are weighted when content is summarized rather than linked.
Claude: Training Data Focused
Claude 3.7 (Anthropic's current model) is primarily foundation-model based — it doesn't have persistent web search in its default configuration. Responses draw from training data, making the training data composition and cutoff the primary levers for brand visibility.
Anthropic trains Claude on a curated web dataset with strong emphasis on factual accuracy, harmlessness, and helpfulness. Key characteristics:
- Curated quality over quantity — Claude's training prioritizes authoritative, accurate sources over sheer volume
- Strong academic and long-form content weighting — detailed articles and research get higher relative weight
- Technical documentation scores well — Claude is particularly strong at developer-focused content
- Recent training cutoff (early 2025) — more current knowledge than GPT-4o's April 2024 cutoff
- Constitutional AI filtering — content associated with manipulation, misinformation, or harmful patterns gets downweighted
For brands targeting developer and technical audiences, Claude citations are particularly valuable. Claude users are typically more sophisticated buyers making high-value decisions.
What Content Types Get Cited Most
Based on citation pattern analysis across Surfaced's monitored brand set, some content types consistently earn far higher AI citation rates than others.
Why Freshness Matters
For RAG-based systems, freshness directly affects citation probability. Perplexity's retrieval system applies a freshness preference — pages updated recently get a boost in selection. Google AI Overviews follows Google's freshness algorithm, which applies a stronger freshness signal to certain query categories.
Query categories where freshness is critical:
- Pricing queries — outdated pricing creates AI hallucinations and user confusion
- Feature comparison queries — a last-updated timestamp signals whether information is current
- News and announcements — time-sensitive by nature
- Best-of lists and rankings — annual updates prevent stale recommendations
- Tutorial and how-to content — product UIs change; outdated screenshots hurt credibility
Practical rule: any content targeting high-competition commercial queries should be reviewed and updated at minimum every 6 months. Add a visible “Last updated” date to pages — this signals freshness to both users and AI retrievers.
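The visible date works best when paired with machine-readable metadata. A sketch of how the two fit together on an article page (the date, headline, and class name here are placeholders):

```html
<article>
  <h1>Feature Comparison: Acme vs. Competitors</h1>
  <!-- Visible freshness signal for human readers -->
  <p class="last-updated">Last updated: June 2025</p>
  <!-- ... article body ... -->
</article>

<!-- Machine-readable freshness signal via schema.org Article markup -->
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Article",
  "headline": "Feature Comparison: Acme vs. Competitors",
  "dateModified": "2025-06-15"
}
</script>
```

Keeping the visible date and the `dateModified` value in sync matters: a mismatch between the two is itself a staleness signal.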
How to Position Content for AI Citation
Content that gets cited by AI models shares a common structure: it's directly quotable, clearly sourced, and answers a specific question. The goal is to make your content the easiest answer for the AI to extract and cite.
Use Surfaced to test which of your pages are being cited across different AI platforms. The citation patterns reveal exactly which content formats and topics are resonating with each model's retrieval logic — and which pages need restructuring to improve citation rates.
Frequently Asked Questions
Can I submit my content directly to AI models for training?
Not directly. OpenAI, Anthropic, and Google don't accept direct training data submissions from brands. The path to training data inclusion is through the sources these models crawl: Wikipedia, major publications, G2/Capterra, GitHub, and high-authority web pages. Focus on getting your content into those sources.
Does robots.txt affect AI model crawling?
Yes. OpenAI, Anthropic, and Google each publish their own crawler user agents and control tokens (GPTBot, ClaudeBot, and Google-Extended, respectively). You can disallow these in robots.txt, but doing so means you won't be cited. Most brands should allow AI crawlers. Check your robots.txt for GPTBot and ClaudeBot disallow rules — many were set during the 2023 AI opt-out wave and never revisited.
Why does Perplexity cite different sources than ChatGPT for the same query?
Different retrieval mechanisms. Perplexity does real-time search and cites what it finds today. ChatGPT (without browsing) uses training data from its cutoff. The same brand might appear in Perplexity but not ChatGPT if their content is fresh but wasn't prominent during ChatGPT's training period.
How do I know which AI models are citing my brand?
Manual spot-checking across platforms is a start, but it doesn't scale past a handful of queries. Surfaced automates this — tracking brand mentions and citations across 13 AI platforms on a scheduled query set, so you see exactly which models cite you and which don't.
See which AI models cite your brand
Surfaced tracks citations across ChatGPT, Perplexity, Gemini, Claude, and 9 more — so you know exactly where you're being sourced and where you're missing.
Get Started →