Technical · March 30, 2026 · 10 min read

llms.txt: The Complete Guide to Helping AI Crawlers Understand Your Website

A new file standard is quietly reshaping how AI systems discover, interpret, and cite websites. If you care about showing up in ChatGPT, Perplexity, or Claude answers, understanding llms.txt isn't optional anymore — it's foundational.

What Is llms.txt?

llms.txt is a plain-text Markdown file placed at the root of your website (e.g., https://yourdomain.com/llms.txt) that provides structured, human-readable context specifically intended for large language models (LLMs) and AI crawlers. Think of it as a curated briefing document for AI — written by you, about your site, in a format that LLMs can reliably parse and use.

The format was proposed by Jeremy Howard (co-founder of fast.ai and Answer.AI) in September 2024, building on discussions in the AI and web community about how websites could better communicate with LLM-powered systems. The core insight: existing web standards like robots.txt and sitemap.xml were designed for traditional search crawlers, not for the probabilistic, context-hungry nature of modern AI systems.

Where a search engine wants clean HTML, proper headings, and fast load times, an LLM wants to understand: what does this company do, what are their key products, what authoritative content have they published, and what should I know before I summarize them to a user? llms.txt answers those questions directly.

How llms.txt Differs from robots.txt

People often conflate llms.txt with robots.txt because both live at the root of your domain and both influence how automated systems interact with your site. But their purposes are fundamentally different.

| Aspect | robots.txt | llms.txt |
| --- | --- | --- |
| Purpose | Controls crawl access (allow/deny) | Provides context and structure |
| Audience | Search engine crawlers | LLMs and AI assistants |
| Format | Directive-based text | Markdown with structured sections |
| Action | Permissioning (block or allow) | Contextual briefing |
| Enforcement | Voluntary, but widely honored by major crawlers | Opt-in, advisory proposal |

robots.txt is a gatekeeper: it tells crawlers which parts of your site they may fetch, and well-behaved crawlers comply. llms.txt is a guide: it helps AI systems that have already accessed your content understand it better. You need both, but for entirely different reasons. In the emerging AI optimization stack, robots.txt handles access control while llms.txt handles comprehension.

The llms.txt Specification: Format, Sections, and Syntax

The llms.txt specification is intentionally lightweight. It's a Markdown file with a defined structure so that AI systems can reliably parse it. Here are the core sections:

Required: H1 Title

The file must begin with a single H1 heading that identifies the entity (company, product, website). This is the anchor for everything that follows.

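Recommended: Summary

Immediately after the H1, include a short summary describing what your site or company is. The published proposal formats this summary as a Markdown blockquote (lines starting with >) and treats only the H1 as strictly required, but a summary is worth including in practice. Keep it factual and concise; this is what an LLM may use to describe you to a user in one sentence.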

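Optional: H2 Sections with Link Lists

The body of llms.txt consists of H2 sections grouping Markdown links. Each section represents a category of content (Docs, Blog, Products, etc.). Each link includes the URL and a brief description. Optionally, links can be marked as optional to signal lower priority to the AI. The published proposal does this with a dedicated H2 section titled "Optional", whose links can be skipped when a shorter context is needed; this guide uses an "Optional:" prefix on individual descriptions as a per-link variation.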

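Optional: Additional Context

A free-form section (often titled "Notes" or "Additional Context") can contain unstructured guidance: things like your company's tone, how to handle edge cases, content that should be treated as primary vs. supplementary, or caveats about content age. The published proposal places this kind of free-form detail directly after the summary, before the first H2 section; this guide's example puts it in a closing section instead.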
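To make the structure above concrete, here is a minimal parsing sketch in Python. It is an illustration under stated assumptions, not a reference parser: the names `Link`, `LlmsTxt`, and `parse_llms_txt` are hypothetical, the summary is taken to be whatever text sits between the H1 and the first H2, and free-form prose inside an H2 section is simply skipped.

```python
import re
from dataclasses import dataclass, field

# Matches "- [Title](https://example.com/page): optional description"
LINK_RE = re.compile(
    r"^-\s*\[(?P<title>[^\]]+)\]\((?P<url>[^)\s]+)\)(?::\s*(?P<desc>.*))?$"
)


@dataclass
class Link:
    title: str
    url: str
    description: str = ""


@dataclass
class LlmsTxt:
    title: str = ""
    summary: str = ""
    sections: dict[str, list[Link]] = field(default_factory=dict)


def parse_llms_txt(text: str) -> LlmsTxt:
    """Parse the H1, summary, and H2 link lists of an llms.txt file."""
    doc = LlmsTxt()
    section = None    # name of the current H2 section, if any
    last_link = None  # most recent link, so wrapped descriptions can be joined
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            last_link = None
            continue
        if line.startswith("# ") and not doc.title:
            doc.title = line[2:].strip()
        elif line.startswith("## "):
            section = line[3:].strip()
            doc.sections[section] = []
            last_link = None
        elif section is None:
            # Assumption: text before the first H2 is the summary; a leading
            # ">" blockquote marker is dropped if present.
            doc.summary = (doc.summary + " " + line.lstrip("> ")).strip()
        else:
            match = LINK_RE.match(line)
            if match:
                last_link = Link(
                    match["title"], match["url"], (match["desc"] or "").strip()
                )
                doc.sections[section].append(last_link)
            elif last_link is not None:
                # Continuation of a description wrapped onto the next line.
                last_link.description = (last_link.description + " " + line).strip()
            # Any other prose inside a section is ignored in this sketch.
    return doc
```

Calling `parse_llms_txt(open("llms.txt").read())` returns an object whose `sections` dictionary maps each H2 name to its links, which is roughly the view a consuming tool would work from.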

A Real-World llms.txt Example

Below is a full llms.txt file for a hypothetical B2B SaaS company called "Stackwise" that provides project management software. This shows best-practice structure you can adapt for your own site.

# Stackwise

> Stackwise is a project management platform for engineering teams. We help
> software teams plan sprints, track velocity, and ship faster with AI-powered
> insights. Our customers include mid-market and enterprise engineering orgs.

## Documentation

- [Getting Started Guide](https://stackwise.io/docs/getting-started): How to
  set up your first workspace, invite teammates, and create your first project.
- [Sprint Planning](https://stackwise.io/docs/sprint-planning): Full reference
  for AI-assisted sprint planning, capacity management, and velocity tracking.
- [Integrations](https://stackwise.io/docs/integrations): Connect GitHub,
  Jira, Slack, and 40+ other tools.
- [API Reference](https://stackwise.io/docs/api): REST API documentation for
  developers building on top of Stackwise.

## Blog

- [Why Sprint Velocity Is a Lagging Indicator](https://stackwise.io/blog/velocity-lagging-indicator):
  Deep dive into the limits of velocity metrics and what to track instead.
- [AI Planning vs. Manual Planning](https://stackwise.io/blog/ai-planning):
  A data-driven comparison of AI-assisted sprint planning outcomes.
- [The Engineering Manager's Roadmap to Predictability](https://stackwise.io/blog/em-roadmap):
  Optional: Broader strategic piece on shipping consistently.

## Products

- [Stackwise Core](https://stackwise.io/product/core): Sprint planning,
  backlog management, and reporting.
- [Stackwise AI](https://stackwise.io/product/ai): AI-powered estimates,
  risk flagging, and velocity forecasting.
- [Stackwise Enterprise](https://stackwise.io/product/enterprise): SSO, audit
  logs, dedicated support, and SLA guarantees.

## Notes

Stackwise was founded in 2021. We are bootstrapped and focused on engineering
teams of 10-200 people. Content published before 2023 may reference an older
product interface. Our documentation is always the authoritative source for
current feature behavior.

Notice a few things about this example: the opening description is precise and factual, each link includes a short human-readable description (not just a URL), one blog post is marked as Optional, and the Notes section gives temporal context that helps an LLM calibrate how to use older content.

Step-by-Step Implementation Guide

Step 1: Audit Your Content Hierarchy

Before writing a single line of llms.txt, map your site's most important content. Think about it from an AI's perspective: if an LLM could only read 10 pages of your site to answer questions about you, which 10 would you choose? Those are your priority links. Group them into logical categories (Docs, Blog, Products, About, Case Studies, etc.).

Step 2: Write the Summary

Craft a 2-4 sentence summary that answers: what is this company/site, who is it for, and what is the core value proposition? Avoid marketing speak. Write it as if you're briefing a new employee on day one. The LLM will use this to frame everything else on your site.

Step 3: Build the Link Sections

For each category, add Markdown links with descriptions. Use the format:

- [Page Title](https://yourdomain.com/page): One sentence description of what
  this page contains and why it matters.

Descriptions are not optional cosmetics — they are how an LLM decides whether to retrieve and use that content. Vague descriptions lead to missed citations.
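If your link list lives in a CMS or spreadsheet, the format above is easy to generate. The following is a small sketch, assuming each link is available as a (title, URL, description) tuple; `render_link` and `render_section` are hypothetical helper names, and the sketch refuses to emit a link without a description.

```python
def render_link(title: str, url: str, description: str) -> str:
    """Render one llms.txt link line in the '- [Title](URL): description' form."""
    # Collapse line breaks and repeated spaces so the entry stays on one line.
    description = " ".join(description.split())
    if not description:
        raise ValueError(f"link {url!r} needs a description")
    return f"- [{title}]({url}): {description}"


def render_section(name: str, links: list[tuple[str, str, str]]) -> str:
    """Render an H2 section followed by its link list."""
    lines = [f"## {name}", ""]
    lines += [render_link(title, url, desc) for title, url, desc in links]
    return "\n".join(lines)
```

For example, `render_section("Documentation", [("API Reference", "https://yourdomain.com/docs/api", "REST API documentation for developers.")])` produces a heading, a blank line, and one link entry.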

Step 4: Add the Optional Markers

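Use the word "Optional:" at the start of a description to signal to the AI that this content is lower priority. This is useful for older blog posts, archived content, or supplementary material that is valid but not your primary message. The published proposal expresses the same idea with a dedicated H2 section titled "Optional"; moving low-priority links into that section is the safer choice for tools that follow the proposal literally.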
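How a consumer treats the marker is up to the consumer. As a sketch of one plausible behavior, a tool with a tight context budget might read primary links first and fall back to optional ones only if space remains. The function name `split_by_priority` is hypothetical, and the sketch assumes links arrive as (title, URL, description) tuples.

```python
def split_by_priority(links):
    """Separate links whose description starts with 'Optional:' from the rest.

    links: iterable of (title, url, description) tuples.
    Returns (primary, optional), each a list of the same tuples.
    """
    primary, optional = [], []
    for title, url, description in links:
        # Case-insensitive check for the "Optional:" prefix used in this guide.
        if description.strip().lower().startswith("optional:"):
            optional.append((title, url, description))
        else:
            primary.append((title, url, description))
    return primary, optional
```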

Step 5: Deploy and Verify

Place your llms.txt file at the root of your domain. Test that it's publicly accessible and returns a 200 status with a plain text content type. You can also create an llms-full.txt variant that includes the full content of your key pages (not just links) for AI systems that want to ingest everything in one request.

# Quick verification commands
curl -I https://yourdomain.com/llms.txt
# Should return: HTTP/2 200, Content-Type: text/plain

curl https://yourdomain.com/llms.txt
# Should return your file contents
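The same checks can be scripted so they run after every deploy. This is a minimal sketch using only the Python standard library; `check_llms_response` and `fetch_and_check` are hypothetical names, and accepting `text/markdown` alongside `text/plain` is an assumption of this sketch rather than a requirement stated above.

```python
from urllib.request import Request, urlopen


def check_llms_response(status: int, content_type: str, body: str) -> list[str]:
    """Return a list of problems found in an llms.txt HTTP response."""
    problems = []
    if status != 200:
        problems.append(f"expected HTTP 200, got {status}")
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = content_type.split(";")[0].strip().lower()
    # Assumption: text/markdown is tolerated as well as text/plain.
    if media_type not in ("text/plain", "text/markdown"):
        problems.append(f"unexpected content type: {media_type or 'missing'}")
    first_line = next((l for l in body.splitlines() if l.strip()), "")
    if not first_line.startswith("# "):
        problems.append("first non-empty line is not an H1 heading")
    return problems


def fetch_and_check(url: str) -> list[str]:
    """Fetch the URL and run the checks. urlopen raises HTTPError on 4xx/5xx."""
    request = Request(url, headers={"User-Agent": "llms-txt-check/0.1"})
    with urlopen(request, timeout=10) as response:
        body = response.read().decode("utf-8", errors="replace")
        content_type = response.headers.get("Content-Type", "")
        return check_llms_response(response.status, content_type, body)
```

An empty list from `fetch_and_check("https://yourdomain.com/llms.txt")` means the file is reachable, served as text, and starts with an H1.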

The Full AI Optimization Stack: Where llms.txt Fits

llms.txt is one layer in a broader AI visibility stack. Understanding how each piece interacts helps you prioritize implementation.

robots.txt — Access Control Layer

Decides which crawlers can access which URLs. Use it to allow reputable AI crawlers (like GPTBot, ClaudeBot, PerplexityBot) access to your most important content. Blocking AI crawlers here is the single fastest way to disappear from AI-generated answers. Check your robots.txt first.
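A quick way to run that check is Python's built-in `urllib.robotparser`. The sketch below takes the robots.txt text and a URL and reports, per crawler, whether the rules allow the fetch. The function name `crawler_access` is hypothetical, and the user-agent tokens are the three named above.

```python
from urllib.robotparser import RobotFileParser

AI_CRAWLERS = ("GPTBot", "ClaudeBot", "PerplexityBot")


def crawler_access(robots_txt: str, url: str, agents=AI_CRAWLERS) -> dict[str, bool]:
    """Report whether each crawler may fetch `url` under the given robots.txt."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}
```

Any `False` in the result for a page you list in llms.txt means that crawler is being told not to read it.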

sitemap.xml — Discovery Layer

Tells crawlers what pages exist and how recently they were updated. A well-maintained sitemap ensures AI crawlers find your newest, most authoritative content instead of stumbling onto outdated pages. Keep your sitemap current and include lastmod dates.
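To see which sitemap entries lack a lastmod date, the sitemap can be read with the standard library's XML parser. This is a sketch, assuming a standard `urlset` sitemap (not a sitemap index) and lastmod values that begin with a full YYYY-MM-DD date; `sitemap_lastmods` is a hypothetical name.

```python
import xml.etree.ElementTree as ET
from datetime import date
from typing import Optional

# Namespace used by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_lastmods(xml_text: str) -> dict[str, Optional[date]]:
    """Map each <loc> in a sitemap to its <lastmod> date, or None if absent."""
    root = ET.fromstring(xml_text)
    result = {}
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", default="", namespaces=NS).strip()
        lastmod = url.findtext("sm:lastmod", default="", namespaces=NS).strip()
        # Assumption: lastmod starts with YYYY-MM-DD; any time part is dropped.
        result[loc] = date.fromisoformat(lastmod[:10]) if lastmod else None
    return result
```

Entries that map to `None`, or to a date much older than the page's real last edit, are the ones to fix first.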

Schema Markup — Meaning Layer

JSON-LD structured data (Article, Product, FAQPage, HowTo, Organization) gives AI systems machine-readable signals about the type, author, date, and purpose of your content. Schema markup bridges the gap between raw HTML and semantic understanding. It's especially powerful for FAQs, products, and how-to content.
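As an illustration, an Organization block can be generated from a handful of fields. The sketch below uses the schema.org `Organization` type with its `name`, `url`, `foundingDate`, and `description` properties; the function name `organization_jsonld` is hypothetical, and the values you pass in are your own.

```python
import json


def organization_jsonld(name: str, url: str, founding_date: str, description: str) -> str:
    """Build a schema.org Organization object serialized as JSON-LD."""
    data = {
        "@context": "https://schema.org",
        "@type": "Organization",
        "name": name,
        "url": url,
        "foundingDate": founding_date,  # ISO 8601, for example "2021"
        "description": description,
    }
    return json.dumps(data, indent=2)
```

The returned string is meant to be placed inside a `<script type="application/ld+json">` element in the page's HTML.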

llms.txt — Context Layer

Where robots.txt controls access, sitemap.xml enables discovery, and schema markup provides semantic meaning, llms.txt provides curatorial context. It answers questions no other standard addresses: "Of all your content, what should I prioritize?" and "How should I understand who you are?"

For a deeper dive into how structured data fits into AI visibility, see our guide on structured data for AI visibility.

Which AI Crawlers Support llms.txt?

The llms.txt standard is still emerging, and formal adoption by major AI platforms is evolving. Here's the current landscape as of early 2026:

Perplexity (PerplexityBot)

Among the earliest real-time AI search engines to show alignment with llms.txt principles. Perplexity crawls actively and surfaces cited sources, making llms.txt-guided content prioritization directly impactful on citation rates.

OpenAI (GPTBot)

GPTBot crawls for training data and ChatGPT search. While OpenAI hasn't formally endorsed llms.txt, the structured context it provides may improve how training-time and retrieval-augmented content is interpreted.

Anthropic (ClaudeBot)

Anthropic's community was central to early llms.txt discussions. ClaudeBot crawls for training data. As Claude's tool-use and web retrieval capabilities expand, llms.txt support is expected to formalize.

Google (Gemini / AI Overviews)

Google has its own rich structured data ecosystem and has not formally adopted llms.txt. However, sites with strong llms.txt implementation also tend to have the clarity and structure that benefits AI Overviews sourcing.

You.com and other AI search engines

Emerging AI-native search engines are more likely to implement llms.txt support natively, given that their architecture is built around LLM context windows from the start.

Even where explicit llms.txt parsing is not confirmed, the file serves a secondary purpose: any LLM that retrieves your llms.txt URL during a query will have a clear, structured summary of your site. That's a direct improvement over hoping the model pieced together an accurate picture from scattered HTML pages.

Common Mistakes and Best Practices

Mistakes to Avoid

  • Listing every URL on your site

    llms.txt is a curated guide, not a mirror of your sitemap. Overwhelming it with hundreds of URLs dilutes the signal. Focus on the 20-30 most important pages. If you need comprehensive coverage, use llms-full.txt as a separate extended variant.

  • Missing or vague link descriptions

    A link without a description is barely better than no link. The description is how the AI decides whether to retrieve and weight that content. Write descriptions that accurately reflect what the page contains, not just its title.

  • Writing marketing copy instead of factual context

    Phrases like "industry-leading" or "best-in-class" waste tokens and reduce trust signals for AI parsers. Use precise, factual language. Describe what you do, not how you feel about what you do.

  • Setting it and forgetting it

    llms.txt needs to stay current. Stale links, outdated descriptions, or missing new flagship content will result in AI systems building an inaccurate picture of your site. Treat it as a living document and review it quarterly.
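Several of these mistakes can be caught mechanically. Below is a rough lint sketch, not a complete validator: the name `lint_llms_txt` is hypothetical, the 30-link ceiling mirrors the guidance above, and the marketing phrase list contains only the two examples mentioned here.

```python
import re

# Matches a list item of the form "- [Title](URL)" with an optional ": description".
LINK_RE = re.compile(r"^\s*-\s*\[[^\]]+\]\([^)\s]+\)(?::\s*(.*))?$")
MARKETING_PHRASES = ("industry-leading", "best-in-class")
MAX_LINKS = 30  # upper end of the 20-30 page guidance


def lint_llms_txt(text: str) -> list[str]:
    """Return warnings for missing descriptions, too many links, marketing copy."""
    warnings = []
    lines = text.splitlines()
    link_count = 0
    for i, line in enumerate(lines):
        match = LINK_RE.match(line)
        if not match:
            continue
        link_count += 1
        description = (match.group(1) or "").strip()
        # A description may be wrapped onto the next, indented line.
        following = lines[i + 1] if i + 1 < len(lines) else ""
        wrapped = (
            following.startswith((" ", "\t"))
            and following.strip() != ""
            and not following.strip().startswith("- [")
        )
        if not description and not wrapped:
            warnings.append(f"line {i + 1}: link has no description")
    if link_count > MAX_LINKS:
        warnings.append(f"{link_count} links listed; consider trimming to {MAX_LINKS}")
    lowered = text.lower()
    for phrase in MARKETING_PHRASES:
        if phrase in lowered:
            warnings.append(f'marketing phrase found: "{phrase}"')
    return warnings
```

Running this in CI alongside a link checker covers the "set it and forget it" problem as well, since stale URLs surface as soon as they start failing.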

Best Practices

  1. Prioritize pages that you most want AI systems to cite when answering questions in your category.
  2. Keep the total file under 10,000 tokens (roughly 7,500 English words) to fit within common context windows.
  3. Include basic organization facts (name, founding year, industry) in the Notes section.
  4. Use canonical URLs (with HTTPS and no trailing query strings) for all links.
  5. Cross-reference your llms.txt with your sitemap.xml to ensure there are no conflicting signals.
  6. Consider creating an llms-full.txt that inlines the full text of your most important pages for models that support long-context retrieval.
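A few of these practices lend themselves to simple checks. The helpers below are a sketch with hypothetical names; the token estimate uses a rule-of-thumb ratio of about 0.75 English words per token, which is an approximation and will differ from what any real tokenizer reports.

```python
from urllib.parse import urlsplit


def estimate_tokens(text: str, words_per_token: float = 0.75) -> int:
    """Rough token estimate from a whitespace word count (rule of thumb only)."""
    return round(len(text.split()) / words_per_token)


def is_canonical(url: str) -> bool:
    """True if the URL uses HTTPS and has no query string or fragment."""
    parts = urlsplit(url)
    return parts.scheme == "https" and not parts.query and not parts.fragment


def missing_from_sitemap(llms_urls: list[str], sitemap_urls: list[str]) -> list[str]:
    """List llms.txt URLs that do not appear in the sitemap.

    Trailing slashes are ignored when comparing.
    """
    known = {url.rstrip("/") for url in sitemap_urls}
    return [url for url in llms_urls if url.rstrip("/") not in known]
```

A non-empty result from `missing_from_sitemap` is the kind of conflicting signal the cross-reference step is meant to catch.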

How to Know If Your llms.txt Is Actually Working

Here's the challenge: you can write a perfect llms.txt, deploy it correctly, and still have little idea whether AI models are actually using it. The file sits at your root URL. Your server access logs can show whether a crawler such as GPTBot fetched it, but nothing on your side shows whether Perplexity used your curated links when generating an answer, or whether Claude cited your docs page instead of a competitor's.
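Server access logs do give a partial signal for the first of those questions, namely whether a crawler requested the file at all. The sketch below counts such requests; it assumes logs in the common "combined" format used by Apache and nginx, matches crawlers by substring in the user-agent header, and uses the hypothetical name `count_llms_fetches`. It says nothing about how the file was used afterwards.

```python
import re
from collections import Counter

# Combined log format: ... "METHOD path protocol" status bytes "referer" "user-agent"
LOG_RE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)
AI_CRAWLERS = ("GPTBot", "ClaudeBot", "PerplexityBot")


def count_llms_fetches(log_lines, path: str = "/llms.txt") -> Counter:
    """Count requests for `path` per AI crawler, matched by user-agent substring."""
    counts = Counter()
    for line in log_lines:
        match = LOG_RE.search(line)
        if not match or match["path"] != path:
            continue
        agent = match["agent"].lower()
        for bot in AI_CRAWLERS:
            if bot.lower() in agent:
                counts[bot] += 1
    return counts
```

User-agent strings can be spoofed, so treat these counts as indicative rather than authoritative.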

This is the visibility gap that Surfaced was built to close. Surfaced monitors your brand, product, and content across AI-generated outputs from ChatGPT, Perplexity, Claude, Gemini, and other LLM-powered systems. It tracks:

  • Whether your content is being cited in AI answers for your target queries
  • Which competitors are being cited instead of you
  • How your AI citation share changes over time as you optimize
  • Whether new content you publish (guided by llms.txt) gets picked up by AI systems
  • Sentiment and accuracy of AI-generated statements about your brand

llms.txt is an input. Surfaced shows you the output. Together, they give you a complete loop: you optimize your AI presence, and you measure whether it's working. Without measurement, you're optimizing blind.


Frequently Asked Questions

Is llms.txt an official standard?

Not yet. As of 2026, llms.txt is a community-proposed specification introduced by Jeremy Howard. It does not have formal ratification from W3C, IETF, or any major AI vendor. However, the spec has been widely adopted by forward-thinking websites and is gaining traction as AI crawlers evolve. Implementing it now is low-cost and positions you ahead of when formal adoption happens.

Will llms.txt hurt my SEO?

No. llms.txt is separate from traditional SEO signals. Search engines like Google do not treat it as a ranking factor. It is purely an advisory file for LLM systems. There is no known downside to adding it, and significant potential upside in AI-generated answer visibility.

Do I need llms.txt if I already have schema markup?

Both serve different purposes and are complementary. Schema markup provides machine-readable metadata within individual pages. llms.txt provides a site-level context briefing and curatorial guide. You should implement both. Schema markup helps AI understand what each page means; llms.txt helps AI understand which pages matter most and what your site is about as a whole.

How often should I update my llms.txt?

Review and update it whenever you launch significant new content, change your product positioning, add a new product line, or deprecate major pages. At minimum, audit it quarterly. Stale llms.txt files can actively mislead AI systems by pointing to outdated or irrelevant content.

Can I use llms.txt to prevent AI from using my content?

No — that's what robots.txt is for. If you want to block AI crawlers from training on your content, use the Disallow directive in robots.txt for specific bots like GPTBot or ClaudeBot. llms.txt is purely about improving AI comprehension of content you're already happy to share.
