robots.txt for AI Crawlers in 2026: The Complete Guide
Last year, robots.txt was a quiet file that almost nobody thought about. This year, it is one of the most strategically important files on your site. The reason is simple: a growing share of your highest-intent traffic now arrives via AI assistants that respect what robots.txt tells them, and the difference between "we are cited inside ChatGPT" and "we are invisible to ChatGPT" often comes down to four lines of text.
This guide walks through every AI crawler worth knowing about in 2026, what each one does, and how to configure robots.txt for the outcome you actually want. The examples are copy-paste ready. The recommendations reflect what works for most commercial sites; you should adapt them for your specific situation.
The short answer: should you allow AI crawlers?
For most public-facing commercial sites in 2026, the practical default is to allow major AI crawlers. The reason is that AI assistants (ChatGPT, Claude, Perplexity, Gemini, Copilot) are now significant sources of high-intent referral traffic, and blocking their crawlers means your content cannot be cited in their answers. For a site whose business depends on being found, that is a meaningful cost.
There are legitimate reasons to block. Content businesses (publishers, paywalled archives, original journalism) often have a real conflict between training-data extraction and their economic model. Sites with sensitive information may not want it surfaced through AI summaries. Sites whose terms of service explicitly disallow scraping have a coherent reason to enforce that policy in robots.txt.
The decision is yours. What follows is the technical guide for executing whichever choice you make.
The AI crawlers worth knowing about
The list of AI-related crawlers has grown substantially over the last two years. Below is the complete set worth addressing explicitly in 2026.
| User-agent | Operator | What it does |
|---|---|---|
| GPTBot | OpenAI | Crawls public web for training data |
| ChatGPT-User | OpenAI | Fetches pages on behalf of ChatGPT users in real time |
| OAI-SearchBot | OpenAI | Powers ChatGPT search results |
| ClaudeBot | Anthropic | Crawls public web for Claude training data |
| Claude-Web | Anthropic | Legacy Anthropic crawler identifier |
| anthropic-ai | Anthropic | Live Claude content retrieval |
| PerplexityBot | Perplexity | Indexes content for Perplexity answers |
| Perplexity-User | Perplexity | Fetches pages on behalf of Perplexity users |
| Google-Extended | Google | Controls whether content trains Gemini and Vertex AI |
| Applebot-Extended | Apple | Controls whether content trains Apple Intelligence |
| Bytespider | ByteDance | Crawls for TikTok and Doubao AI |
| CCBot | Common Crawl | Powers many AI training datasets indirectly |
| Cohere-AI / cohere-ai | Cohere | Cohere model training |
| MistralAI-User | Mistral | Mistral model retrieval and training |
| Meta-ExternalAgent | Meta | Meta AI products and Llama training data |
| Meta-ExternalFetcher | Meta | Live retrieval for Meta AI assistants |
| DiffbotBot | Diffbot | Structured-data extraction for AI knowledge graphs |
| YouBot | You.com | You.com search and AI |
| ImagesiftBot | Imagesift | AI image search |
| Timpibot | Timpi | Search index for AI applications |
Note that the line between training crawlers, retrieval crawlers, and indexing crawlers has blurred. The same vendor often operates multiple user-agents for different purposes; the operator's documentation is the authoritative guide for what each one does.
The recommended robots.txt template for most commercial sites
This is the template we use across crawlcrawl's own properties and recommend to most commercial sites. It explicitly allows the major search engines and AI assistants while preserving the ability to block specific user-agents later if your strategy changes.
    # Default: allow all
    User-agent: *
    Allow: /

    # Search engines
    User-agent: Googlebot
    Allow: /

    User-agent: Bingbot
    Allow: /

    User-agent: DuckDuckBot
    Allow: /

    # AI assistants
    User-agent: GPTBot
    Allow: /

    User-agent: ChatGPT-User
    Allow: /

    User-agent: OAI-SearchBot
    Allow: /

    User-agent: ClaudeBot
    Allow: /

    User-agent: anthropic-ai
    Allow: /

    User-agent: PerplexityBot
    Allow: /

    User-agent: Perplexity-User
    Allow: /

    User-agent: Google-Extended
    Allow: /

    User-agent: Applebot-Extended
    Allow: /

    User-agent: Bytespider
    Allow: /

    User-agent: CCBot
    Allow: /

    User-agent: cohere-ai
    Allow: /

    User-agent: MistralAI-User
    Allow: /

    User-agent: Meta-ExternalAgent
    Allow: /

    User-agent: Meta-ExternalFetcher
    Allow: /

    # Social previewers
    User-agent: facebookexternalhit
    Allow: /

    User-agent: Twitterbot
    Allow: /

    User-agent: LinkedInBot
    Allow: /

    Sitemap: https://yoursite.com/sitemap.xml
This template is intentionally explicit. The default User-agent: * with Allow: / covers most well-behaved bots, but listing the AI crawlers separately makes your intent clear and lets you change the policy for one user-agent by editing its own group, without touching the rest of the file.
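To check that an explicit template resolves the way you expect, and that flipping one group leaves the others untouched, a short sketch with Python's standard-library urllib.robotparser can help. The shortened template and the helper name resolve_policy are illustrative assumptions, and the standard-library parser's matching may differ in detail from what each vendor's crawler implements.

```python
from urllib.robotparser import RobotFileParser

# Shortened, illustrative subset of the template above.
TEMPLATE = """\
User-agent: *
Allow: /

User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /
"""

AGENTS = ["GPTBot", "ClaudeBot", "PerplexityBot"]


def resolve_policy(robots_txt: str, agents: list[str], url: str = "/") -> dict[str, bool]:
    """Return {agent: allowed?} for one URL path, as urllib.robotparser sees it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


before = resolve_policy(TEMPLATE, AGENTS)

# Flip a single group: only the GPTBot block is edited.
flipped = TEMPLATE.replace(
    "User-agent: GPTBot\nAllow: /",
    "User-agent: GPTBot\nDisallow: /",
)
after = resolve_policy(flipped, AGENTS)

print(before)
print(after)
```

Running the same check in a deploy pipeline is one way to catch an accidental policy change before it ships.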
The recommended template for content businesses
If your business model depends on people visiting your site to read content (publishing, paywalled archives, journalism, premium how-to libraries), you may want to allow AI assistants to cite you (retrieval) while blocking them from training on you (extraction). The distinction is imperfect but useful.
    # Default: allow
    User-agent: *
    Allow: /

    # Block training crawlers
    User-agent: GPTBot
    Disallow: /

    User-agent: ClaudeBot
    Disallow: /

    User-agent: Google-Extended
    Disallow: /

    User-agent: Applebot-Extended
    Disallow: /

    User-agent: Bytespider
    Disallow: /

    User-agent: CCBot
    Disallow: /

    User-agent: cohere-ai
    Disallow: /

    User-agent: Meta-ExternalAgent
    Disallow: /

    # Allow live retrieval crawlers (cite us, do not train on us)
    User-agent: ChatGPT-User
    Allow: /

    User-agent: OAI-SearchBot
    Allow: /

    User-agent: anthropic-ai
    Allow: /

    User-agent: PerplexityBot
    Allow: /

    User-agent: Perplexity-User
    Allow: /

    User-agent: MistralAI-User
    Allow: /

    User-agent: Meta-ExternalFetcher
    Allow: /

    Sitemap: https://yoursite.com/sitemap.xml
This pattern is becoming common among newspapers, magazines, and publishers who want to remain citable inside AI answers (because citations send referral traffic) while opting out of training-data extraction.
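One way to keep the training/retrieval split maintainable is to hold the two groups as lists and render the file from them, so moving a crawler between groups is a one-line change. The sketch below is a minimal illustration; build_robots_txt and the shortened lists are hypothetical names, not part of any tool mentioned in this guide.

```python
def build_robots_txt(block: list[str], allow: list[str], sitemap: str) -> str:
    """Render a robots.txt that blocks one set of agents and allows another."""
    lines = ["# Default: allow", "User-agent: *", "Allow: /", ""]

    lines.append("# Block training crawlers")
    for agent in block:
        lines += [f"User-agent: {agent}", "Disallow: /", ""]

    lines.append("# Allow live retrieval crawlers")
    for agent in allow:
        lines += [f"User-agent: {agent}", "Allow: /", ""]

    lines.append(f"Sitemap: {sitemap}")
    return "\n".join(lines) + "\n"


# Shortened lists for illustration; extend them from the table earlier in this guide.
TRAINING = ["GPTBot", "ClaudeBot", "Google-Extended", "CCBot"]
RETRIEVAL = ["ChatGPT-User", "OAI-SearchBot", "PerplexityBot"]

robots_txt = build_robots_txt(TRAINING, RETRIEVAL, "https://yoursite.com/sitemap.xml")
print(robots_txt)
```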
What changes when you block AI crawlers
Three things happen when you block an AI crawler.
- Training-data flow stops. Future model versions trained after your block takes effect will not learn from your content. Existing model versions may already contain content crawled before the block; that does not disappear retroactively.
- Citation flow stops for live-retrieval AI assistants. ChatGPT, Claude, Perplexity, and Gemini fetch pages in real time when answering certain questions. Blocking their retrieval user-agents means they will not cite your live pages, only what is already cached or in training data.
- Referral traffic drops. AI-driven referral traffic typically arrives via clicks on citations inside answers. No citations means no clicks.
The size of these effects depends on your category. For technical documentation, developer tools, and informational content, AI-referral traffic in 2026 can be 5-20% of total organic traffic and growing. For e-commerce and transactional sites, the share is typically smaller. For brand-name search and high-intent commercial queries, AI assistants often show citations directly in the answer view, and being one of those citations is meaningful.
Common mistakes in AI-crawler robots.txt
A few patterns we see often that create unintended outcomes.
Inheriting a default block from a hosting provider
Some hosting providers and security products add aggressive defaults to robots.txt that block all bots, including AI crawlers, without making it obvious. The result is that a site is invisible to ChatGPT for months before anyone notices. Audit your live robots.txt at least quarterly to catch this.
Blocking a user-agent that does not exist
"AI" and "LLM" are not valid user-agent values. Blocking them does nothing. Every AI crawler has a specific user-agent string; only those specific strings are honored.
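A small lint can catch this mistake by flagging any User-agent value that is not on a list of tokens you recognize. The sketch below assumes a hand-maintained KNOWN_AGENTS set (abbreviated here); the function name is illustrative.

```python
# Lowercased tokens you expect to see; abbreviated and hand-maintained (an assumption).
KNOWN_AGENTS = {
    "*", "googlebot", "bingbot", "gptbot", "chatgpt-user", "oai-searchbot",
    "claudebot", "anthropic-ai", "perplexitybot", "perplexity-user",
    "google-extended", "applebot-extended", "bytespider", "ccbot",
    "cohere-ai", "meta-externalagent", "meta-externalfetcher",
}


def unknown_agents(robots_txt: str, known: set[str] = KNOWN_AGENTS) -> list[str]:
    """Return User-agent values in the file that are not in the known set."""
    unknown: list[str] = []
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "user-agent":
            token = value.strip()
            if token.lower() not in known and token not in unknown:
                unknown.append(token)
    return unknown


sample = """\
User-agent: AI
Disallow: /

User-agent: LLM
Disallow: /

User-agent: GPTBot
Disallow: /
"""

print(unknown_agents(sample))  # ['AI', 'LLM']
```

An unknown token is not necessarily wrong (new crawlers appear regularly), so treat the output as a prompt to check the operator's documentation.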
Using robots.txt for sensitive content
robots.txt is a polite request, not an enforcement mechanism. Well-behaved crawlers respect it; less-well-behaved actors ignore it. For genuinely sensitive content, use authentication, IP allowlisting, or other enforcement, not robots.txt.
Forgetting that crawlers can be blocked at other layers
Your CDN, WAF, and bot-protection vendor may be blocking AI crawlers at the network layer regardless of what robots.txt says. If you have allowed an AI crawler in robots.txt but your CDN is still returning 403 to that user-agent, the crawler will not reach your content. Audit both layers.
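Auditing both layers amounts to comparing two facts per crawler: what robots.txt says, and what status code the network actually returns. A minimal sketch of that comparison, assuming you have already collected both as dictionaries (the names and sample values are illustrative):

```python
def find_layer_conflicts(
    robots_allowed: dict[str, bool],
    http_status: dict[str, int],
) -> list[str]:
    """Agents that robots.txt allows but that did not receive a 2xx response."""
    return sorted(
        agent
        for agent, allowed in robots_allowed.items()
        if allowed and not 200 <= http_status.get(agent, 0) < 300
    )


# Illustrative data: ClaudeBot is allowed in robots.txt but the CDN returns 403.
robots_allowed = {"GPTBot": True, "ClaudeBot": True, "CCBot": False}
http_status = {"GPTBot": 200, "ClaudeBot": 403, "CCBot": 403}

conflicts = find_layer_conflicts(robots_allowed, http_status)
print(conflicts)  # ['ClaudeBot']
```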
Auditing whether AI crawlers can reach your site
Three ways to test, in increasing depth.
1. Fetch your robots.txt and read it
Open https://yoursite.com/robots.txt in a browser. Read it. Confirm the rules are what you intend. If there is no robots.txt file at all, the default behavior is to allow all bots, which is usually the right starting point.
2. Send a test request with each user-agent
Use curl with a specific user-agent to verify your site responds correctly:
    curl -A "GPTBot/1.0" https://yoursite.com/somepage
    curl -A "ClaudeBot" https://yoursite.com/somepage
    curl -A "PerplexityBot" https://yoursite.com/somepage
If the responses are HTTP 200 with content, the crawler can reach you. If they are 403, 404, or empty, something between the crawler and your origin is blocking the request.
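The same check can be scripted so it runs against many user-agents at once. The sketch below uses only Python's standard library; the helper names are illustrative, and the demonstration at the bottom uses a stand-in fetcher so it runs offline. A spoofed User-Agent only exercises rules keyed on that header, so a site or CDN that verifies crawlers by IP address may treat this probe differently from the real crawler.

```python
import urllib.error
import urllib.request
from typing import Callable, Optional

AGENTS = ["GPTBot/1.0", "ClaudeBot", "PerplexityBot"]


def http_status(url: str, user_agent: str, timeout: float = 10.0) -> Optional[int]:
    """GET the URL with the given User-Agent; return the status, or None on network failure."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:  # 4xx/5xx responses arrive as exceptions
        return err.code
    except OSError:  # DNS failure, refused connection, timeout
        return None


def probe(
    url: str,
    agents: list[str],
    fetch: Callable[[str, str], Optional[int]] = http_status,
) -> dict[str, Optional[int]]:
    """Return {user_agent: status} for one URL."""
    return {agent: fetch(url, agent) for agent in agents}


def blocked(results: dict[str, Optional[int]]) -> list[str]:
    """User-agents that did not get a plain HTTP 200."""
    return [agent for agent, status in results.items() if status != 200]


# Offline demonstration with a stand-in fetcher (hypothetical responses).
def fake_fetch(url: str, agent: str) -> Optional[int]:
    return 403 if agent == "ClaudeBot" else 200


results = probe("https://yoursite.com/somepage", AGENTS, fetch=fake_fetch)
print(blocked(results))  # ['ClaudeBot']
```

To probe a live site, call probe(url, AGENTS) without the fetch argument.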
3. Use a dedicated AI-bot audit
For continuous monitoring, an audit tool that checks AI-crawler accessibility across your whole site is useful. crawlcrawl's AI-bot audit endpoint returns the resolved policy for every major AI user-agent against any URL you give it, which makes ongoing monitoring straightforward.
Configuration patterns by platform
Where you actually edit robots.txt depends on how your site is built. A few common patterns.
Static sites and traditional servers
Put the file at the document root and let your web server serve it as plain text. This is the cleanest setup; the file is exactly what visitors see when they hit /robots.txt. Static site generators and frameworks copy the file unchanged into the build output from a static-assets location: public/ in Next.js and Astro, static/ in Hugo, and the project root in Jekyll.
Cloudflare Pages, Vercel, and Netlify
Put robots.txt in your public/ or static/ directory depending on the framework. The deployment pipeline serves it as a static asset. Verify after deploy by fetching the live URL; some frameworks rewrite paths in unexpected ways.
CDN edge rules and Workers
Some teams synthesize robots.txt at the edge with a Cloudflare Worker or equivalent. This is useful when robots.txt rules should vary by environment (staging blocks all bots; production allows them). Verify the worker output matches your intent in every environment.
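The environment switch itself is a few lines of logic. The sketch below shows it in Python for consistency with the other examples in this guide; a Worker would express the same branch in JavaScript. The function name and the environment label "production" are assumptions.

```python
def robots_for_environment(environment: str, sitemap_url: str) -> str:
    """Return the robots.txt body to serve for a given deployment environment."""
    if environment == "production":
        return f"User-agent: *\nAllow: /\n\nSitemap: {sitemap_url}\n"
    # Anything that is not production (staging, preview, dev) blocks all bots.
    return "User-agent: *\nDisallow: /\n"


print(robots_for_environment("staging", "https://yoursite.com/sitemap.xml"))
print(robots_for_environment("production", "https://yoursite.com/sitemap.xml"))
```

Defaulting every unrecognized environment to the blocking branch means a misconfigured variable fails closed rather than exposing a staging site.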
WordPress and other CMSes
Most modern WordPress SEO plugins generate robots.txt dynamically. If your plugin's default conflicts with your AI-crawler intent, you usually need to edit it through the plugin's settings rather than uploading a raw robots.txt file. The plugin will respect or override your file depending on configuration.
The complementary file: llms.txt
robots.txt tells crawlers where they can go. llms.txt is an emerging-standard file that tells AI assistants what your site is and how to summarize it. The two complement each other: robots.txt controls access; llms.txt controls representation.
A well-crafted llms.txt is a structured, plain-text summary of your site (product list, pricing, key links, common questions) that AI assistants can ingest cheaply and cite confidently. Most teams who configure robots.txt for AI crawlers also publish llms.txt for the same reason: making it easy for AI assistants to understand what you do increases the quality of citations you receive.
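As a rough illustration of what such a file contains, the sketch below renders one from structured data. It assumes the Markdown layout described in the llms.txt proposal (an H1 title, a blockquote summary, then H2 sections of annotated links); the product name, URLs, and the build_llms_txt helper are all placeholders.

```python
def build_llms_txt(
    title: str,
    summary: str,
    sections: dict[str, list[tuple[str, str, str]]],
) -> str:
    """Render an llms.txt body: title, one-line summary, then sections of links."""
    lines = [f"# {title}", "", f"> {summary}", ""]
    for heading, links in sections.items():
        lines += [f"## {heading}", ""]
        for name, url, note in links:
            lines.append(f"- [{name}]({url}): {note}")
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"


# Placeholder content for a hypothetical site.
llms_txt = build_llms_txt(
    title="Example Product",
    summary="Example Product is a hosted tool for auditing crawler access.",
    sections={
        "Docs": [
            ("Quickstart", "https://yoursite.com/docs/quickstart", "Setup in five minutes"),
            ("Pricing", "https://yoursite.com/pricing", "Plans and limits"),
        ],
        "FAQ": [
            ("Common questions", "https://yoursite.com/faq", "Short answers"),
        ],
    },
)
print(llms_txt)
```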
crawlcrawl's llms.txt builder generates the file for any site you point it at, which is the fastest way to ship an llms.txt without writing it by hand.
The robots.txt decision tree for 2026
A short decision tree that covers most situations.
- Is your business commercial and dependent on being found? Allow major AI crawlers. The referral traffic alone usually justifies the access.
- Are you a content business with original journalism or paywalled material? Consider the training-block-but-allow-retrieval pattern above. You stay citable while opting out of training-data extraction.
- Is your site internal-only, behind authentication, or genuinely private? robots.txt is the wrong tool. Use authentication and network controls. AI crawlers will not reach authenticated content regardless.
- Are you experimenting? Default to allow, monitor your AI-referral traffic for a quarter, then revise based on actual data.
Frequently asked questions
Should I allow AI crawlers in robots.txt?
For most public-facing commercial sites in 2026, allowing AI crawlers is the practical default. AI assistants are now significant traffic sources, and blocking their crawlers means your content cannot be cited in their answers.
How do I block GPTBot in robots.txt?
Add User-agent: GPTBot followed by Disallow: / on the next line. The same pattern applies to ChatGPT-User, ClaudeBot, anthropic-ai, PerplexityBot, and other AI crawlers.
What is the difference between GPTBot and ChatGPT-User?
GPTBot is OpenAI's training-data crawler. ChatGPT-User fetches pages on behalf of ChatGPT users in real time when they ask questions. Blocking GPTBot stops training-data collection; blocking ChatGPT-User prevents your content from being cited in live ChatGPT answers. You can choose either, both, or neither.
Will blocking AI crawlers hurt my Google SEO?
Blocking AI crawlers does not directly affect Google's ranking algorithm. It does affect whether your content appears in ChatGPT, Claude, Perplexity, and similar surfaces. Google's AI Overviews are part of Search and follow your Googlebot rules rather than Google-Extended. Treat search ranking and AI visibility as separate strategic decisions.
How do I block all AI crawlers at once?
You must list each AI crawler's user-agent separately. User-agent: * matches every crawler, search engines included, and there is no wildcard that targets only AI crawlers. The major ones to address are GPTBot, ChatGPT-User, ClaudeBot, anthropic-ai, PerplexityBot, Google-Extended, Applebot-Extended, Bytespider, CCBot, Cohere-AI, and Meta-ExternalAgent.
Does robots.txt apply to AI agents like ChatGPT plugins?
Generally yes, but the picture is evolving. Most AI assistants respect robots.txt for their retrieval crawlers; some plugins or third-party integrations may have their own user-agents. Audit the user-agents that actually hit your site over a quarter to see what is reaching you.
How often should I update my robots.txt for AI crawlers?
Quarterly is a reasonable cadence. New AI assistants and crawlers launch frequently, and existing ones occasionally change their user-agent strings. A quarterly review keeps your policy current without becoming a maintenance burden.
How do I validate that my robots.txt policy actually works?
Three checks cover most of the validation work. First, fetch the file directly from a non-cached URL and confirm the syntax parses with a robots.txt validator. Second, search your server logs for the user-agent strings of crawlers you allowed and confirm they show up after the change took effect. Third, query an AI assistant like ChatGPT, Claude, or Perplexity with a question that should be answerable from your site and see whether your URL is cited. Citation flow is the ground-truth signal that the policy is having the intended effect. The crawlcrawl AI-bot audit endpoint at GET /v1/robots-policy resolves the parsed policy for every major AI crawler against any URL you give it, which makes the first check a one-line API call instead of a manual file inspection. The endpoint is free to use on every tier.
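The log check (the second of the three) can be sketched as a simple substring count per crawler. The sample lines below are invented for illustration, including the IP addresses and full user-agent strings; the function name and crawler list are assumptions, and real logs may need a proper parser.

```python
from collections import Counter
from typing import Iterable

CRAWLERS = ["GPTBot", "ChatGPT-User", "OAI-SearchBot", "ClaudeBot", "PerplexityBot"]


def count_crawler_hits(log_lines: Iterable[str], crawlers: list[str] = CRAWLERS) -> Counter:
    """Count log lines whose text mentions each crawler token (case-insensitive)."""
    hits: Counter = Counter()
    for line in log_lines:
        lowered = line.lower()
        for name in crawlers:
            if name.lower() in lowered:
                hits[name] += 1
                break  # attribute each line to at most one crawler
    return hits


# Invented access-log lines for illustration only.
sample_log = [
    '203.0.113.5 - - [01/Mar/2026:10:00:00 +0000] "GET /docs HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; GPTBot/1.2)"',
    '203.0.113.6 - - [01/Mar/2026:10:00:05 +0000] "GET /pricing HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (compatible; GPTBot/1.2)"',
    '198.51.100.7 - - [01/Mar/2026:10:01:00 +0000] "GET / HTTP/1.1" 200 1024 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
    '192.0.2.9 - - [01/Mar/2026:10:02:00 +0000] "GET / HTTP/1.1" 200 1024 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"',
]

hits = count_crawler_hits(sample_log)
print(dict(hits))
```

Because the function accepts any iterable of lines, passing an open file object for your own access log works without loading the whole file into memory.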
What happens if I change my mind after blocking an AI crawler?
Reversing a block is straightforward at the file level, but the practical recovery time depends on the crawler's cache cycle. Most major AI crawlers refresh their robots.txt cache every 24 to 48 hours, after which they re-evaluate the policy and resume crawling. Some assistants additionally cache the inference that "this site is blocked" inside their training pipeline, which means the citation flow can lag the policy change by weeks. The pragmatic implication is to make blocking decisions deliberately, knowing the unwinding is slow even when the file change is instant.
The takeaway
robots.txt for AI crawlers in 2026 is a strategic file, not just a technical one. The decision about which AI assistants can read your content shapes how visible your brand is inside the answer surfaces where buyers increasingly find products. For most commercial sites, allowing the major AI crawlers is the default that maximizes discoverability and citation flow.
Start by reviewing your current robots.txt against the templates above. Choose the posture that fits your business. Audit at the CDN and WAF layers to confirm the policy actually reaches the wire. Then publish a complementary llms.txt to give AI assistants a clean, structured representation of your site to cite from. The combination of the two files is the practical AEO infrastructure layer for the next several years.