Published 2026-05-16 · 12 min read · Updated May 2026

robots.txt for AI Crawlers in 2026: The Complete Guide

Last year, robots.txt was a quiet file that almost nobody thought about. This year, it is one of the most strategically important files on your site. The reason is simple: a growing share of your highest-intent traffic now arrives via AI assistants that respect what robots.txt tells them, and the difference between "we are cited inside ChatGPT answers" and "we are invisible to ChatGPT" often comes down to four lines of text.

This guide walks through every AI crawler worth knowing about in 2026, what each one does, and how to configure robots.txt for the outcome you actually want. The examples are copy-paste ready. The recommendations reflect what works for most commercial sites; you should adapt them for your specific situation.

The short answer: should you allow AI crawlers?

For most public-facing commercial sites in 2026, the practical default is to allow major AI crawlers. The reason is that AI assistants (ChatGPT, Claude, Perplexity, Gemini, Copilot) are now significant sources of high-intent referral traffic, and blocking their crawlers means your content cannot be cited in their answers. For a site whose business depends on being found, that is a meaningful cost.

There are legitimate reasons to block. Content businesses (publishers, paywalled archives, original journalism) often have a real conflict between training-data extraction and their economic model. Sites with sensitive information may not want it surfaced through AI summaries. Sites whose terms of service explicitly disallow scraping have a coherent reason to enforce that policy in robots.txt.

The decision is yours. What follows is the technical guide for executing whichever choice you make.

The AI crawlers worth knowing about

The list of AI-related crawlers has grown substantially over the last two years. Below is the complete set worth addressing explicitly in 2026.

| User-agent | Operator | What it does |
| --- | --- | --- |
| GPTBot | OpenAI | Crawls public web for training data |
| ChatGPT-User | OpenAI | Fetches pages on behalf of ChatGPT users in real time |
| OAI-SearchBot | OpenAI | Powers ChatGPT search results |
| ClaudeBot | Anthropic | Crawls public web for Claude training data |
| Claude-Web | Anthropic | Legacy Anthropic crawler identifier |
| anthropic-ai | Anthropic | Live Claude content retrieval |
| PerplexityBot | Perplexity | Indexes content for Perplexity answers |
| Perplexity-User | Perplexity | Fetches pages on behalf of Perplexity users |
| Google-Extended | Google | Controls whether content trains Gemini and Vertex AI |
| Applebot-Extended | Apple | Controls whether content trains Apple Intelligence |
| Bytespider | ByteDance | Crawls for TikTok and Doubao AI |
| CCBot | Common Crawl | Powers many AI training datasets indirectly |
| Cohere-AI / cohere-ai | Cohere | Cohere model training |
| MistralAI-User | Mistral | Mistral model retrieval and training |
| Meta-ExternalAgent | Meta | Meta AI products and Llama training data |
| Meta-ExternalFetcher | Meta | Live retrieval for Meta AI assistants |
| DiffbotBot | Diffbot | Structured-data extraction for AI knowledge graphs |
| YouBot | You.com | You.com search and AI |
| ImagesiftBot | Imagesift | AI image search |
| Timpibot | Timpi | Search index for AI applications |

Note that the line between training crawlers, retrieval crawlers, and indexing crawlers has blurred. The same vendor often operates multiple user-agents for different purposes; the operator's documentation is the authoritative guide for what each one does.

The recommended robots.txt template for most commercial sites

This is the template we use across crawlcrawl's own properties and recommend to most commercial sites. It explicitly allows the major search engines and AI assistants while preserving the ability to block specific user-agents later if your strategy changes.

# Default: allow all
User-agent: *
Allow: /

# Search engines
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

User-agent: DuckDuckBot
Allow: /

# AI assistants
User-agent: GPTBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: anthropic-ai
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Perplexity-User
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: Applebot-Extended
Allow: /

User-agent: Bytespider
Allow: /

User-agent: CCBot
Allow: /

User-agent: cohere-ai
Allow: /

User-agent: MistralAI-User
Allow: /

User-agent: Meta-ExternalAgent
Allow: /

User-agent: Meta-ExternalFetcher
Allow: /

# Social previewers
User-agent: facebookexternalhit
Allow: /

User-agent: Twitterbot
Allow: /

User-agent: LinkedInBot
Allow: /

Sitemap: https://yoursite.com/sitemap.xml

This template is intentionally explicit. The default User-agent: * with Allow: / covers most well-behaved bots, but listing the AI crawlers separately makes your intent clear. A crawler that finds a group naming it follows that group instead of the * group, so you can later switch one user-agent from Allow to Disallow without touching the rest of the file.
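
The effect of flipping one user-agent can be checked locally before deploying. The following is a minimal sketch, assuming Python's standard-library urllib.robotparser as a stand-in for a crawler's own parser (real crawlers may interpret edge cases differently). The trimmed robots.txt text, the SomeOtherBot name, and the example URL are illustrative assumptions, not part of the template.

```python
from urllib.robotparser import RobotFileParser

# Trimmed, illustrative version of the template with GPTBot flipped to Disallow.
ROBOTS_TXT = """\
User-agent: *
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Allow: /
"""


def allowed(robots_txt: str, agent: str,
            url: str = "https://yoursite.com/somepage") -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# SomeOtherBot is a hypothetical crawler with no group of its own,
# so it falls back to the `*` group.
results = {agent: allowed(ROBOTS_TXT, agent)
           for agent in ("GPTBot", "ClaudeBot", "SomeOtherBot")}
print(results)  # {'GPTBot': False, 'ClaudeBot': True, 'SomeOtherBot': True}
```

Only the flipped agent changes; every other crawler keeps the behavior of its own group or of the default group.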

The recommended template for content businesses

If your business model depends on people visiting your site to read content (publishing, paywalled archives, journalism, premium how-to libraries), you may want to allow AI assistants to cite you (retrieval) while blocking them from training on you (extraction). The distinction is imperfect but useful.

# Default: allow
User-agent: *
Allow: /

# Block training crawlers
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: cohere-ai
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

# Allow live retrieval crawlers (cite us, do not train on us)
User-agent: ChatGPT-User
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: anthropic-ai
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Perplexity-User
Allow: /

User-agent: MistralAI-User
Allow: /

User-agent: Meta-ExternalFetcher
Allow: /

Sitemap: https://yoursite.com/sitemap.xml

This pattern is becoming common among newspapers, magazines, and publishers who want to remain citable inside AI answers (because citations send referral traffic) while opting out of training-data extraction.
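
One way to keep the training and retrieval lists from drifting apart is to generate the file from two lists and check the result. This is a rough sketch, assuming Python's standard-library urllib.robotparser approximates how a compliant crawler reads the file. The render function and the example article URL are hypothetical, and the split into the two lists follows this guide rather than any official taxonomy.

```python
from urllib.robotparser import RobotFileParser

# User-agent names copied from the template above.
TRAINING = ["GPTBot", "ClaudeBot", "Google-Extended", "Applebot-Extended",
            "Bytespider", "CCBot", "cohere-ai", "Meta-ExternalAgent"]
RETRIEVAL = ["ChatGPT-User", "OAI-SearchBot", "anthropic-ai", "PerplexityBot",
             "Perplexity-User", "MistralAI-User", "Meta-ExternalFetcher"]


def render(training, retrieval, sitemap):
    """Build robots.txt text: default allow, block training, allow retrieval."""
    groups = ["User-agent: *\nAllow: /"]
    groups += [f"User-agent: {agent}\nDisallow: /" for agent in training]
    groups += [f"User-agent: {agent}\nAllow: /" for agent in retrieval]
    return "\n\n".join(groups) + f"\n\nSitemap: {sitemap}\n"


robots_txt = render(TRAINING, RETRIEVAL, "https://yoursite.com/sitemap.xml")

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

page = "https://yoursite.com/articles/example"  # hypothetical URL
blocked = [a for a in TRAINING + RETRIEVAL if not parser.can_fetch(a, page)]
print(blocked == TRAINING)  # True: only the training crawlers are blocked
```

Generating the file this way means a new crawler is added in exactly one list, and the check fails loudly if a retrieval agent ends up blocked by mistake.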

What changes when you block AI crawlers

Three things happen when you block an AI crawler.

  1. Training-data flow stops. Future model versions trained after your block takes effect will not learn from your content. Existing model versions may already contain content crawled before the block; that does not disappear retroactively.
  2. Citation flow stops for live-retrieval AI assistants. ChatGPT, Claude, Perplexity, and Gemini fetch pages in real time when answering certain questions. Blocking their retrieval user-agents means they will not cite your live pages, only what is already cached or in training data.
  3. Referral traffic drops. AI-driven referral traffic typically arrives via clicks on citations inside answers. No citations means no clicks.

The size of these effects depends on your category. For technical documentation, developer tools, and informational content, AI-referral traffic in 2026 can be 5-20% of total organic traffic and growing. For e-commerce and transactional sites, the share is typically smaller. For brand-name search and high-intent commercial queries, AI assistants often show citations directly in the answer view, and being one of those citations is meaningful.

Common mistakes in AI-crawler robots.txt

A few patterns we see often that create unintended outcomes.

Inheriting a default block from a hosting provider

Some hosting providers and security products add aggressive defaults to robots.txt that block all bots, including AI crawlers, without making it obvious. The result is that a site is invisible to ChatGPT for months before anyone notices. Audit your live robots.txt at least quarterly to catch this.

Blocking a user-agent that does not exist

"AI" and "LLM" are not tokens that any crawler identifies itself with, so rules written for them match nothing. Every AI crawler announces a specific user-agent string, and only rules addressed to those strings are honored.

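Rules aimed at tokens no crawler uses are easy to catch with a small lint step. The sketch below assumes a hand-maintained set of known tokens (KNOWN_AGENTS here is a short illustrative subset, not a complete registry) and flags every User-agent value that is not in it. The function name unknown_agents is hypothetical.

```python
# Illustrative subset of known tokens, lowercased; extend it from operator docs.
KNOWN_AGENTS = {
    "*", "googlebot", "bingbot", "gptbot", "chatgpt-user", "oai-searchbot",
    "claudebot", "anthropic-ai", "perplexitybot", "perplexity-user",
    "google-extended", "applebot-extended", "bytespider", "ccbot", "cohere-ai",
}


def unknown_agents(robots_txt: str, known=KNOWN_AGENTS) -> list:
    """Return User-agent values in robots_txt that are not in `known`."""
    found = []
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()        # drop comments
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "user-agent":
            token = value.strip()
            if token.lower() not in known and token not in found:
                found.append(token)
    return found


sample = (
    "User-agent: AI\nDisallow: /\n\n"
    "User-agent: LLM\nDisallow: /\n\n"
    "User-agent: GPTBot\nDisallow: /\n"
)
print(unknown_agents(sample))  # ['AI', 'LLM']
```

A flagged token is not necessarily wrong, since new crawlers appear regularly, but it is worth confirming against the operator's documentation.
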
Using robots.txt for sensitive content

robots.txt is a polite request, not an enforcement mechanism. Well-behaved crawlers respect it; less-well-behaved actors ignore it. For genuinely sensitive content, use authentication, IP allowlisting, or other enforcement, not robots.txt.

Forgetting that crawlers can be blocked at other layers

Your CDN, WAF, and bot-protection vendor may be blocking AI crawlers at the network layer regardless of what robots.txt says. If you have allowed an AI crawler in robots.txt but your CDN is still returning 403 to that user-agent, the crawler will not reach your content. Audit both layers.

Auditing whether AI crawlers can reach your site

Three ways to test, in increasing depth.

1. Fetch your robots.txt and read it

Open https://yoursite.com/robots.txt in a browser. Read it. Confirm the rules are what you intend. If there is no robots.txt file at all, the default behavior is to allow all bots, which is usually the right starting point.

2. Send a test request with each user-agent

Use curl with a specific user-agent to verify your site responds correctly:

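# Print only the HTTP status code and the response body size in bytes
curl -s -o /dev/null -w "%{http_code} %{size_download}\n" -A "GPTBot/1.0" https://yoursite.com/somepage
curl -s -o /dev/null -w "%{http_code} %{size_download}\n" -A "ClaudeBot" https://yoursite.com/somepage
curl -s -o /dev/null -w "%{http_code} %{size_download}\n" -A "PerplexityBot" https://yoursite.com/somepage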

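If the responses are HTTP 200 with content, the crawler can reach you. If they are 403, 404, or empty, something between the crawler and your origin is blocking the request. Treat the result as an approximation: some bot-protection products verify crawlers by source IP as well as user-agent, so a request from your own machine may be handled differently from the real crawler.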
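
The same check can be scripted for a longer list of user-agents. What follows is a minimal sketch using only Python's standard library. The function names (interpret, probe, audit) and the wording of the result labels are assumptions made for illustration, and the agent strings are the ones used in the curl commands above.

```python
import urllib.error
import urllib.request

AGENTS = ["GPTBot/1.0", "ClaudeBot", "PerplexityBot"]


def interpret(status: int, body_length: int) -> str:
    """Turn a status code and body size into a short, human-readable verdict."""
    if 200 <= status < 300:
        return "reachable" if body_length > 0 else "empty response"
    return f"blocked or missing (HTTP {status})"


def probe(url: str, agent: str, timeout: float = 10.0) -> str:
    """Request `url` with the given User-Agent header and classify the outcome."""
    request = urllib.request.Request(url, headers={"User-Agent": agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return interpret(response.status, len(response.read()))
    except urllib.error.HTTPError as err:   # 4xx and 5xx responses
        return interpret(err.code, 0)
    except OSError as err:                  # DNS failure, timeout, refused connection
        return f"no response ({err})"


def audit(url: str, agents=AGENTS) -> dict:
    """Probe one URL once per user-agent and collect the verdicts."""
    return {agent: probe(url, agent) for agent in agents}
```

Calling audit("https://yoursite.com/somepage") returns one verdict per user-agent. A mix of verdicts across agents for the same URL usually points at a CDN or WAF rule rather than robots.txt.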

3. Use a dedicated AI-bot audit

For continuous monitoring, an audit tool that checks AI-crawler accessibility across your whole site is useful. crawlcrawl's AI-bot audit endpoint returns the resolved policy for every major AI user-agent against any URL you give it, which makes ongoing monitoring straightforward.

Configuration patterns by platform

Where you actually edit robots.txt depends on how your site is built. A few common patterns.

Static sites and traditional servers

Put the file at the document root and let your web server serve it as plain text. This is the cleanest setup; the file is exactly what visitors see when they hit /robots.txt. Static site generators and frameworks copy the file unchanged from a static-assets location at build time: public/ in Next.js and Astro, static/ in Hugo, and the project root in Jekyll.

Cloudflare Pages, Vercel, and Netlify

Put robots.txt in your public/ or static/ directory depending on the framework. The deployment pipeline serves it as a static asset. Verify after deploy by fetching the live URL; some frameworks rewrite paths in unexpected ways.

CDN edge rules and Workers

Some teams synthesize robots.txt at the edge with a Cloudflare Worker or equivalent. This is useful when robots.txt rules should vary by environment (staging blocks all bots; production allows them). Verify the worker output matches your intent in every environment.
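
The environment switch itself is a few lines of logic, whatever runtime hosts it. Here is a minimal sketch in Python of only the decision, not of any particular edge platform's API. PRODUCTION_HOSTS, the two response bodies, and the function name robots_for_host are hypothetical.

```python
# Hypothetical production hostnames; every other host is treated as non-production.
PRODUCTION_HOSTS = {"yoursite.com", "www.yoursite.com"}

ALLOW_ALL = "User-agent: *\nAllow: /\n\nSitemap: https://yoursite.com/sitemap.xml\n"
BLOCK_ALL = "User-agent: *\nDisallow: /\n"


def robots_for_host(host: str) -> str:
    """Pick the robots.txt body from the request's Host header value."""
    hostname = host.split(":", 1)[0].strip().lower()  # drop any port, normalize case
    return ALLOW_ALL if hostname in PRODUCTION_HOSTS else BLOCK_ALL


print(robots_for_host("staging.yoursite.com") == BLOCK_ALL)   # True
print(robots_for_host("WWW.yoursite.com:443") == ALLOW_ALL)   # True
```

Defaulting unknown hosts to the blocking body is a deliberate choice in this sketch: a preview or staging hostname nobody remembered to list stays closed instead of open.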

WordPress and other CMSes

Most modern WordPress SEO plugins generate robots.txt dynamically. If your plugin's default conflicts with your AI-crawler intent, you usually need to edit it through the plugin's settings rather than uploading a raw robots.txt file. Whether an uploaded file or the plugin's generated output ends up being served depends on your server and plugin configuration, so fetch the live URL after any change.

The complementary file: llms.txt

robots.txt tells crawlers where they can go. llms.txt is a proposed, not yet formally standardized file that tells AI assistants what your site is and how to summarize it. The two complement each other: robots.txt controls access; llms.txt controls representation.

A well-crafted llms.txt is a structured, plain-text summary of your site (product list, pricing, key links, common questions) that AI assistants can ingest cheaply and cite confidently. Most teams who configure robots.txt for AI crawlers also publish llms.txt for the same reason: making it easy for AI assistants to understand what you do increases the quality of citations you receive.
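
For a sense of what such a file looks like, here is a small sketch that renders one. It assumes the Markdown layout described in the public llms.txt proposal (a title heading, a one-line summary as a blockquote, then sections of annotated links). The company name, summary, links, and the function name render_llms_txt are all made up for illustration.

```python
def render_llms_txt(name: str, summary: str, sections: dict) -> str:
    """Render llms.txt text: title, summary blockquote, then linked sections."""
    lines = [f"# {name}", "", f"> {summary}"]
    for heading, links in sections.items():
        lines += ["", f"## {heading}", ""]
        lines += [f"- [{title}]({url}): {note}" for title, url, note in links]
    return "\n".join(lines) + "\n"


# Hypothetical site details, for illustration only.
llms_txt = render_llms_txt(
    "Example Co",
    "Example Co sells widgets to small teams.",
    {
        "Docs": [
            ("Pricing", "https://yoursite.com/pricing", "Plans and limits"),
            ("FAQ", "https://yoursite.com/faq", "Common questions"),
        ],
    },
)
print(llms_txt)
```

The output is served at /llms.txt alongside robots.txt. Keeping it generated from the same source as your site navigation avoids the summary going stale.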

crawlcrawl's llms.txt builder generates the file for any site you point it at, which is the fastest way to ship a llms.txt without writing it by hand.

The robots.txt decision tree for 2026

A short decision tree that covers most situations.

  1. Is your business commercial and dependent on being found? Allow major AI crawlers. The referral traffic alone usually justifies the access.
  2. Are you a content business with original journalism or paywalled material? Consider the training-block-but-allow-retrieval pattern above. You stay citable while opting out of training-data extraction.
  3. Is your site internal-only, behind authentication, or genuinely private? robots.txt is the wrong tool. Use authentication and network controls. AI crawlers will not reach authenticated content regardless.
  4. Are you experimenting? Default to allow, monitor your AI-referral traffic for a quarter, then revise based on actual data.

Frequently asked questions

Should I allow AI crawlers in robots.txt?

For most public-facing commercial sites in 2026, allowing AI crawlers is the practical default. AI assistants are now significant traffic sources, and blocking their crawlers means your content cannot be cited in their answers.

How do I block GPTBot in robots.txt?

Add User-agent: GPTBot followed by Disallow: / on the next line. The same pattern applies to ChatGPT-User, ClaudeBot, anthropic-ai, PerplexityBot, and other AI crawlers.

What is the difference between GPTBot and ChatGPT-User?

GPTBot is OpenAI's training-data crawler. ChatGPT-User fetches pages on behalf of ChatGPT users in real time when they ask questions. Blocking GPTBot stops training-data collection; blocking ChatGPT-User prevents your content from being cited in live ChatGPT answers. You can choose either, both, or neither.

Will blocking AI crawlers hurt my Google SEO?

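Blocking AI crawlers does not directly affect Google's ranking algorithm. It does affect whether your content appears in ChatGPT, Claude, Perplexity, and similar surfaces. Google's AI Overviews are part of Search and follow the rules you set for Googlebot rather than for Google-Extended. Treat search ranking and AI visibility as separate strategic decisions.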

How do I block all AI crawlers at once?

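You must list each AI crawler's user-agent separately. The only wildcard, User-agent: *, applies to every crawler that has no group of its own, search engines included, so there is no wildcard that targets AI crawlers alone. The major ones to address are GPTBot, ChatGPT-User, ClaudeBot, anthropic-ai, PerplexityBot, Google-Extended, Applebot-Extended, Bytespider, CCBot, cohere-ai, and Meta-ExternalAgent.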
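
Listing each agent does not have to mean repeating the rule. The Robots Exclusion Protocol (RFC 9309) lets one group carry several User-agent lines that share the same rules. The sketch below builds such a group and checks it with Python's standard-library urllib.robotparser, which is an assumption standing in for real crawler behavior. Confirm in each operator's documentation that grouped lines are honored before relying on them. The block_group function name is hypothetical.

```python
from urllib.robotparser import RobotFileParser

AI_AGENTS = ["GPTBot", "ChatGPT-User", "ClaudeBot", "anthropic-ai",
             "PerplexityBot", "Google-Extended", "Applebot-Extended",
             "Bytespider", "CCBot", "cohere-ai", "Meta-ExternalAgent"]


def block_group(agents) -> str:
    """One robots.txt group: several User-agent lines sharing one Disallow rule."""
    return "\n".join(f"User-agent: {agent}" for agent in agents) + "\nDisallow: /\n"


robots_txt = "User-agent: *\nAllow: /\n\n" + block_group(AI_AGENTS)

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

url = "https://yoursite.com/somepage"
still_allowed = [a for a in AI_AGENTS if parser.can_fetch(a, url)]
print(still_allowed)                        # []
print(parser.can_fetch("Googlebot", url))   # True
```

Search engine crawlers stay on the default group, so the block does not spill over into ordinary search indexing.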

Does robots.txt apply to AI agents like ChatGPT plugins?

Generally yes, but the picture is evolving. Most AI assistants respect robots.txt for their retrieval crawlers; some plugins or third-party integrations may have their own user-agents. Audit the user-agents that actually hit your site over a quarter to see what is reaching you.

How often should I update my robots.txt for AI crawlers?

Quarterly is a reasonable cadence. New AI assistants and crawlers launch frequently, and existing ones occasionally change their user-agent strings. A quarterly review keeps your policy current without becoming a maintenance burden.

How do I validate that my robots.txt policy actually works?

Three checks cover most of the validation work. First, fetch the file directly, bypassing any cache, and confirm the syntax parses with a robots.txt validator. Second, search your server logs for the user-agent strings of crawlers you allowed and confirm they show up after the change took effect. Third, query an AI assistant like ChatGPT, Claude, or Perplexity with a question that should be answerable from your site and see whether your URL is cited. Citation flow is the ground-truth signal that the policy is having the intended effect.

The crawlcrawl AI-bot audit endpoint at GET /v1/robots-policy resolves the parsed policy for every major AI crawler against any URL you give it, which makes the first check a one-line API call instead of a manual file inspection. The endpoint is available on every tier, including the free one.
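
The log search in the second check can be a short script. This is a minimal sketch that counts lines mentioning each crawler name, assuming your access log records the User-Agent header on each line. The sample lines are invented for illustration, and the function name count_ai_hits is hypothetical.

```python
from collections import Counter

AI_AGENTS = ["GPTBot", "ChatGPT-User", "OAI-SearchBot", "ClaudeBot", "PerplexityBot"]


def count_ai_hits(log_lines, agents=AI_AGENTS) -> Counter:
    """Count log lines that mention each crawler name, case-insensitively."""
    hits = Counter()
    for line in log_lines:
        lowered = line.lower()
        for agent in agents:
            if agent.lower() in lowered:
                hits[agent] += 1
    return hits


# Invented sample lines in a combined-log-style layout, for illustration only.
sample = [
    '203.0.113.5 - - [16/May/2026:10:00:00 +0000] "GET /docs HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.9 - - [16/May/2026:10:01:00 +0000] "GET /robots.txt HTTP/1.1" 200 312 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
    '198.51.100.7 - - [16/May/2026:10:02:00 +0000] "GET / HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (X11; Linux x86_64)"',
]
hits = count_ai_hits(sample)
print(dict(hits))  # {'GPTBot': 1, 'ClaudeBot': 1}
```

On a real server, pass an open log file instead of the sample list. User-agent strings can be spoofed, so treat the counts as a signal that crawlers are arriving, not as proof of who sent each request.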

What happens if I change my mind after blocking an AI crawler?

Reversing a block is straightforward at the file level, but the practical recovery time depends on how long each crawler caches robots.txt. RFC 9309 advises crawlers not to rely on a cached copy for more than 24 hours, and most major AI crawlers pick up a change within roughly 24 to 48 hours, after which they re-evaluate the policy and resume crawling. Citation flow can lag the policy change further, sometimes by weeks, because an assistant's index and previously stored content take time to refresh. The pragmatic implication is to make blocking decisions deliberately, knowing the unwinding is slow even when the file change is instant.

The takeaway

robots.txt for AI crawlers in 2026 is a strategic file, not just a technical one. The decision about which AI assistants can read your content shapes how visible your brand is inside the answer surfaces where buyers increasingly find products. For most commercial sites, allowing the major AI crawlers is the default that maximizes discoverability and citation flow.

Start by reviewing your current robots.txt against the templates above. Choose the posture that fits your business. Audit at the CDN and WAF layers to confirm the policy actually reaches the wire. Then publish a complementary llms.txt to give AI assistants a clean, structured representation of your site to cite from. Together, the two files form the practical infrastructure layer for answer engine optimization (AEO) for the next several years.
