Robots.txt for AI Bots: Should You Block GPTBot, PerplexityBot, and Google-Extended?

A practical guide to handling AI crawlers in robots.txt. Covers GPTBot, PerplexityBot, Google-Extended, and OAI-SearchBot with clear recommendations on what to block and what to allow.

April 27, 2026

The New AI Crawlers

AI companies crawl the web to train their models and to power their search features. Each crawler or product is addressed by its own user-agent token, which you can control through robots.txt. Here are the ones that matter:

  • GPTBot: OpenAI's crawler for training data collection. Blocking this prevents your content from training future GPT models.
  • OAI-SearchBot: OpenAI's crawler for ChatGPT Search. Blocking this prevents your pages from appearing in ChatGPT's real-time search results.
  • PerplexityBot: Perplexity AI's crawler. Used for both indexing and real-time search answers.
  • Google-Extended: Google's robots.txt control token for Gemini training data. It is not a separate crawler; Google's existing crawlers do the fetching. Blocking this token prevents your content from training Gemini while keeping regular Google Search indexing intact.
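  • ClaudeBot / anthropic-ai: Anthropic's tokens for training data crawling. ClaudeBot is the current crawler; anthropic-ai is an older token that some sites still list.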

The Decision Framework

There are three valid approaches:

Option 1: Allow everything

Let all AI crawlers access your content. Your content trains AI models and appears in AI search results. This maximizes AI visibility but means your content is used as free training data.

Best for: Brands that prioritize maximum visibility and do not mind contributing to AI training data.

Option 2: Allow search bots, block training bots

Allow OAI-SearchBot and PerplexityBot (so your content appears in AI search results). Block GPTBot, Google-Extended, and ClaudeBot (so your content does not train future models).

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

Best for: Most businesses. You get AI search visibility without giving away training data for free.
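
One way to sanity-check the Option 2 rules before deploying them is to parse them with Python's standard urllib.robotparser module and ask which tokens may fetch a page. This is a minimal sketch, not a definitive test: the helper name allowed_tokens and the example.com URL are placeholders, and the standard-library parser only approximates how each real crawler interprets robots.txt.

```python
from urllib.robotparser import RobotFileParser

# The Option 2 rules from this section, as they would appear in robots.txt.
OPTION_2_RULES = """\
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /
"""


def allowed_tokens(rules: str, tokens: list[str], url: str) -> dict[str, bool]:
    """Return, for each user-agent token, whether the rules let it fetch url."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return {token: parser.can_fetch(token, url) for token in tokens}


if __name__ == "__main__":
    tokens = ["GPTBot", "Google-Extended", "ClaudeBot",
              "OAI-SearchBot", "PerplexityBot", "Googlebot"]
    # Placeholder URL; substitute a real page from your own site.
    for token, ok in allowed_tokens(OPTION_2_RULES, tokens,
                                    "https://example.com/pricing").items():
        print(f"{token}: {'allowed' if ok else 'blocked'}")
```

Googlebot is included in the check because no rule mentions it, so it should come back as allowed. That confirms the AI rules leave regular search crawling untouched.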

Option 3: Block everything

Block all AI crawlers. Your content does not train models and does not appear in AI search results.

Best for: Publishers who monetize content through subscriptions or ads and view AI search as a traffic competitor, not a traffic source.
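
The three options differ only in which tokens receive a Disallow rule. As an illustration, the sketch below generates the robots.txt rules for each option from two lists. The function build_robots_txt and the list names are hypothetical helpers, and the grouping into training and search bots simply follows the one used in this article.

```python
# Tokens grouped the way this article groups them (an assumption of the sketch).
TRAINING_BOTS = ["GPTBot", "Google-Extended", "ClaudeBot"]
SEARCH_BOTS = ["OAI-SearchBot", "PerplexityBot"]


def build_robots_txt(blocked, allowed=()):
    """Build robots.txt rules: one group per token, blocked tokens first."""
    groups = [f"User-agent: {token}\nDisallow: /" for token in blocked]
    groups += [f"User-agent: {token}\nAllow: /" for token in allowed]
    return "\n\n".join(groups) + "\n"


# Option 1: allow everything (explicit Allow rules; an empty file has the same effect).
OPTION_1 = build_robots_txt(blocked=[], allowed=TRAINING_BOTS + SEARCH_BOTS)
# Option 2: allow search bots, block training bots.
OPTION_2 = build_robots_txt(blocked=TRAINING_BOTS, allowed=SEARCH_BOTS)
# Option 3: block everything.
OPTION_3 = build_robots_txt(blocked=TRAINING_BOTS + SEARCH_BOTS)

if __name__ == "__main__":
    print(OPTION_3)
```

Append the generated rules to your existing robots.txt rather than replacing it, so that rules for other crawlers stay in place.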

What Most Businesses Should Do

Option 2 is the right choice for most businesses. You maintain visibility in AI search results (which drives brand awareness and some referral traffic) while preventing your content from being used as free training data.

Important Caveats

  • Not all AI companies respect robots.txt, and some crawlers have been reported ignoring it. Robots.txt is a voluntary protocol: it asks crawlers to stay out, but it is neither an access control nor a legal enforcement mechanism.
  • If you previously allowed crawling, your existing content may already be in training datasets. Blocking now only affects future crawling.
  • Perplexity has faced criticism for not consistently honoring robots.txt directives. Monitor your logs to verify compliance.
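
For the log monitoring mentioned above, a rough starting point is to count requests whose user-agent string contains a known AI crawler token. The sketch below assumes access logs in the common "combined" format, where the request is the first quoted field and the user-agent is the last. The sample lines and their simplified user-agent strings are made up for illustration, and count_ai_bot_hits is a hypothetical helper. User-agent strings can be spoofed, and Google-Extended will not show up at all because it is a robots.txt token rather than a crawler with its own user-agent string, so treat the counts as a hint, not proof.

```python
import re
from collections import Counter

# Tokens to look for in the user-agent field (case-insensitive substring match).
AI_BOT_TOKENS = ["GPTBot", "OAI-SearchBot", "PerplexityBot", "ClaudeBot"]
QUOTED_FIELD = re.compile(r'"([^"]*)"')


def count_ai_bot_hits(log_lines):
    """Count page requests per AI bot token, ignoring fetches of /robots.txt."""
    hits = Counter()
    for line in log_lines:
        fields = QUOTED_FIELD.findall(line)
        if len(fields) < 2:
            continue  # not a combined-format line; skip it
        request, user_agent = fields[0], fields[-1].lower()
        parts = request.split()
        path = parts[1] if len(parts) >= 2 else ""
        if path == "/robots.txt":
            continue  # reading robots.txt is expected even from blocked bots
        for token in AI_BOT_TOKENS:
            if token.lower() in user_agent:
                hits[token] += 1
    return hits


# Made-up sample lines with simplified user-agent strings.
SAMPLE_LOG = [
    '203.0.113.5 - - [27/Apr/2026:10:00:00 +0000] "GET /robots.txt HTTP/1.1" 200 120 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.5 - - [27/Apr/2026:10:00:05 +0000] "GET /blog/post HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '198.51.100.7 - - [27/Apr/2026:10:01:00 +0000] "GET /pricing HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (compatible; PerplexityBot/1.0)"',
    '192.0.2.9 - - [27/Apr/2026:10:02:00 +0000] "GET / HTTP/1.1" 200 1024 "https://example.com/" "Mozilla/5.0 (Windows NT 10.0)"',
]

if __name__ == "__main__":
    for token, count in count_ai_bot_hits(SAMPLE_LOG).items():
        print(f"{token}: {count} page request(s)")
```

If a token you have disallowed shows page requests after your robots.txt change has been live for a while, that is a signal worth investigating further.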

Frequently Asked Questions

Does blocking GPTBot hurt my Google rankings?

No. GPTBot is a separate crawler from Googlebot. Blocking GPTBot has zero effect on Google Search rankings.

Can I block AI training but still appear in AI Overviews?

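Yes, partially. Google AI Overviews are part of Google Search and rely on Googlebot (which you should not block). Blocking Google-Extended prevents Gemini training but does not prevent AI Overviews from citing your content. That is why the answer is only partial: Google-Extended gives you no control over AI Overviews themselves.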
