The New AI Crawlers
AI companies crawl the web to train their models and to power their search features. Each crawler identifies itself with its own user-agent token, which you can target in robots.txt. Here are the ones that matter:
- GPTBot: OpenAI's crawler for training data collection. Blocking this prevents your content from training future GPT models.
- OAI-SearchBot: OpenAI's crawler for ChatGPT Search. Blocking this prevents your pages from appearing in ChatGPT's real-time search results.
- PerplexityBot: Perplexity AI's crawler. Used for both indexing and real-time search answers.
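- Google-Extended: Google's robots.txt control token for Gemini training data. It is not a separate crawler: Google fetches pages with its existing user agents and checks this token to decide whether the content may be used for Gemini. Blocking it keeps your content out of Gemini training while leaving regular Google Search indexing intact.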
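- ClaudeBot / anthropic-ai: Anthropic's user-agent tokens for training data collection. ClaudeBot is the current crawler; anthropic-ai is an older token that many block lists still include.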
The Decision Framework
There are three valid approaches:
Option 1: Allow everything
Let all AI crawlers access your content. Your content can then be used to train AI models and can appear in AI search results. This maximizes AI visibility, but it also means you are supplying training data for free.
Best for: Brands that prioritize maximum visibility and do not mind contributing to AI training data.
Option 2: Allow search bots, block training bots
Allow OAI-SearchBot and PerplexityBot (so your content appears in AI search results). Block GPTBot, Google-Extended, and ClaudeBot (so your content does not train future models).
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /
Best for: Most businesses. You get AI search visibility without giving away training data for free.
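Before deploying rules like the ones in Option 2, it can help to check how a standards-following parser reads them. The sketch below is one way to do that with Python's standard-library urllib.robotparser; the helper name allowed_agents and the example.com URL are made up for illustration, and a real crawler's own parser may differ in edge cases.

```python
from urllib.robotparser import RobotFileParser

# The Option 2 rules, as they would appear in robots.txt.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /
"""


def allowed_agents(robots_txt: str, agents: list[str], url: str) -> dict[str, bool]:
    """Return {agent: may_fetch} for each user-agent token (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


AGENTS = ["GPTBot", "Google-Extended", "ClaudeBot",
          "OAI-SearchBot", "PerplexityBot", "Googlebot"]

# example.com is a placeholder; substitute a real page on your site.
result = allowed_agents(ROBOTS_TXT, AGENTS, "https://example.com/blog/post")
for agent, ok in result.items():
    print(f"{agent:16} {'allowed' if ok else 'blocked'}")
```

Googlebot is included in the check because it has no group of its own in these rules, so it falls through to the default of allowed. That makes it easy to confirm that the AI rules leave regular search crawling alone.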
Option 3: Block everything
Block all AI crawlers. Your content does not train models and does not appear in AI search results.
Best for: Publishers who monetize content through subscriptions or ads and view AI search as a traffic competitor, not a traffic source.
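For Option 3, the robots.txt is simply one disallow-all group per AI token. A minimal sketch that renders it from a list is shown here; the function name render_block_all is hypothetical, and the token list covers only the crawlers named in this article, so it would need extending as new ones appear.

```python
# Tokens named in this article; this is not an exhaustive list of AI crawlers.
AI_TOKENS = [
    "GPTBot",
    "OAI-SearchBot",
    "PerplexityBot",
    "Google-Extended",
    "ClaudeBot",
    "anthropic-ai",
]


def render_block_all(tokens: list[str]) -> str:
    """Build robots.txt text with one 'Disallow: /' group per token."""
    groups = [f"User-agent: {token}\nDisallow: /" for token in tokens]
    # Groups are separated by a blank line, following robots.txt convention.
    return "\n\n".join(groups) + "\n"


robots_txt = render_block_all(AI_TOKENS)
print(robots_txt)
```

Because the output never mentions Googlebot or a wildcard user-agent, regular search engine crawling is left untouched.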
What Most Businesses Should Do
Option 2 is the right choice for most businesses. You keep your visibility in AI search results, which drives brand awareness and some referral traffic, while preventing your content from being used as free training data.
Important Caveats
- Not all AI crawlers respect robots.txt, and some have been reported ignoring it. Robots.txt is a voluntary protocol: it states your preference but does not technically or legally enforce it.
- If you previously allowed crawling, your existing content may already be in training datasets. Blocking now only affects future crawling.
- Perplexity has faced criticism for not consistently honoring robots.txt directives. Monitor your logs to verify compliance.
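One way to monitor compliance is to scan your access logs for requests from crawlers you have disallowed. The sketch below assumes the common "combined" log format and uses invented sample lines with simplified user-agent strings; the function name count_blocked_hits and the list of blocked tokens are illustrative, so adapt both to your own server and robots.txt.

```python
import re
from collections import Counter

# Assumes the "combined" access log format; adjust the pattern for your server.
LOG_LINE = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]+\] "\S+ (?P<path>\S+) [^"]*" \d{3} \S+ '
    r'"[^"]*" "(?P<ua>[^"]*)"$'
)


def count_blocked_hits(log_lines, blocked_tokens):
    """Count requests from disallowed crawlers, ignoring robots.txt fetches."""
    hits = Counter()
    for line in log_lines:
        match = LOG_LINE.match(line.strip())
        if match is None:
            continue  # skip lines that do not fit the assumed format
        if match.group("path") == "/robots.txt":
            continue  # reading robots.txt is expected, not a violation
        user_agent = match.group("ua").lower()
        for token in blocked_tokens:
            if token.lower() in user_agent:
                hits[token] += 1
    return hits


# Invented sample lines; real user-agent strings are longer than these.
SAMPLE_LOG = [
    '203.0.113.5 - - [10/Mar/2025:10:00:00 +0000] "GET /robots.txt HTTP/1.1" 200 120 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.5 - - [10/Mar/2025:10:00:02 +0000] "GET /pricing HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '198.51.100.7 - - [10/Mar/2025:10:01:00 +0000] "GET /blog/ HTTP/1.1" 200 8042 "-" "Mozilla/5.0 (compatible; PerplexityBot/1.0)"',
    '192.0.2.44 - - [10/Mar/2025:10:02:00 +0000] "GET /pricing HTTP/1.1" 200 5120 "https://example.com/" "Mozilla/5.0 (X11; Linux x86_64) Firefox/128.0"',
]

hits = count_blocked_hits(SAMPLE_LOG, ["GPTBot", "ClaudeBot", "PerplexityBot"])
for token, count in hits.items():
    print(f"{token}: {count} request(s) to disallowed paths")
```

Two limits apply to this kind of check. Google-Extended never shows up in logs, because it is only a robots.txt token and not a user agent. And a crawler that ignores robots.txt may also send a generic or spoofed user-agent string, so a clean result here is not proof of compliance.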
Frequently Asked Questions
Does blocking GPTBot hurt my Google rankings?
No. GPTBot is a separate crawler from Googlebot. Blocking GPTBot has zero effect on Google Search rankings.
Can I block AI training but still appear in AI Overviews?
Yes. Google AI Overviews draw on pages crawled by Googlebot, which you should not block. Blocking Google-Extended opts your content out of Gemini training but does not prevent AI Overviews from citing it.