Robots Txt
- robots.txt is advisory: crawlers can ignore it, so never use it for security
- Use Nuxt Robots module for environment-aware generation (auto-block staging)
- In 2026, use Content-Signal and Content-Usage for granular AI governance
- Include a sitemap reference and distinguish between search and AI training bots
The robots.txt file controls which parts of your site crawlers can access. Officially adopted as RFC 9309 in September 2022, it's primarily used to manage crawl budget on large sites and to govern AI bots.
In 2026, robots.txt has evolved from a simple "block/allow" file into a sophisticated policy document for the Agentic Web. It's how you tell AI models whether they can use your data for training, real-time answering, or agentic actions.
Quick Setup
Static robots.txt
For simple static rules, add the file in your public directory:
public/
robots.txt
Add your rules:
# Allow all crawlers
User-agent: *
Disallow:
# Optionally point to your sitemap
Sitemap: https://mysite.com/sitemap.xml
Server Route
For custom dynamic generation, create a server route:
// server/routes/robots.txt.ts
export default defineEventHandler((event) => {
const isDev = process.env.NODE_ENV !== 'production'
const robots = isDev
? 'User-agent: *\nDisallow: /'
: 'User-agent: *\nDisallow:\nSitemap: https://mysite.com/sitemap.xml'
setHeader(event, 'Content-Type', 'text/plain')
return robots
})
Automatic Generation with Module
For environment-aware generation (auto-block staging), use the Nuxt Robots module:
Install the module:
npx nuxi@latest module add robots
The module automatically generates robots.txt with zero config. For environment-specific rules:
export default defineNuxtConfig({
modules: ['@nuxtjs/robots'],
robots: {
disallow: process.env.NODE_ENV !== 'production' ? '/' : undefined
}
})
Robots.txt Syntax
The robots.txt file consists of directives grouped by user agent. Google uses the most specific matching rule based on path length:
# Define which crawler these rules apply to
User-agent: *
# Block access to specific paths
Disallow: /admin
# Allow access to specific paths (optional, more specific than Disallow)
Allow: /admin/public
# Point to your sitemap
Sitemap: https://mysite.com/sitemap.xml
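To make the precedence rule concrete, here is a minimal TypeScript sketch of longest-match resolution. It assumes plain prefix rules (no wildcards) and is an illustration, not a full RFC 9309 parser:

```typescript
// Sketch of the "most specific rule wins" behavior:
// the longest matching path decides, and Allow wins a length tie.
type Rule = { type: 'allow' | 'disallow'; path: string }

function isAllowed(urlPath: string, rules: Rule[]): boolean {
  // Collect every rule whose path is a prefix of the URL path
  const matches = rules.filter(r => urlPath.startsWith(r.path))
  if (matches.length === 0) return true // no matching rule: crawling allowed

  // Sort by specificity: longest path first, Allow before Disallow on ties
  matches.sort((a, b) =>
    b.path.length - a.path.length || (a.type === 'allow' ? -1 : 1)
  )
  return matches[0].type === 'allow'
}

// Rules from the example above
const rules: Rule[] = [
  { type: 'disallow', path: '/admin' },
  { type: 'allow', path: '/admin/public' },
]
```

With these rules, `/admin/settings` is blocked by `/admin`, while `/admin/public/page` is allowed because the longer `Allow: /admin/public` rule is more specific.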
User-agent
The User-agent directive specifies which crawler the rules apply to:
# All crawlers
User-agent: *
# Just Googlebot
User-agent: Googlebot
# Multiple specific crawlers
User-agent: Googlebot
User-agent: Bingbot
Disallow: /private
Common crawler user agents:
- Googlebot: Google's search crawler (28% of all bot traffic in 2025)
- Google-Extended: Google's AI training crawler (separate from search)
- GPTBot: OpenAI's AI training crawler (7.5% of bot traffic)
- ClaudeBot: Anthropic's AI training crawler
- CCBot: Common Crawl's dataset builder (frequently blocked)
- Bingbot: Microsoft's search crawler
- FacebookExternalHit: Facebook's link preview crawler
Allow / Disallow
The Allow and Disallow directives control path access:
User-agent: *
# Block all paths starting with /admin
Disallow: /admin
# Block a specific file
Disallow: /private.html
# Block files with specific extensions
Disallow: /*.pdf$
# Block URL parameters
Disallow: /*?*
Wildcards supported (RFC 9309):
- * matches zero or more characters
- $ matches the end of the URL
- Paths are case-sensitive and relative to the domain root
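The wildcard semantics can be sketched in a few lines of TypeScript. This illustrates the * and $ rules above (treating only a trailing $ as an anchor, per Google's documented behavior):

```typescript
// Convert a robots.txt path pattern into a RegExp.
// * → zero or more characters; trailing $ → end-of-URL anchor.
function patternToRegExp(pattern: string): RegExp {
  // Escape regex metacharacters, including literal . and $ in paths
  let escaped = pattern.replace(/[.+?^${}()|[\]\\]/g, '\\$&')
  escaped = escaped.replace(/\*/g, '.*')   // * matches any characters
  escaped = escaped.replace(/\\\$$/, '$')  // restore trailing $ as an anchor
  return new RegExp('^' + escaped)         // rules match from the path start
}
```

For example, `patternToRegExp('/*.pdf$')` matches `/docs/file.pdf` but not `/docs/file.pdf?x=1`, because the $ anchors the match to the end of the URL.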
Sitemap
The Sitemap directive tells crawlers where to find your sitemap.xml:
Sitemap: https://mysite.com/sitemap.xml
# Multiple sitemaps
Sitemap: https://mysite.com/products-sitemap.xml
Sitemap: https://mysite.com/blog-sitemap.xml
With the Nuxt Sitemap module, the sitemap URL is automatically added to your robots.txt.
Crawl-Delay (Non-Standard)
Crawl-Delay is not part of RFC 9309. Google ignores it. Bing and Yandex support it:
User-agent: Bingbot
Crawl-delay: 10 # seconds between requests
For Google, crawl rate is managed in Search Console.
Security: Why robots.txt Fails
Robots.txt is not a security mechanism. Malicious crawlers ignore it, and listing paths in Disallow reveals their location to attackers.
Common mistake:
# ❌ Advertises your admin panel location
User-agent: *
Disallow: /admin
Disallow: /wp-admin
Disallow: /api/internal
Use proper authentication instead. See our security guide for details.
Crawling vs Indexing
Blocking a URL in robots.txt prevents crawling but doesn't prevent indexing. If other sites link to the URL, Google can still index it without crawling, showing the URL with no snippet.
To prevent indexing:
- Use a noindex meta tag (requires allowing crawl)
- Use password protection or authentication
- Return 404/410 status codes
Don't block pages with noindex in robots.txt. Google can't see the tag if it can't crawl.
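In a Nuxt page, the noindex tag can be set with the useHead composable (the placement below is an illustrative sketch):

```typescript
// In a page component's <script setup lang="ts"> block.
// useHead is Nuxt's built-in head-management composable.
useHead({
  meta: [
    // Ask search engines not to index this page.
    // robots.txt must still allow crawling so the tag is seen.
    { name: 'robots', content: 'noindex' }
  ]
})
```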
Common Mistakes
1. Blocking JavaScript and CSS
Google needs JavaScript and CSS to render pages. Blocking them breaks indexing:
# ❌ Prevents Google from rendering your Nuxt app
User-agent: *
Disallow: /assets/
Disallow: /*.js$
Disallow: /*.css$
Nuxt apps are JavaScript-heavy. Never block .js, .css, or /assets/ from Googlebot.
2. Blocking Dev Sites in Production
Copy-pasting a dev robots.txt to production blocks all crawlers:
# ❌ Accidentally left from staging
User-agent: *
Disallow: /
The Nuxt Robots module handles this automatically based on environment.
3. Confusing robots.txt with noindex
Blocking pages doesn't remove them from search results. Use noindex meta tags for that.
Testing Your robots.txt
- Check syntax: visit https://yoursite.com/robots.txt to confirm it loads
- Google Search Console's robots.txt report shows fetch status and parse errors
- Verify crawlers can access it: check server logs for 200 status on /robots.txt
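As an extra pre-deploy sanity check, a few lines of TypeScript can catch obvious directive typos. The directive list here is deliberately minimal, not exhaustive:

```typescript
// Flag lines that aren't blank, comments, or known "Directive: value" pairs.
const KNOWN = ['user-agent', 'allow', 'disallow', 'sitemap', 'crawl-delay']

function lintRobots(body: string): string[] {
  const errors: string[] = []
  body.split('\n').forEach((line, i) => {
    const trimmed = line.trim()
    if (!trimmed || trimmed.startsWith('#')) return // skip blanks and comments
    const match = trimmed.match(/^([A-Za-z-]+)\s*:/)
    if (!match || !KNOWN.includes(match[1].toLowerCase())) {
      errors.push(`line ${i + 1}: unrecognized directive "${trimmed}"`)
    }
  })
  return errors
}
```

Running this against a file with `Disalow: /` would report the misspelled directive, a mistake that silently disables the rule in production.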
Common Patterns
Allow Everything (Default)
User-agent: *
Disallow:
Block Everything
Useful for staging or development environments.
User-agent: *
Disallow: /
See our security guide for more on environment protection.
Block AI Training Crawlers
GPTBot was the most blocked bot in 2024, fully disallowed by 250 domains. Blocking AI training bots doesn't affect search rankings:
# Block AI model training (doesn't affect Google search)
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: CCBot
User-agent: Google-Extended
Disallow: /
Google-Extended is separate from Googlebot; blocking it won't hurt search visibility.
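With the Nuxt Robots module, the same block can be expressed in config. This sketch follows the module's group-based config shape; verify the exact option names against the module docs for your installed version:

```typescript
// nuxt.config.ts — block AI training crawlers while leaving search untouched
export default defineNuxtConfig({
  modules: ['@nuxtjs/robots'],
  robots: {
    groups: [{
      // AI training bots (Googlebot itself remains unaffected)
      userAgent: ['GPTBot', 'ClaudeBot', 'CCBot', 'Google-Extended'],
      disallow: ['/']
    }]
  }
})
```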
AI Directives: Content-Usage & Content-Signal
In 2026, blocking user agents isn't always enough. Two emerging standards let you express granular preferences about how AI systems use your content without blocking crawlers entirely. This is crucial for AI Search Optimization (ASO): you want to be indexed for search but may want to opt-out of model training.
- Content-Usage (IETF aipref-vocab): uses y/n values for train-ai and serve-ai.
- Content-Signal (Cloudflare): uses yes/no values for search, ai-input, ai-train.
User-agent: *
Allow: /
# 2026 AI Governance Strategy
# Allow AI for real-time answers (ASO), but block training
Content-Usage: train-ai=n, serve-ai=y
Content-Signal: search=yes, ai-input=yes, ai-train=no
Nuxt Implementation
With the Nuxt Robots module, configure these signals programmatically in nuxt.config.ts:
export default defineNuxtConfig({
robots: {
groups: [{
userAgent: '*',
allow: '/',
contentUsage: {
'train-ai': 'n',
'serve-ai': 'y'
},
contentSignal: {
'ai-train': 'no',
'ai-input': 'yes',
'search': 'yes'
}
}]
}
})
Why allow ai-input? Real-time AI tools like Perplexity or ChatGPT Search use it to provide citations. If you block it, you won't appear as a source in AI-generated answers.
Block Search, Allow Social Sharing
For private sites where you still want link previews:
# Block search engines
User-agent: Googlebot
User-agent: Bingbot
Disallow: /
# Allow social link preview crawlers
User-agent: facebookexternalhit
User-agent: Twitterbot
User-agent: Slackbot
Allow: /
Optimize Crawl Budget for Large Sites
If you have 10,000+ pages, block low-value URLs to focus crawl budget on important content:
User-agent: *
# Block internal search results
Disallow: /search?
# Block infinite scroll pagination
Disallow: /*?page=
# Block filtered/sorted product pages
Disallow: /products?*sort=
Disallow: /products?*filter=
# Block print versions
Disallow: /*/print
Sites under 1,000 pages don't need crawl budget optimization.