Web crawlers determine what gets indexed and how often. Controlling them affects your crawl budget—the number of pages Google will crawl on your site in a given timeframe.
Most sites don't need to worry about crawl budget. But if you have 10,000+ pages, frequently updated content, or want to block AI training bots, crawler control matters.
Crawlers fall into four broad categories:

- Search engines — Index your pages for search results
- Social platforms — Generate link previews when shared
- AI training — Scrape content for model training
- Malicious — Ignore robots.txt, spoof user agents, scan for vulnerabilities. Block these at the firewall level, not with robots.txt.
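Because these bots spoof user agents, filtering on the User-Agent header only catches the careless ones; real blocking belongs at the firewall or CDN. If nothing sits in front of your server, a Nitro middleware like the sketch below can at least reject obvious scanners (the file path and patterns are illustrative, not a vetted blocklist):

```ts
// server/middleware/block-scanners.ts (illustrative path)
// Rejects requests whose User-Agent matches common vulnerability scanners.
// A stopgap only: bots that spoof their user agent sail straight past this check.
const BLOCKED_UA = /sqlmap|nikto|masscan/i // example patterns, not exhaustive

export default defineEventHandler((event) => {
  const ua = getHeader(event, 'user-agent') ?? ''
  if (BLOCKED_UA.test(ua)) {
    throw createError({ statusCode: 403, statusMessage: 'Forbidden' })
  }
})
```

Everything below focuses on the cooperative crawlers.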
| Mechanism | Use When |
|---|---|
| robots.txt | Block site sections, manage crawl budget, block AI bots |
| Sitemaps | Help crawlers discover pages, especially on large sites |
| Meta robots | Control indexing per page (noindex, nofollow) |
| Canonical URLs | Consolidate duplicate content, handle URL parameters |
| Redirects | Preserve SEO when moving/deleting pages |
| llms.txt | Guide AI tools to your documentation (via nuxt-llms) |
| X-Robots-Tag | Control non-HTML files (PDFs, images) |
| Firewall | Block malicious bots at network level |
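Most of these show up in the recipes below. X-Robots-Tag is the odd one out: meta robots tags only work on HTML, so non-HTML files need the header instead. In Nuxt, a route rule can add it (a sketch; the `/downloads/**` path is a placeholder for wherever your PDFs live):

```ts
export default defineNuxtConfig({
  routeRules: {
    // Non-HTML assets can't carry a meta robots tag, so send the header instead
    '/downloads/**': { headers: { 'X-Robots-Tag': 'noindex' } }
  }
})
```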
Block page from indexing — Full guide
```vue
<script setup>
useSeoMeta({ robots: 'noindex, follow' })
</script>
```
Block AI training bots — Full guide
```txt
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: CCBot
Disallow: /
```
Fix duplicate content — Full guide
```vue
<script setup>
const route = useRoute()

useHead({
  link: [{ rel: 'canonical', href: `https://mysite.com/products/${route.params.id}` }]
})
</script>
```
Redirect moved page — Full guide
```ts
export default defineNuxtConfig({
  routeRules: {
    '/old-url': { redirect: { to: '/new-url', statusCode: 301 } }
  }
})
```
Most small sites don't need to optimize crawler behavior. But it matters when:
- Crawl budget concerns — Sites with 10,000+ pages need Google to prioritize important content. Block low-value pages (search results, filtered products, admin areas) so crawlers focus on what matters; see the robots.txt sketch after this list.
- Duplicate content — URLs like /about and /about/ compete against each other, as do ?sort=price variations. Canonical tags consolidate these.
- Staging environments — Search engines index any public site they find. Block staging/dev environments in robots.txt so they don't create duplicate content; see the environment-aware robots.txt sketch after this list.
- AI training opt-out — GPTBot was the most-blocked crawler in 2024. Block AI training bots without affecting search rankings.
- Server costs — Bots consume CPU. Heavy pages (maps, infinite scroll, SSR) cost money per request; blocking unnecessary crawlers reduces load.
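For the crawl budget case, the robots.txt pattern is the same as the AI-bot recipe above, just pointed at low-value sections; the paths here are examples, swap in your own:

```txt
User-agent: *
Disallow: /search
Disallow: /admin/
Disallow: /*?filter=
```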
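For staging, one approach (assuming no robots module is already generating the file) is a server route that serves a restrictive robots.txt whenever an environment variable marks the deploy as non-production; `NUXT_SITE_ENV` is an assumed variable name here:

```ts
// server/routes/robots.txt.ts (skip this if a robots module already generates the file)
export default defineEventHandler((event) => {
  setHeader(event, 'content-type', 'text/plain')
  // NUXT_SITE_ENV is an assumption; use whatever variable flags your staging deploys
  const isProduction = process.env.NUXT_SITE_ENV === 'production'
  return isProduction
    ? 'User-agent: *\nAllow: /'
    : 'User-agent: *\nDisallow: /'
})
```

Keep in mind that robots.txt only stops crawling; a staging URL that is already linked elsewhere can still appear in results, so HTTP auth or a noindex header is the more airtight option.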
Nuxt handles crawler control through dedicated modules. Install once, configure in nuxt.config.ts, and forget about it.
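As a rough sketch of what that looks like (module names from the table above; the `groups` shape assumes the current @nuxtjs/robots options, so check its docs before copying):

```ts
export default defineNuxtConfig({
  modules: ['@nuxtjs/robots', '@nuxtjs/sitemap'],
  robots: {
    // Assumed option shape: one robots.txt group that blocks AI training bots
    groups: [
      { userAgent: ['GPTBot', 'ClaudeBot', 'CCBot'], disallow: ['/'] }
    ]
  }
})
```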