Controlling Web Crawlers in Vue & Nuxt
Introduction
A web crawler, also called a "spider" or "robot", is simply a program that scans web pages for data. While crawlers vary widely in complexity, we can build our own in one line of code.
```bash
curl https://nuxtseo.com -H "User-Agent: NuxtSEOBot"
```
```ts
fetch('https://nuxtseo.com', {
  headers: { 'User-Agent': 'NuxtSEOBot' }
})
```
Most crawlers will extract all links from the response and crawl all pages from there.
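Extending that one-liner into a (very naive) crawler takes only a few more lines. A rough sketch, assuming Node 18+ for the global `fetch` and using a regex instead of a proper HTML parser (a real crawler would also respect robots rules and limit depth):

```ts
// Naive crawler sketch: fetch a page, extract href values, and follow them.
async function crawl(url: string, seen = new Set<string>()) {
  if (seen.has(url)) return
  seen.add(url)
  const html = await fetch(url, { headers: { 'User-Agent': 'NuxtSEOBot' } })
    .then(res => res.text())
  // Very rough link extraction; real crawlers parse the HTML properly
  const links = [...html.matchAll(/href="(https?:\/\/[^"]+)"/g)].map(m => m[1])
  for (const link of links) await crawl(link, seen)
}
```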
There are many categories of web crawler, for example:
Search Engines
These crawlers are accessing your site to find pages that they can "index". An indexed page is one that can appear in search engine results pages (SERPs).
Learn more in our meta tags guide.
Social
These crawlers are mostly accessing your site to generate previews when shared on their platform.
See our canonical URLs guide for ensuring consistent social previews.
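Social crawlers build their previews mostly from Open Graph and Twitter meta tags. A minimal sketch of what they read, using `useSeoMeta` (the title, description, and image URL below are placeholder values):

```vue
<script setup>
// Open Graph / Twitter tags that social crawlers read to build a link preview.
// All values here are placeholders for your own content.
useSeoMeta({
  ogTitle: 'Controlling Web Crawlers',
  ogDescription: 'How to manage bots crawling your Nuxt site.',
  ogImage: 'https://example.com/og-image.png',
  twitterCard: 'summary_large_image'
})
</script>
```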
AI Crawlers
These crawlers are accessing your site to generate content or data for their AI models.
Learn to manage AI crawler access in our security guide.
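If you want to opt out of AI training crawlers, one starting point is a robots.txt group targeting their published user agents. GPTBot is OpenAI's crawler; other vendors document their own names, and robots.txt is only advisory, so well-behaved bots will respect it while others may not:

```
# Disallow OpenAI's training crawler; add groups for other AI user agents as needed
User-agent: GPTBot
Disallow: /
```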
Malicious Crawlers
While most of these crawlers can be good in the right context, some can be malicious or just not useful.
Malicious crawlers can easily mask the user agent of a regular user and ignore your robot rules while they search for vulnerabilities or scrape your content.
```bash
# Pretend to be a Chrome browser on Linux, see if they leaked their .env
curl https://nuxtseo.com/.env -H "User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36"
```
```ts
// Pretend to be a Chrome browser on Linux, see if they leaked their .env
fetch('https://nuxtseo.com/.env', {
  headers: {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36'
  }
})
```
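Robot rules won't stop requests like these, since the crawler simply ignores them. A minimal defensive sketch, assuming a Nuxt server and a hypothetical denylist of probe paths (a Web Application Firewall, covered below, is the more robust option):

```ts
// server/middleware/block-probes.ts
// Rejects a few common vulnerability-probe paths before they reach the app.
// The path list is a hypothetical example, not an exhaustive denylist.
import { defineEventHandler, createError } from 'h3'

export default defineEventHandler((event) => {
  const probePaths = ['/.env', '/wp-admin', '/.git/config']
  if (probePaths.some((path) => event.path.startsWith(path))) {
    // Respond with a 404 so the probe learns nothing
    throw createError({ statusCode: 404 })
  }
})
```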
Controlling Crawlers
Your site is most likely already getting thousands of these bot visits.
At a high level, we control these bots using:
- `robots.txt`: Tells specific crawlers what they can and can't access
- `sitemap.xml`: Tells crawlers what pages are on your site
- `<meta name="robots" content="...">`: Tells search engine crawlers how to index your page
- `X-Robots-Tag`: The same as the meta tag, but sent as an HTTP header, which is useful for files like PDFs
- `<link rel="canonical" href="...">`: Tells search engine crawlers which URL is the preferred version of a page
- HTTP Redirects: Redirect bots and keep your page's SEO benefits
- Web Application Firewall: At a network level, identify and block malicious bots from accessing your site
| Control Mechanism | Best Used When |
| --- | --- |
| Nuxt Robots (`robots.txt`) | - Blocking large sections of your site (e.g., `/admin`)<br>- Managing crawler bandwidth on heavy pages<br>- Preventing crawling of development assets |
| Nuxt Robots (`<meta name="robots">`) | - Controlling indexing for individual pages<br>- Managing dynamic content (search results, filtered pages)<br>- Setting page-specific crawler directives |
| Nuxt Sitemap (`sitemap.xml`) | - Helping crawlers discover new pages<br>- Prioritizing important pages<br>- Managing large sites with many pages |
| Nuxt Robots (`X-Robots-Tag` header) | - Controlling non-HTML resources (PDFs, images)<br>- Managing API responses<br>- Setting crawler directives for files |
| Nuxt SEO Utils (canonical URLs) | - Managing duplicate content from URL parameters<br>- Consolidating mobile/desktop variants<br>- Handling cross-domain syndicated content |
| HTTP Redirects | - Moving pages permanently (301 redirects)<br>- Preserving SEO value when restructuring<br>- Managing legacy URLs |
| Web Application Firewall | - Blocking malicious bots<br>- Filtering high-volume crawler traffic<br>- Protecting against content scraping |
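Most of the mechanisms in the table map onto Nuxt SEO modules. A minimal sketch of enabling them in `nuxt.config.ts` (package names are current as of writing and may differ between versions; `@nuxtjs/seo` bundles them all):

```ts
// nuxt.config.ts
export default defineNuxtConfig({
  modules: [
    '@nuxtjs/robots',  // robots.txt, <meta name="robots">, X-Robots-Tag
    '@nuxtjs/sitemap', // sitemap.xml
    'nuxt-seo-utils'   // canonical URLs and other SEO defaults
  ],
  // Most of these modules need to know your canonical site URL
  site: { url: 'https://example.com' }
})
```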
Quick Implementation Guide
Need to set up crawler control quickly? Here are some recipes:
Block a page from being indexed
```vue
<script setup>
useSeoMeta({
  robots: 'noindex, follow'
})
</script>
```
Avoid duplicate content issues
```vue
<script setup>
useHead({
  // make sure you use an absolute URL
  link: [{ rel: 'canonical', href: 'https://mysite.com/secret' }]
})
</script>
```
Block a group of pages
```
User-agent: *
Disallow: /admin
```
Redirect to a new URL
```ts
export default defineNuxtConfig({
  routeRules: {
    '/old-url': {
      // keep SEO benefits by using a 301 redirect
      redirect: { to: '/new-url', statusCode: 301 }
    }
  }
})
```
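Set crawler directives for non-HTML resources

For files such as the PDFs mentioned in the table above, you can't use a meta tag. A hedged sketch using Nitro route rules to send an `X-Robots-Tag` header instead (the `/downloads/**` path is a placeholder):

```ts
export default defineNuxtConfig({
  routeRules: {
    // stop search engines from indexing generated files (placeholder path)
    '/downloads/**': {
      headers: { 'X-Robots-Tag': 'noindex' }
    }
  }
})
```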
Why Control Crawlers?
Doing nothing about crawlers is a completely viable solution. You will not be inherently penalized for not managing them.
However, for sites looking to optimize their organic traffic, protect their content, or reduce server load, managing crawlers can be beneficial.
Improve organic traffic
Making sure that only the pages that belong in search engine results pages (SERPs) are indexed may improve your organic traffic.
This can especially be a problem when multiple pages are indexed with the same content, leading to duplicate content issues.
For example, having both `/about` and `/about/` (trailing slash) indexed as separate URLs splits ranking signals across two identical pages.
As search engine crawlers have a crawl budget, telling them how to crawl your site efficiently can help them index more of your pages, more frequently.
Additionally, helping crawlers understand when you move or delete content, using HTTP redirects and status codes, helps make sure you don't lose your search engine ranking.
Protecting Content
Most sites will be operating with different environments such as testing, staging, and production.
By default, search engines will index any public environment they can access, which can lead to duplicate content issues and confusion for end users when these environments appear in the SERP.
Similarly, we may find our authenticated pages being indexed if we have public pages linking to them.
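A hedged way to keep staging and other non-production environments out of the index is to send a global `noindex` header based on an environment variable. The `NUXT_ENV` name below is an assumption; use whatever flag your host actually provides:

```ts
// nuxt.config.ts
// Hypothetical environment flag; adjust to your deployment setup
const isProduction = process.env.NUXT_ENV === 'production'

export default defineNuxtConfig({
  routeRules: {
    // On non-production deployments, ask crawlers not to index anything
    '/**': isProduction ? {} : { headers: { 'X-Robots-Tag': 'noindex, nofollow' } }
  }
})
```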
Reducing server load
While each individual request costs almost nothing, you are effectively paying for CPU time every time a bot visits your site. Filtering incoming crawler requests can be an effective way to reduce server load.
This can be more apparent on pages that are expensive to render, such as pages with a lot of dynamic content or pages using third-party services.
For example:
- You may be paying for crawlers to access your Google Maps embed
- Crawlers may be spending a long time in infinite scroll pages
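One hedged fix for both examples is to disallow the expensive routes in robots.txt; the paths below are placeholders for wherever your map embed and infinite scroll actually live:

```
# Keep well-behaved crawlers off expensive pages (placeholder paths)
User-agent: *
Disallow: /store-locator
Disallow: /feed
```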