Controlling Web Crawlers in Vue & Nuxt

Learn how to effectively manage web crawlers in Vue and Nuxt applications to optimize SEO and protect your content.

Introduction

A web crawler, also called a "spider" or "robot", is simply a program that scans web pages for data. While crawlers vary widely in complexity, we can build our own in one line of code.

curl https://nuxtseo.com -H "User-Agent: NuxtSEOBot"

Most crawlers will extract all links from the response and crawl all pages from there.
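To make the link-following step concrete, here is a minimal sketch of how a crawler might pull links out of a response. The `extractLinks` helper is hypothetical and uses a regular expression for brevity; a real crawler would more likely use a proper HTML parser.

```typescript
// Hypothetical helper: collect the absolute URLs of all <a href> links in a page.
function extractLinks(html: string, baseUrl: string): string[] {
  const hrefPattern = /<a\s[^>]*?href\s*=\s*["']([^"']+)["']/gi
  const links: string[] = []
  let match: RegExpExecArray | null
  while ((match = hrefPattern.exec(html)) !== null) {
    try {
      // Resolve relative links such as "/about" against the page URL
      const url = new URL(match[1], baseUrl)
      // Only keep web pages, skipping mailto:, tel:, javascript: and similar
      if (url.protocol === 'http:' || url.protocol === 'https:') {
        url.hash = '' // fragments point at the same document
        if (links.indexOf(url.href) === -1) links.push(url.href)
      }
    } catch {
      // Ignore hrefs that cannot be parsed as URLs
    }
  }
  return links
}
```

Calling `extractLinks('<a href="/about">About</a>', 'https://nuxtseo.com')` returns `['https://nuxtseo.com/about']`. A crawler would then queue each returned URL and repeat the process.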

There are many categories of web crawler, for example:

Search Engines

These crawlers are accessing your site to find pages that they can "index". An indexed page is one that will appear in the search engine results page (SERP).

Learn more in our meta tags guide.

Social

These crawlers are mostly accessing your site to generate previews when shared on their platform.

See our canonical URLs guide for ensuring consistent social previews.

AI Crawlers

These crawlers are accessing your site to generate content or data for their AI models.

Learn to manage AI crawler access in our security guide.

Malicious Crawlers

While most of these crawlers can be good in the right context, some can be malicious or just not useful.

Malicious crawlers can easily spoof the user agent of a regular browser and ignore your robots rules while they search for vulnerabilities or scrape your content.

# Pretend to be a Chrome browser on Linux and check whether the site leaked its .env file
curl https://nuxtseo.com/.env -H "User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36"
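The request above also shows why the user agent alone cannot be trusted. As a rough sketch, consider a naive check that only recognises crawlers that announce themselves. The `claimsToBeBot` function and its token list are illustrative assumptions, not part of any Nuxt module.

```typescript
// Substrings that some well-behaved crawlers include in their user agent.
// This short list is illustrative only; real lists are much longer.
const KNOWN_BOT_TOKENS = ['googlebot', 'bingbot', 'gptbot', 'twitterbot']

// Returns true when the user agent announces itself as one of the listed bots.
function claimsToBeBot(userAgent: string): boolean {
  const ua = userAgent.toLowerCase()
  return KNOWN_BOT_TOKENS.some(token => ua.indexOf(token) !== -1)
}
```

A request sent with the spoofed Chrome user agent passes this check as a regular visitor, which is why user-agent matching is only useful for crawlers that choose to identify themselves.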

Controlling Crawlers

Your site is most likely already getting thousands of these bot visits.

At a high level, we control these bots using:

Each control mechanism (with the related module in parentheses), followed by when it is best used:

robots.txt (Nuxt Robots)
- Blocking large sections of your site (e.g., /admin/*)
- Managing crawler bandwidth on heavy pages
- Preventing crawling of development assets

<meta name="robots"> (Nuxt Robots)
- Controlling indexing for individual pages
- Managing dynamic content (search results, filtered pages)
- Setting page-specific crawler directives

sitemap.xml (Nuxt Sitemap)
- Helping crawlers discover new pages
- Prioritizing important pages
- Managing large sites with many pages

X-Robots-Tag (Nuxt Robots)
- Controlling non-HTML resources (PDFs, images)
- Managing API responses
- Setting crawler directives for files

<link rel="canonical"> (Nuxt SEO Utils)
- Managing duplicate content from URL parameters
- Consolidating mobile/desktop variants
- Handling cross-domain syndicated content

HTTP Redirects
- Moving pages permanently (301 redirects)
- Preserving SEO value when restructuring
- Managing legacy URLs

Web Application Firewall
- Blocking malicious bots
- Filtering high-volume crawler traffic
- Protecting against content scraping
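Because files such as PDFs and images cannot carry a <meta name="robots"> tag, the X-Robots-Tag response header is the usual way to give them crawler directives. The sketch below shows one way to decide the header value per request path. The `xRobotsTagFor` function and the extension list are assumptions made for illustration.

```typescript
// Hypothetical list of file types we do not want indexed
const NON_HTML_EXTENSIONS = ['.pdf', '.png', '.jpg', '.zip']

// Returns the X-Robots-Tag value for a path, or undefined when no header is needed.
function xRobotsTagFor(pathname: string): string | undefined {
  const lower = pathname.toLowerCase()
  // Compare the end of the path against each extension
  const isNonHtmlFile = NON_HTML_EXTENSIONS.some(ext => lower.slice(-ext.length) === ext)
  return isNonHtmlFile ? 'noindex' : undefined
}
```

Wherever responses are produced, the returned value would be set as the `X-Robots-Tag` header. How that is wired up depends on your server or hosting setup.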

Quick Implementation Guide

Need to set up crawler control quickly? Here are some recipes:

Block a page from being indexed

pages/secret.vue
<script setup>
useSeoMeta({
  robots: 'noindex, follow'
})
</script>

Avoid duplicate content issues

pages/secret.vue
<script setup>
useHead({
  // make sure you use an absolute URL
  link: [{ rel: 'canonical', href: 'https://mysite.com/secret' }]
})
</script>

Block a group of pages

public/robots.txt
User-agent: *
Disallow: /admin
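It helps to know how a compliant crawler reads this rule: Disallow values are matched as path prefixes. The following sketch assumes a simplified matcher (the `isDisallowed` function is hypothetical and ignores Allow rules, wildcards, and longest-match precedence) to show which paths the rule above covers.

```typescript
// Returns true when `path` starts with at least one non-empty Disallow rule.
// Simplified: no Allow rules, no * or $ wildcards, no rule precedence.
function isDisallowed(path: string, disallowRules: string[]): boolean {
  return disallowRules.some(rule => rule !== '' && path.indexOf(rule) === 0)
}

const rules = ['/admin']
const blockedExactly = isDisallowed('/admin', rules)          // true
const blockedChild = isDisallowed('/admin/users', rules)      // true
const blockedLookalike = isDisallowed('/administrator', rules) // true, same prefix
const blockedOther = isDisallowed('/about', rules)            // false
```

Note that `/administrator` is matched as well. If only the directory should be blocked, `Disallow: /admin/` is the narrower rule.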

Redirect to a new URL

nuxt.config.ts
export default defineNuxtConfig({
  routeRules: {
    '/old-url': {
      // keep SEO benefits by using a 301 redirect
      redirect: { to: '/new-url', statusCode: 301 }
    }
  }
})

Why Control Crawlers?

Doing nothing about crawlers is a completely viable solution. You will not be inherently penalized for not managing them.

However, managing crawlers can be beneficial for sites that want to optimize their organic traffic, protect their content, or reduce server load.

Improve organic traffic

Making sure that only the pages that belong on search engine results pages (SERPs) are indexed may improve your organic traffic.

This can especially be a problem when multiple pages are indexed with the same content, leading to duplicate content issues.

For example, having both /about and /about/ without a canonical tag will cause them to be indexed separately, which may cause them to compete against each other.
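As a sketch of how both variants could point at a single canonical URL, the hypothetical `canonicalUrl` helper below normalizes a path before it is used in a canonical tag. It assumes the site's convention is no trailing slash and that query parameters never change the page content. Both are assumptions you would need to check for your own site.

```typescript
// Hypothetical helper: build one canonical URL for all variants of a path.
function canonicalUrl(pathOrUrl: string, siteUrl: string): string {
  const url = new URL(pathOrUrl, siteUrl)
  url.search = '' // assumption: query parameters do not change the content
  url.hash = ''
  // Assumption: the site prefers URLs without a trailing slash
  url.pathname = url.pathname.replace(/\/+$/, '') || '/'
  return url.href
}
```

With this helper, `/about`, `/about/`, and `/about?ref=newsletter` all resolve to `https://mysite.com/about` when `siteUrl` is `https://mysite.com`.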

As search engine crawlers have a crawl budget, telling them how to be more efficient can help them index more of your site more frequently.

Additionally, helping crawlers understand when you move or delete content, using HTTP redirects and status codes, helps ensure you don't lose your search engine ranking.

Protecting Content

Most sites will be operating with different environments such as testing, staging, and production.

By default, search engines will index any public environment they can access. This can lead to duplicate content issues and confusion for end users when these environments appear in the SERP.

Similarly, we may find our authenticated pages being indexed if we have public pages linking to them.
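One possible approach, sketched below, is to derive the robots directive from the current environment so that only production is indexable. The `Environment` type and `robotsDirectiveFor` function are hypothetical names, and how the environment is detected is left to your setup.

```typescript
// Hypothetical environment names, matching the ones mentioned above
type Environment = 'production' | 'staging' | 'testing'

// Only production should be indexed; every other environment opts out.
function robotsDirectiveFor(environment: Environment): string {
  return environment === 'production' ? 'index, follow' : 'noindex, nofollow'
}
```

The returned string has the same format as the `robots` value used with `useSeoMeta` in the recipe above. Keep in mind that robots directives only ask crawlers to stay out of search results; pages that must stay private still need authentication.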

Reducing server load

While each visit costs almost nothing, you are effectively paying for CPU time every time a bot visits your site. Filtering incoming crawler requests can be an effective way to reduce server load.

This can be more apparent on pages that are expensive to render, such as pages with a lot of dynamic content or pages using third-party services.

For example:

  • You may be paying for crawlers to access your Google Maps embed
  • Crawlers may be spending a long time on infinite scroll pages
