
"Lighthouse was unable to download a robots.txt file" despite the file being accessible

I have a NodeJS/NextJS app running at http://www.schandillia.com. The project has a robots.txt file accessible at http://www.schandillia.com/robots.txt. As of now, the file is bare-bones for testing purposes:

User-agent: *
Allow: /

However, when I run a Lighthouse audit on my site, it throws a Crawling and Indexing error saying it couldn't download a robots.txt file. I repeat, the file is available at http://www.schandillia.com/robots.txt.

The project's codebase, should you need to take a look, is up at https://github.com/amitschandillia/proost. The robots.txt file is located at proost/web/static/ but accessible at root thanks to the following in my Nginx config:

# ... the rest of your configuration
  location = /robots.txt {
    proxy_pass http://127.0.0.1:3000/static/robots.txt;
  }

The complete config file is available for review on GitHub at https://github.com/amitschandillia/proost/blob/master/.help_docs/configs/nginx.conf.
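
For reference, here is a quick sanity check from outside the server that the file really is reachable at the root (a minimal sketch, assuming Node 18+ with its built-in fetch, saved as check-robots.mjs):

// check-robots.mjs — confirm robots.txt is served at the site root (Node 18+)
const res = await fetch('http://www.schandillia.com/robots.txt');
console.log('status:', res.status); // expect 200
console.log(await res.text());      // expect the User-agent / Allow lines

If the proxy is working, node check-robots.mjs should print a 200 followed by the two-line file, which is exactly what I see.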

Please advise if there's something I'm overlooking.

asked Jun 04 '19 by TheLearner


1 Answer

TL;DR: Your robots.txt is served fine, but Lighthouse cannot fetch it, because its audit does not work with the connect-src directive of your site's Content Security Policy. This is a known limitation, tracked as Lighthouse issue #4386 and fixed in Chrome 92.


Explanation: Lighthouse attempts to fetch the robots.txt file by way of a script run from the document served at the root of your site. Here is the code it uses to perform this request (found in lighthouse-core):

const response = await fetch(new URL('/robots.txt', location.href).href);

If you try to run this code from your site, you will notice that a “Refused to connect” error is thrown:

Screenshot of the “Refused to connect” error
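
A rough way to reproduce this yourself is to paste the same fetch into the DevTools console while on the site (a sketch; the exact error wording varies by browser):

// Run in the DevTools console on http://www.schandillia.com
fetch(new URL('/robots.txt', location.href).href)
  .then((res) => console.log('robots.txt fetched, status:', res.status))
  .catch((err) => console.error('Request blocked, most likely by the CSP connect-src directive:', err));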

This error happens because the browser enforces the Content Security Policy restrictions from the headers served by your site (split on several lines for readability):

content-security-policy:
    default-src 'self';
    script-src 'self' *.google-analytics.com;
    img-src 'self' *.google-analytics.com;
    connect-src 'none';
    style-src 'self' 'unsafe-inline' fonts.googleapis.com;
    font-src 'self' fonts.gstatic.com;
    object-src 'self';
    media-src 'self';
    frame-src 'self'

Notice the connect-src 'none'; part. Per the CSP spec, it means that no URLs may be loaded via script interfaces from within the served document. In practice, any fetch() made from the page is refused.

This header is explicitly sent by the server layer of your Next.js application, because of the way you configured your Content Security Policy middleware (from commit a6aef0e):

import csp from 'helmet-csp';

server.use(csp({
  directives: {
    defaultSrc: ["'self'"],
    scriptSrc: ["'self'", '*.google-analytics.com'],
    imgSrc: ["'self'", '*.google-analytics.com'],
    connectSrc: ["'none'"],
    styleSrc: ["'self'", "'unsafe-inline'", 'maxcdn.bootstrapcdn.com'], // Remove unsafe-inline for better security
    fontSrc: ["'self'"],
    objectSrc: ["'self'"],
    mediaSrc: ["'self'"],
    frameSrc: ["'self'"]
  }
}));

Solution/Workaround: To solve the problem in the audit report, you can either:

  • wait for (or submit) a fix in Lighthouse
  • use the connect-src 'self' directive instead, which will have the side effect of allowing same-origin HTTP requests from the browser side of your Next.js app (see the sketch below)
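
For the second option, here is a sketch of the adjusted middleware; only the connectSrc line changes from your commit a6aef0e, and you may also need to list any other origins your pages connect to (Google Analytics endpoints, for example):

import csp from 'helmet-csp';

server.use(csp({
  directives: {
    defaultSrc: ["'self'"],
    scriptSrc: ["'self'", '*.google-analytics.com'],
    imgSrc: ["'self'", '*.google-analytics.com'],
    // 'self' allows same-origin fetch/XHR, which is all the Lighthouse audit needs
    connectSrc: ["'self'"],
    styleSrc: ["'self'", "'unsafe-inline'", 'maxcdn.bootstrapcdn.com'],
    fontSrc: ["'self'"],
    objectSrc: ["'self'"],
    mediaSrc: ["'self'"],
    frameSrc: ["'self'"]
  }
}));
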
answered Sep 19 '22 by Eric Redon