
How to work with RobotsTxtMiddleware in the Scrapy framework?

The Scrapy framework has a RobotsTxtMiddleware. It is there to make sure Scrapy respects robots.txt. You need to set ROBOTSTXT_OBEY = True in the settings, and then Scrapy will respect robots.txt policies. I did that and ran the spider. In the debug output I saw a request to http://site_url/robots.txt.
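
This is what I have in the project's settings.py (site_url above is just a placeholder for the real site):

# settings.py of the Scrapy project
ROBOTSTXT_OBEY = True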

  1. What does this mean, and how does it work?
  2. How can I work with the response?
  3. How can I see and understand the rules from robots.txt?
asked May 23 '15 16:05 by Max

1 Answer

It's normal that the spider requests robots.txt; that's where the rules are.

robots.txt is basically a blacklist of URLs that you should not visit/crawl, and it uses a glob/regex kind of syntax to specify the forbidden URLs.

Scrapy will read robots.txt and translate those rules to code. During the crawl, when the spider meets a URL, it first checks against the rules generated from robots.txt whether that URL may be visited. If the URL is not blacklisted by robots.txt, Scrapy will visit it and deliver a Response.
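
As a rough sketch of what that looks like from the spider's side (example.com and the URLs below are just placeholders), with ROBOTSTXT_OBEY = True a disallowed request is dropped by the middleware and never reaches your callback:

import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"
    custom_settings = {"ROBOTSTXT_OBEY": True}

    def start_requests(self):
        # assuming example.com's robots.txt disallows /vote? but not /item?
        yield scrapy.Request("http://example.com/item?id=1")  # allowed, gets downloaded
        yield scrapy.Request("http://example.com/vote?id=1")  # disallowed, filtered out

    def parse(self, response):
        # only responses for URLs allowed by robots.txt arrive here
        self.log("got response for %s" % response.url)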

robots.txt is not only for blacklisting URLs; it can also state the speed at which the crawl may happen. Here is an example robots.txt:

User-Agent: * 
Disallow: /x?
Disallow: /vote?
Disallow: /reply?
Disallow: /submitted?
Disallow: /submitlink?
Disallow: /threads?
Crawl-delay: 30
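
If you want to see and understand the rules yourself, you can parse them with Python's standard urllib.robotparser. This is just an illustration against the example above, not Scrapy's internal parser:

from urllib import robotparser

rules = """User-Agent: *
Disallow: /x?
Disallow: /vote?
Disallow: /reply?
Disallow: /submitted?
Disallow: /submitlink?
Disallow: /threads?
Crawl-delay: 30
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("*", "http://example.com/vote?id=42"))  # False: matches Disallow: /vote?
print(rp.can_fetch("*", "http://example.com/item?id=42"))  # True: not listed
print(rp.crawl_delay("*"))  # 30: the site asks for 30 seconds between requests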
answered Oct 29 '22 05:10 by amirouche