
How to work with RobotsTxtMiddleware in the Scrapy framework?

The Scrapy framework has a RobotsTxtMiddleware. It is there to make sure Scrapy respects robots.txt. You need to set ROBOTSTXT_OBEY = True in the settings, and then Scrapy will respect robots.txt policies. I did that and ran the spider. In the debug output I saw a request to http://site_url/robots.txt.
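
This is what I have in the project's settings.py (site_url above is just a placeholder for the real site):

# settings.py of the Scrapy project
ROBOTSTXT_OBEY = True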

  1. What does this mean, and how does it work?
  2. How can I work with the response?
  3. How can I see and understand the rules from robots.txt?
asked May 23 '15 16:05 by Max

1 Answer

It's normal that the spider requests robots.txt; that's where the rules are.

robots.txt is basically a blacklist of URLs that you should not visit/crawl, and it uses a glob/regex kind of syntax to specify the forbidden URLs.

Scrapy will read robots.txt and translate those rules to code. During the crawl, when the spider meets a URL, it first checks against the rules generated from robots.txt whether that URL may be visited. If the URL is not blacklisted by robots.txt, Scrapy will visit it and deliver a Response.
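
As a rough sketch of what that looks like from the spider's side (example.com and the URLs below are just placeholders), with ROBOTSTXT_OBEY = True a disallowed request is dropped by the middleware and never reaches your callback:

import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"
    custom_settings = {"ROBOTSTXT_OBEY": True}

    def start_requests(self):
        # assuming example.com's robots.txt disallows /vote? but not /item?
        yield scrapy.Request("http://example.com/item?id=1")  # allowed, gets downloaded
        yield scrapy.Request("http://example.com/vote?id=1")  # disallowed, filtered out

    def parse(self, response):
        # only responses for URLs allowed by robots.txt arrive here
        self.log("got response for %s" % response.url)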

robots.txt is not only for blacklisting URLs; it can also state the speed at which the crawl may happen. Here is an example robots.txt:

User-Agent: * 
Disallow: /x?
Disallow: /vote?
Disallow: /reply?
Disallow: /submitted?
Disallow: /submitlink?
Disallow: /threads?
Crawl-delay: 30
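
If you want to see and understand the rules yourself, you can parse them with Python's standard urllib.robotparser. This is just an illustration against the example above, not Scrapy's internal parser:

from urllib import robotparser

rules = """User-Agent: *
Disallow: /x?
Disallow: /vote?
Disallow: /reply?
Disallow: /submitted?
Disallow: /submitlink?
Disallow: /threads?
Crawl-delay: 30
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("*", "http://example.com/vote?id=42"))  # False: matches Disallow: /vote?
print(rp.can_fetch("*", "http://example.com/item?id=42"))  # True: not listed
print(rp.crawl_delay("*"))  # 30: the site asks for 30 seconds between requests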
answered Oct 29 '22 05:10 by amirouche