The Scrapy framework has RobotsTxtMiddleware, which makes sure Scrapy respects robots.txt. You only need to set ROBOTSTXT_OBEY = True in the settings and Scrapy will then respect the robots.txt policies. I did that and ran my spider, and in the debug output I saw a request to http://site_url/robots.txt. Is that expected behaviour?
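For reference, this is roughly the only change I made in my project's settings.py:

# settings.py
# Ask Scrapy to download each site's robots.txt first and honour its rules.
ROBOTSTXT_OBEY = True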
It's normal that the spider requests robots.txt; that's where the rules are. robots.txt is basically a blacklist of URLs that you should not visit/crawl, and it uses a glob/regex-like syntax to specify the forbidden URLs.

Scrapy reads the robots.txt and translates those rules into code. During the crawl, whenever the spider encounters a URL, it first validates it against the rules generated from robots.txt to check that the URL may be visited. If the URL is not blacklisted by robots.txt, Scrapy visits it and delivers a Response.
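As a hypothetical illustration (the host and paths below are placeholders, and I'm assuming the site's robots.txt disallows /forbidden-page), a spider run with ROBOTSTXT_OBEY = True never downloads the forbidden URL; the middleware drops that request before it is ever sent:

import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"
    # Enforce robots.txt for this spider regardless of the project default.
    custom_settings = {"ROBOTSTXT_OBEY": True}
    start_urls = [
        "http://site_url/allowed-page",    # not matched by any Disallow rule
        "http://site_url/forbidden-page",  # assumed to be disallowed by robots.txt
    ]

    def parse(self, response):
        # Only allowed URLs ever reach this callback; requests blocked by
        # robots.txt are filtered out by RobotsTxtMiddleware.
        self.log(f"Got response from {response.url}")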
robots.txt does not only blacklist URLs; it can also state the speed at which the crawl may happen, via the Crawl-delay directive. Here is an example robots.txt:
User-Agent: *
Disallow: /x?
Disallow: /vote?
Disallow: /reply?
Disallow: /submitted?
Disallow: /submitlink?
Disallow: /threads?
Crawl-delay: 30
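To see what these rules mean in practice, you can feed this exact file to Python's standard-library urllib.robotparser (Scrapy uses its own robots.txt parser internally, but the idea is the same). The /item?id=1 URL below is just a made-up path that no rule matches:

import urllib.robotparser

rules = """
User-Agent: *
Disallow: /x?
Disallow: /vote?
Disallow: /reply?
Disallow: /submitted?
Disallow: /submitlink?
Disallow: /threads?
Crawl-delay: 30
""".splitlines()

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("*", "http://site_url/vote?id=1"))  # False: /vote? is disallowed
print(rp.can_fetch("*", "http://site_url/item?id=1"))  # True: no rule matches /item
print(rp.crawl_delay("*"))                             # 30: suggested seconds between requests

As far as I know, Scrapy itself only enforces the Disallow rules; the request rate is controlled separately through its own settings such as DOWNLOAD_DELAY.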