I saw this post about making Scrapy crawl any site without the allowed-domains restriction.
Is there a better way of doing it, such as using a regular expression in the allowed_domains variable, like:
allowed_domains = ["*"]
I hope there is some way to do this other than hacking into the Scrapy framework.
The parse method is in charge of processing the response and returning scraped data and/or more URLs to follow. Other Request callbacks have the same requirements as the Spider class.
They are handled by the default parse() method implemented in that class -- look here to read the source. So, whenever you want to trigger the rules for a URL, you just need to yield a scrapy.Request(url, self.parse), and the Scrapy engine will send a request to that URL and apply the rules to the response.
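As a minimal sketch of that pattern (the spider name, start URL, and parse_item callback are placeholders of my own, not from the original post):

```python
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = "example"                      # placeholder name
    start_urls = ["https://example.com"]  # placeholder start URL

    # Every extracted link is passed to parse_item and followed further.
    rules = (
        Rule(LinkExtractor(), callback="parse_item", follow=True),
    )

    def parse_item(self, response):
        # Return scraped data for this page.
        yield {"url": response.url, "title": response.css("title::text").get()}

    def other_callback(self, response):
        # To run the rules against some URL, route it through the default
        # parse() that CrawlSpider implements:
        yield scrapy.Request("https://example.com/more", self.parse)
```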
Scrapy is a fast high-level web crawling and web scraping framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.
Don't set allowed_domains at all.
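When the attribute is absent, the offsite middleware has nothing to match against and lets every request through. A minimal sketch, with a placeholder spider name and start URL:

```python
import scrapy

class OpenSpider(scrapy.Spider):
    name = "open"                         # placeholder name
    # No allowed_domains attribute: with nothing to match against,
    # the offsite middleware does not filter any request.
    start_urls = ["https://example.com"]  # placeholder start URL

    def parse(self, response):
        yield {"url": response.url}
        # Follow every link on the page, whatever domain it points to.
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)
```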
Look at the get_host_regex() function in this Scrapy file:
https://github.com/scrapy/scrapy/blob/master/scrapy/contrib/spidermiddleware/offsite.py
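That file has since moved to scrapy/spidermiddlewares/offsite.py in newer Scrapy releases. Roughly, the function builds one regex out of allowed_domains and matches every request's host against it; the following is paraphrased from the source as it looked at the time, not a verbatim copy:

```python
import re

def get_host_regex(self, spider):
    """Override this method to implement a different offsite policy."""
    allowed_domains = getattr(spider, 'allowed_domains', None)
    if not allowed_domains:
        # No allowed_domains: the empty pattern matches every host,
        # so no request is filtered out.
        return re.compile('')
    domains = [re.escape(d) for d in allowed_domains if d is not None]
    regex = r'^(.*\.)?(%s)$' % '|'.join(domains)
    return re.compile(regex)
```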
You should deactivate the OffsiteMiddleware, which is a built-in spider middleware in Scrapy. For more information, see http://doc.scrapy.org/en/latest/topics/spider-middleware.html
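A sketch of disabling it in settings.py; note that the middleware's import path depends on your Scrapy version (older releases used scrapy.contrib.spidermiddleware.offsite.OffsiteMiddleware):

```python
# settings.py
SPIDER_MIDDLEWARES = {
    # Setting the value to None disables the middleware entirely.
    # On older Scrapy versions the key is
    # 'scrapy.contrib.spidermiddleware.offsite.OffsiteMiddleware'.
    'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': None,
}
```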