
scrapy allow all domains

Tags: python, scrapy

I saw this post about making Scrapy crawl any site without the allowed_domains restriction.

Is there a better way of doing it, such as using a regular expression in the allowed_domains variable, like this:

allowed_domains = ["*"]

I hope there is a way to do this other than hacking into the Scrapy framework.

asked Mar 03 '12 03:03 by hrishikeshp19



2 Answers

Don't set allowed_domains at all.

Look at the get_host_regex() function in this Scrapy file. When allowed_domains is empty or missing, it returns a pattern that matches every host:

https://github.com/scrapy/scrapy/blob/master/scrapy/contrib/spidermiddleware/offsite.py
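To illustrate why leaving allowed_domains unset works while ["*"] does not, here is a minimal standalone sketch modeled on that function. It is a simplified approximation, not the Scrapy source; the names build_host_regex and is_allowed are hypothetical helpers introduced only for this example.

```python
import re
from urllib.parse import urlparse


def build_host_regex(allowed_domains):
    """Approximate the offsite host pattern (simplified, not Scrapy's code)."""
    if not allowed_domains:
        # No domains configured: the empty pattern matches any host.
        return re.compile("")
    # Each domain is escaped, so "*" is treated as a literal asterisk,
    # not as a wildcard.
    domains = "|".join(re.escape(d) for d in allowed_domains)
    return re.compile(r"^(.*\.)?(%s)$" % domains)


def is_allowed(url, allowed_domains):
    """Return True if the URL's host passes the (approximated) offsite check."""
    host = urlparse(url).hostname or ""
    return bool(build_host_regex(allowed_domains).search(host))


# Hypothetical URLs, used only to exercise the sketch.
unset_result = is_allowed("http://example.org/page", None)
subdomain_result = is_allowed("http://sub.example.com/", ["example.com"])
offsite_result = is_allowed("http://other.org/", ["example.com"])
star_result = is_allowed("http://example.com/", ["*"])
```

Under these assumptions, an unset or empty allowed_domains lets every host through, whereas ["*"] only matches a host literally named "*", so it would filter out everything rather than allow everything.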

answered Oct 04 '22 22:10 by Shawn Lewis


You should deactivate the offsite middleware, which is a built-in spider middleware in Scrapy. For more information, see http://doc.scrapy.org/en/latest/topics/spider-middleware.html
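A rough sketch of what that could look like in a project's settings.py, assuming a Scrapy release where the class lives at scrapy.spidermiddlewares.offsite.OffsiteMiddleware. The module path is an assumption that depends on your version: older releases used scrapy.contrib.spidermiddleware.offsite, and newer ones may handle offsite filtering elsewhere, so check the docs for the version you run.

```python
# settings.py (sketch; the middleware path is an assumption and
# varies between Scrapy versions)
SPIDER_MIDDLEWARES = {
    # Mapping a middleware class path to None disables it.
    "scrapy.spidermiddlewares.offsite.OffsiteMiddleware": None,
}
```

With the middleware disabled, requests are no longer filtered by allowed_domains, so the spider follows links to any host.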

answered Oct 04 '22 22:10 by Jhon Garside