I have to crawl https://dms.psc.sc.gov/Web/dockets, which uses TLS v1.2, with the Scrapy framework. The request fails to load and raises [<twisted.python.failure.Failure <class 'OpenSSL.SSL.Error'>>].
A related issue is discussed on GitHub at https://github.com/scrapy/scrapy/issues/981, but the fix suggested there did not work for me. I have Scrapy 0.24.5 and Twisted >= 14.
Another site that also uses TLS v1.2 crawls fine, but https://dms.psc.sc.gov does not. How can I solve this?
In general, if you run into an HTTPS problem with Scrapy, the solution is to check which version of Twisted you are using and, if it is not the most recent, update to the most recent Twisted version (as of the time of writing, versions above 14 are confirmed to be significantly better when it comes to SSL).
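One way to see which versions are actually installed is a minimal sketch like the following. It uses only the standard library; the helper names `installed_version` and `major_of` are made up for illustration, and the package names are assumed to match the distribution names on your system.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def major_of(version_string):
    """Extract the leading integer of a version string such as '14.0.2'."""
    return int(version_string.split(".")[0])


# Report the packages involved in Scrapy's HTTPS handling.
for name in ("Scrapy", "Twisted", "pyOpenSSL"):
    found = installed_version(name)
    print(name, found if found is not None else "not installed")
```

If Twisted reports an old major version, upgrading it (and Scrapy) is the first thing to try before touching any SSL settings.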
A PR fixing this problem in Scrapy has already been merged. Recently (in February 2016) there was another pull request fixing a similar bug.
With the most recent Scrapy version I can fetch your page fine, but with older versions the problem still appears.
If you still experience problems after updating Scrapy and Twisted, you may need to subclass ScrapyClientContextFactory; see the answer below for details.
More details are in this GitHub issue.