Crawling SSL site with scrapy

Tags: python, ssl, scrapy

I have to crawl https://dms.psc.sc.gov/Web/dockets, which uses TLS v1.2, with the Scrapy framework. But the request for that URL fails to load and raises [<twisted.python.failure.Failure <class 'OpenSSL.SSL.Error'>>].

A related issue is discussed on GitHub at https://github.com/scrapy/scrapy/issues/981, but what is suggested there did not work for me. I am using Scrapy v0.24.5 and Twisted version >= 14.

When I crawl another site that also uses TLS v1.2 it works, but it does not work for https://dms.psc.sc.gov. How can I solve this issue?

Hassan Raza asked Jun 24 '15 13:06



1 Answer

A PR fixing this problem has already been merged into Scrapy. More recently (in February 2016), another pull request fixed a similar bug.

With the most recent Scrapy version I can fetch your page without problems, but with older versions the problem still appears.

In general, if you run into an HTTPS problem with Scrapy, the solution is:

  • upgrade Scrapy to the newest version
  • check which version of Twisted you use; if it is not the most recent, update to the most recent Twisted version (at the time of writing, versions above 14 are confirmed to be significantly better when it comes to SSL)
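To see which versions you actually have installed before upgrading, something like the following may help. It is only a minimal sketch using the standard library's importlib.metadata (Python 3.8+). The helper names installed_version, leading_numbers and report are made up for this example, and including pyOpenSSL in the list is my assumption, since the error is raised from OpenSSL.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def leading_numbers(version):
    """Turn '14.0.2' into (14, 0, 2); stop at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def report(packages=("Scrapy", "Twisted", "pyOpenSSL")):
    """Print one line per package; the package list is an assumption."""
    for name in packages:
        version = installed_version(name)
        print(name, version or "not installed")
        # The threshold of 14 comes from the advice above, not from any API.
        if name == "Twisted" and version and leading_numbers(version) < (14,):
            print("  Twisted is older than 14; consider upgrading")


if __name__ == "__main__":
    report()
```

If anything turns out to be outdated, pip install --upgrade Scrapy Twisted is the usual way to update both.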

If you still experience problems after updating Scrapy and Twisted, you may need to subclass ScrapyClientContextFactory; see the answer below for details.

More details are in this GitHub issue.

Pawel Miech answered Sep 30 '22 03:09