I'm trying to scrape JavaScript-generated data from a website. To do this, I'm using the Python library requests_html.
Here is my code:
from requests_html import HTMLSession
session = HTMLSession()
url = 'https://myurl'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
payload = {'mylog': 'root', 'mypass': 'root'}
r = session.post(url, headers=headers, verify=False, data=payload)
r.html.render()
load = r.html.find('#load_span', first=True)
print(load.text)
If I don't use the render() function, I can connect to the website, but my scraped data is null (which is expected). When I do use it, I get this error:
pyppeteer.errors.PageError: net::ERR_CERT_COMMON_NAME_INVALID at https://myurl
or
net::ERR_CERT_WEAK_SIGNATURE_ALGORITHM
I assume the verify=False parameter of session.post is ignored by render(). How can I fix this?
Edit: if you want to reproduce the error:
from requests_html import HTMLSession
import requests
session = HTMLSession()
url = 'https://wrong.host.badssl.com'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
r = session.post(url, headers=headers, verify=False)
r.html.render()
load = r.html.find('#content', first=True)
print(load)
The only way is to set the ignoreHTTPSErrors parameter in pyppeteer. The problem is that requests_html doesn't provide any way to set this parameter; in fact, there is an open issue about it. My advice is to ping the developers again by adding another message there, or to submit a pull request with this new feature yourself.
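For reference, here is roughly what that parameter looks like when driving pyppeteer directly, bypassing requests_html entirely. This is only a sketch: it assumes pyppeteer is installed (it downloads its own Chromium on first run), so the import is deferred and the network call is left commented out.

```python
import asyncio

async def fetch_ignoring_ssl(url: str) -> str:
    """Render a page with pyppeteer, ignoring certificate errors.

    Sketch only: requires `pip install pyppeteer`, which downloads
    its own Chromium the first time it runs.
    """
    from pyppeteer import launch  # deferred import: pyppeteer is optional

    # ignoreHTTPSErrors is the launch option that render() would need
    # to forward for verify=False to have any effect.
    browser = await launch(ignoreHTTPSErrors=True, headless=True)
    try:
        page = await browser.newPage()
        await page.goto(url)
        return await page.content()  # HTML after JavaScript has run
    finally:
        await browser.close()

# To actually run it (needs Chromium and network access):
# html = asyncio.get_event_loop().run_until_complete(
#     fetch_ignoring_ssl('https://wrong.host.badssl.com'))
```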
Another way is to use Selenium.
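A minimal sketch of that Selenium alternative, assuming `pip install selenium` plus Chrome and a matching chromedriver are available; `accept_insecure_certs` is Selenium's equivalent of verify=False. The import is deferred and the call commented out so the snippet doesn't require a browser just to load.

```python
def fetch_with_selenium(url: str) -> str:
    """Load a JS-rendered page while accepting invalid certificates.

    Sketch only: requires `pip install selenium` plus Chrome and a
    matching chromedriver on PATH.
    """
    from selenium import webdriver  # deferred import: selenium is optional

    options = webdriver.ChromeOptions()
    options.accept_insecure_certs = True  # sets the acceptInsecureCerts capability
    options.add_argument('--headless')

    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source  # HTML after JavaScript has run
    finally:
        driver.quit()

# Example (needs a browser, so commented out):
# html = fetch_with_selenium('https://wrong.host.badssl.com')
```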
EDIT:
I added verify=False as a feature with a pull request (accepted). It is now possible to ignore the SSL error :)
It's not a parameter of get(); set it when you instantiate the session:
session = HTMLSession(verify=False)
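So with a recent version of requests_html, the whole flow from the question becomes something like the sketch below. It needs `pip install requests-html` (and render() downloads Chromium on first use), so the import is deferred and nothing runs at module level; the login payload is the one from the question.

```python
def render_insecure(url: str, selector: str):
    """POST to url, render the JavaScript, and return one matching element.

    Sketch only: requires `pip install requests-html`; render()
    downloads its own Chromium the first time it is called.
    """
    from requests_html import HTMLSession  # deferred import

    # verify=False now belongs to the session, not to post()/get(),
    # and is forwarded to pyppeteer as ignoreHTTPSErrors.
    session = HTMLSession(verify=False)
    r = session.post(url, data={'mylog': 'root', 'mypass': 'root'})
    r.html.render()
    return r.html.find(selector, first=True)

# Example (needs network and Chromium, so commented out):
# load = render_insecure('https://wrong.host.badssl.com', '#content')
```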