
Python scraping of JavaScript web pages fails for HTTPS pages only

I'm using PyQt5 to scrape web pages, which works great for http:// URLs, but not at all for https:// URLs.

The relevant part of my script is below:

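from PyQt5.QtCore import QTimer, QUrl
from PyQt5.QtWidgets import qApp
from PyQt5.QtWebKitWidgets import QWebPage

class WebPage(QWebPage):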
    def __init__(self):
        super(WebPage, self).__init__()

        self.timerScreen = QTimer()
        self.timerScreen.setInterval(2000)
        self.timerScreen.setSingleShot(True)
        self.timerScreen.timeout.connect(self.handleLoadFinished)

        self.loadFinished.connect(self.timerScreen.start)


    def start(self, urls):
        self._urls = iter(urls)
        self.fetchNext()

    def fetchNext(self):
        try:
            url = next(self._urls)
        except StopIteration:
            return False
        else:
            self.mainFrame().load(QUrl(url))
        return True

    def processCurrentPage(self):
        url = self.mainFrame().url().toString()
        html = self.mainFrame().toHtml()

        # Do stuff with html
        print('loaded: [%d bytes] %s' % (self.bytesReceived(), url))

    def handleLoadFinished(self):
        self.processCurrentPage()
        if not self.fetchNext():
            qApp.quit()

For https:// pages, the script returns a blank page: the only HTML that comes back is <html><head></head><body></body></html>.

I'm at a bit of a loss. Is there a setting that I'm missing related to handling secure URLs?

seymourgoestohollywood asked Oct 01 '16 07:10


1 Answer

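If you're on Windows, please try the steps in this question: "Build PyQt5 on Windows with OpenSSL support?". A blank page for https:// URLs can mean that Qt was built without OpenSSL support, or that it cannot find the OpenSSL libraries at run time.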
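
A quick way to check whether SSL is the problem is sketched below. This is only a minimal sketch, assuming PyQt5 is installed: the function names qt_ssl_report and describe_ssl_errors are made up for illustration, while QSslSocket.supportsSsl() and the sslErrors signal of QNetworkAccessManager are part of Qt itself.

```python
def qt_ssl_report():
    """Return a dict describing Qt's SSL support, or None if PyQt5 is missing."""
    try:
        from PyQt5.QtNetwork import QSslSocket
    except ImportError:
        return None
    return {
        # False means Qt cannot make HTTPS requests at all.
        "supports_ssl": QSslSocket.supportsSsl(),
        # OpenSSL version Qt was compiled against (Qt 5.4 or later).
        "built_against": QSslSocket.sslLibraryBuildVersionString(),
        # OpenSSL version actually loaded at run time (empty if none was found).
        "loaded_at_runtime": QSslSocket.sslLibraryVersionString(),
    }


def describe_ssl_errors(reply, errors):
    """Slot for QNetworkAccessManager.sslErrors: print and return one line per error."""
    url = reply.url().toString()
    lines = ['SSL error for %s: %s' % (url, err.errorString()) for err in errors]
    for line in lines:
        print(line)
    return lines


# Hypothetical wiring inside WebPage.__init__ from the question
# (assumes PyQt5 was built with QtWebKit):
#     self.networkAccessManager().sslErrors.connect(describe_ssl_errors)

if __name__ == "__main__":
    print(qt_ssl_report())
```

If supports_ssl comes back False, the OpenSSL build is the likely culprit. If it is True and describe_ssl_errors prints something, the handshake itself is failing (for example because of a certificate problem), which is a different issue from missing OpenSSL.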

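Have you considered using Beautiful Soup or Scrapy?

I used Beautiful Soup for my project and it worked like a charm. Note that Beautiful Soup only parses HTML: HTTPS is handled by whatever client fetches the page (requests or urllib, for example), and neither Beautiful Soup nor Scrapy runs JavaScript on its own.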

Abhishek Menon answered Oct 15 '22 09:10