
HTML page vastly different when using a headless webkit implementation using PyQT

I was under the impression that a headless WebKit browser built with PyQt would automatically give me the full HTML for any URL, even one with heavy JavaScript. But I am only getting part of it, compared with the page I get when I save it from a Firefox window.

I am using the following code -

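import sys

# Assuming PyQt4, where QWebPage lives in QtWebKit
from PyQt4.QtCore import QTimer, QUrl
from PyQt4.QtGui import QApplication
from PyQt4.QtWebKit import QWebPage

# SEC and USER_AGENT are module-level constants not shown here
# (QTimer.singleShot takes milliseconds, so SEC is presumably 1000)


class JabbaWebkit(QWebPage):
    # 'html' is a class variable

    def __init__(self, url, wait, app, parent=None):
        super(JabbaWebkit, self).__init__(parent)
        JabbaWebkit.html = ''

        if wait:
            # quit after a fixed delay
            QTimer.singleShot(wait * SEC, app.quit)
        else:
            # quit as soon as the initial load finishes
            self.loadFinished.connect(app.quit)

        self.mainFrame().load(QUrl(url))

    def save(self):
        JabbaWebkit.html = self.mainFrame().toHtml()

    def userAgentForUrl(self, url):
        return USER_AGENT


def get_page(url, wait=None):
    # here is the trick how to call it several times
    app = QApplication.instance()  # checks if QApplication already exists

    if not app:  # create QApplication if it doesn't exist
        app = QApplication(sys.argv)

    form = JabbaWebkit(url, wait, app)
    app.aboutToQuit.connect(form.save)
    app.exec_()
    return JabbaWebkit.html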

Can someone see anything obviously wrong with the code?

After running the code through a few URLs, here is one I found that shows the problems I am running into quite clearly - http://www.chilis.com/EN/Pages/menu.aspx

Thanks for any pointers.

user220201 asked Nov 11 '22 20:11


1 Answer

The page contains AJAX code. When the initial load finishes, the page still needs some time to update itself through AJAX, but your code quits as soon as the load finishes.

You should add some code like this to wait for a while and let WebKit process events:

import time

for i in range(200):  # wait about 2 seconds (200 iterations x 10 ms)
    app.processEvents()  # let WebKit handle pending network and JS events
    time.sleep(0.01)
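
A fixed iteration count only approximates the wait, because each `processEvents()` call takes some time of its own. As a rough sketch of the same idea with a real deadline, the helper below (the name `pump_events` and its parameters are hypothetical, not part of PyQt) keeps calling whatever event-processing function it is given until a timeout expires or an optional `done()` check returns true. It assumes Python 3.3+ for `time.monotonic`.

```python
import time


def pump_events(process_events, timeout, done=None, interval=0.01):
    """Call process_events() repeatedly until done() is true or timeout expires.

    process_events: zero-argument callable, e.g. app.processEvents
    timeout: maximum time to keep pumping, in seconds
    done: optional zero-argument callable; pumping stops early once it returns True
    interval: sleep between iterations, in seconds

    Returns True if done() became true, False if the timeout expired.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        process_events()
        if done is not None and done():
            return True
        time.sleep(interval)
    return False
```

Under these assumptions, `pump_events(app.processEvents, 2)` would stand in for the two-second loop, and a `done` callable could stop the wait early once the expected content has appeared in the page.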
user2647646 answered Nov 15 '22 00:11