
Screen scraping with Python

Does Python have screen scraping libraries that offer JavaScript support?

I've been using pycurl for simple HTML requests, and Java's HtmlUnit for more complicated requests requiring JavaScript support.

Ideally I would like to be able to do everything from Python, but I haven't come across any libraries that would allow me to do it. Do they exist?

Marco asked Feb 03 '10 08:02


1 Answer

There are many options for static HTML, which the other responses cover. However, if you need JavaScript support and want to stay in Python, I recommend using WebKit to render the page (including its JavaScript) and then examining the resulting HTML. For example:

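import sys
import signal
from PyQt4.QtCore import QUrl, SIGNAL
from PyQt4.QtGui import QApplication
from PyQt4.QtWebKit import QWebPage

class Render(QWebPage):
    """Load a URL in WebKit, run its JavaScript, and keep the resulting HTML."""

    def __init__(self, url):
        # A QApplication must exist before any other Qt object is created.
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.html = None
        # Restore the default SIGINT handler so Ctrl+C can interrupt the Qt event loop.
        signal.signal(signal.SIGINT, signal.SIG_DFL)
        self.connect(self, SIGNAL('loadFinished(bool)'), self._finished_loading)
        self.mainFrame().load(QUrl(url))
        # Block here until _finished_loading() quits the event loop.
        self.app.exec_()

    def _finished_loading(self, result):
        # result is False if the load failed; the HTML is captured either way.
        self.html = self.mainFrame().toHtml()
        self.app.quit()


if __name__ == '__main__':
    try:
        url = sys.argv[1]
    except IndexError:
        # Parenthesized form works as a print statement (Python 2) and a function call (Python 3).
        print('Usage: %s url' % sys.argv[0])
    else:
        javascript_html = Render(url).html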
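
Once the page has been rendered, "examining the resulting HTML" can be done with any HTML parser. The following is only a minimal sketch of that second step, assuming Python 3 and assuming the rendered markup is already available as an ordinary string. The names `LinkCollector`, `extract_links`, and `rendered_html` are hypothetical, and the sample markup is a stand-in for whatever `Render(url).html` would return for a real page.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(html):
    """Return all link targets found in an HTML string, in document order."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


# Hypothetical stand-in for the string produced by Render(url).html.
rendered_html = (
    "<html><body>"
    '<a href="/first">First</a>'
    '<div id="added-by-js"><a href="/second">Second</a></div>'
    "</body></html>"
)

links = extract_links(rendered_html)
print(links)
```

The standard-library parser is used here only to keep the sketch free of extra dependencies. A third-party parser would work just as well on the same string.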
hoju answered Oct 06 '22 19:10