
Of scraping data, headless browsers, and Python [closed]

So I'm a CS student trying to learn web scraping and all the do's and don'ts that come with it. After messing about with iMacros and a few other data-scraping 'tools', I turned to Python, a language I was not familiar with at the time. I learned about BeautifulSoup and urllib2, and blundered my way through them with the help of Stack Overflow and a few other forums.

Now, using the knowledge I've gained so far, I can scrape most static web pages. However, we all know that the era of static pages is over, as JavaScript reigns supreme on even mediocre websites now.

I would like someone to please guide me in the right direction here. I want to learn a method to load JavaScript-laden web pages, let all the content render, and then somehow get this data into BeautifulSoup. Urllib2 sucks at that. I would also like the ability to fill in forms and navigate through button clicks.

Most of the websites I'm interested in consist of a long list of results that load as you scroll down. Loading them all and then downloading the page doesn't seem to help (I don't know why that is). I'm using Windows 7 and have Python 2.7.5 installed.

I've been told that headless browsers such as Zombie or Ghost would help me, but I really don't know much about those. I tried using libraries such as mechanize, but they don't really cater to what I need, i.e., loading the results, fetching the web page, and feeding it into BS4.

Bearing in mind my minimal knowledge of Python, could anyone help me out here?

Thanks

Hamza Tahir asked Aug 07 '13 11:08

People also ask

Is Python good for data scraping?

Large collection of libraries: Python has a huge collection of libraries such as NumPy, Matplotlib, Pandas, etc., which provide methods and services for many purposes. Hence, it is well suited both for web scraping and for further manipulation of the extracted data.

What is headless browser scraping?

A headless browser is a web browser with no user interface (UI) whatsoever. Instead, it follows instructions defined by software developers in different programming languages. Headless browsers are mostly used for running automated quality assurance tests, or to scrape websites.

Can websites block web scraping?

If you send repetitive requests from the same IP, the website owners can detect your footprint and may block your web scrapers by checking the server log files. To avoid this, you can use rotating proxies. A rotating proxy is a proxy server that allocates a new IP address from a set of proxies stored in the proxy pool.
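The round-robin part of a rotating-proxy setup is simple to sketch. In this example the proxy addresses are placeholders (a real scraper would use a maintained pool or a commercial rotating-proxy service), and the `requests` usage at the bottom is shown only as a comment:

```python
import itertools

# Placeholder proxy pool -- these addresses are made up for illustration.
PROXIES = [
    'http://10.0.0.1:8080',
    'http://10.0.0.2:8080',
    'http://10.0.0.3:8080',
]

# cycle() hands out proxies round-robin, so consecutive requests
# appear to come from different IP addresses.
proxy_pool = itertools.cycle(PROXIES)

def next_proxy_config():
    """Return a requests-style proxies dict using the next proxy in the pool."""
    proxy = next(proxy_pool)
    return {'http': proxy, 'https': proxy}

# Usage with requests (not executed here):
# import requests
# r = requests.get(url, proxies=next_proxy_config())
```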


1 Answer

Selenium WebDriver with PhantomJS can do headless automated browsing of JavaScript-driven web pages. Once both are installed, they can be used like this:

import contextlib
import selenium.webdriver as webdriver
import bs4 as bs

# path to the phantomjs binary (just the name if it is on your PATH)
phantomjs = 'phantomjs'
url = ...
with contextlib.closing(webdriver.PhantomJS(phantomjs)) as driver:
    driver.get(url)
    # page_source holds the DOM after JavaScript has run
    content = driver.page_source
    soup = bs.BeautifulSoup(content, 'html.parser')
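Once the soup is built, extracting data works exactly as it would on a static page. A small sketch (the `.result-item` class is a made-up example; substitute the selectors of the actual site):

```python
from bs4 import BeautifulSoup

def extract_results(html):
    """Pull the text of every hypothetical '.result-item' element."""
    soup = BeautifulSoup(html, 'html.parser')
    return [tag.get_text(strip=True) for tag in soup.select('.result-item')]

# Works the same on driver.page_source or on any HTML string:
sample = '<div class="result-item"> first </div><div class="result-item">second</div>'
```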

On Ubuntu, Selenium and PhantomJS can be installed as follows:

  • sudo pip install -U selenium
  • Download and unpack phantomjs
  • link or move the phantomjs binary to a directory in your PATH

    % cd phantomjs-1.9.0-linux-i686/bin/
    % ln -s "$PWD/phantomjs" ~/bin
    
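For the scroll-to-load result lists mentioned in the question, one common approach (a sketch under my own assumptions, not part of the original answer) is to keep scrolling with `execute_script` until the page height stops growing, and only then hand `driver.page_source` to BeautifulSoup:

```python
import time

def scroll_until_loaded(driver, pause=1.0, max_scrolls=50):
    """Scroll to the bottom repeatedly until the page height stops growing.

    Returns the final page height. `pause` gives the site time to fetch
    and render the next batch of results; `max_scrolls` is a safety cap.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_scrolls):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            # no new results appeared; assume the list is fully loaded
            break
        last_height = new_height
    return last_height
```

After the loop finishes, `driver.page_source` contains all the scroll-loaded results and can be parsed as in the snippet above.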
unutbu answered Nov 14 '22 23:11