So I'm a CS student trying to learn web scraping and all the do's and don'ts that come with it. After messing about with iMacros and a few other data-scraping 'tools', I turned to Python, a language I was not familiar with at the time. I learned about BeautifulSoup and urllib2, and blundered my way through them with the help of Stack Overflow and a few other forums.
Now, using the knowledge I've gained so far, I can scrape most static web pages. However, we all know the era of static pages is over; JavaScript reigns supreme on even mediocre websites now.
I would like someone to point me in the right direction here. I want to learn a way to load JavaScript-heavy web pages, wait for all the content to render, and then feed the resulting HTML into BeautifulSoup. urllib2 can't do that. I would also like to be able to fill in forms and navigate via button clicks.
The websites I'm interested in mostly consist of long lists of results that load as you scroll down. Scrolling through them all and then saving the page doesn't seem to help (I don't know why that is). I'm on Windows 7 and have Python 2.7.5 installed.
I've been told that headless browsers such as Zombie or Ghost would help me, but I really don't know much about them. I've tried libraries such as mechanize, but they don't really do what I need, i.e. loading the results, fetching the rendered page, and feeding it into BS4.
Bearing in mind my minimal knowledge of Python, could anyone help me out here?
Thanks
Large Collection of Libraries: Python has a huge collection of libraries, such as NumPy, Matplotlib, and Pandas, which provide methods and services for many purposes. This makes it well suited to web scraping and to further manipulation of the extracted data.
A headless browser is a web browser with no user interface (UI) whatsoever. Instead, it follows instructions defined by software developers in various programming languages. Headless browsers are mostly used for running automated quality-assurance tests or for scraping websites.
If you send repeated requests from the same IP address, website owners can detect your footprint by checking their server log files and may block your scraper. To avoid this, you can use rotating proxies: a rotating proxy server allocates a new IP address for each request, drawn from a pool of proxies.
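As an illustration (not from the original answer), the rotation itself can be sketched in pure Python with `itertools.cycle`. The proxy addresses below are made-up placeholders, and `next_proxies` is a hypothetical helper; the returned dict follows the mapping format that HTTP libraries such as requests accept for a per-request proxy:

```python
import itertools

# Hypothetical proxy pool; replace with real proxy addresses.
PROXY_POOL = [
    'http://10.0.0.1:8080',
    'http://10.0.0.2:8080',
    'http://10.0.0.3:8080',
]

# cycle() hands out proxies round-robin, so consecutive requests
# leave from different IP addresses.
_rotation = itertools.cycle(PROXY_POOL)

def next_proxies():
    """Return a proxy mapping for the next request."""
    proxy = next(_rotation)
    return {'http': proxy, 'https': proxy}

# Each call rotates to the next address in the pool:
first = next_proxies()
second = next_proxies()
```

A real setup would also drop proxies from the pool when they stop responding, but the round-robin above is the core idea.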
Selenium WebDriver with PhantomJS can do headless automated browsing of JavaScript-driven web pages. Once installed, it can be used like this:
import contextlib
import selenium.webdriver as webdriver
import bs4 as bs

# path to the phantomjs binary (here assumed to be on your PATH)
phantomjs = 'phantomjs'
url = ...

# contextlib.closing ensures the browser process is shut down afterwards
with contextlib.closing(webdriver.PhantomJS(phantomjs)) as driver:
    driver.get(url)               # PhantomJS executes the page's JavaScript
    content = driver.page_source  # the HTML after the scripts have run
soup = bs.BeautifulSoup(content, 'html.parser')
On Ubuntu, they can be installed with
sudo pip install -U selenium
then link or move the phantomjs binary to a directory on your PATH:
% cd phantomjs-1.9.0-linux-i686/bin/
% ln phantomjs ~/bin
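For the infinite-scroll pages mentioned in the question, the driver can scroll the page before `page_source` is read. This is a sketch, not part of the original answer: `scroll_until_stable` is a hypothetical helper that keeps scrolling to the bottom until the document height stops growing, and it only relies on the standard WebDriver `execute_script` method:

```python
import time

def scroll_until_stable(driver, pause=1.0, max_rounds=20):
    """Scroll to the bottom repeatedly until the page height stops growing."""
    last_height = driver.execute_script('return document.body.scrollHeight')
    for _ in range(max_rounds):
        driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
        time.sleep(pause)  # give the page time to load more results
        new_height = driver.execute_script('return document.body.scrollHeight')
        if new_height == last_height:  # nothing new loaded: we are done
            break
        last_height = new_height
```

Call it between `driver.get(url)` and reading `driver.page_source`, so BeautifulSoup sees the fully loaded list rather than just the first batch of results.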