So I'm a CS student trying to learn web scraping and all the do's and don'ts that come with it. After messing about with iMacros and a few other data-scraping 'tools', I turned to Python, a language I was not familiar with at the time. I learned about BeautifulSoup and urllib2, and blundered my way through them with the help of Stack Overflow and a few other forums.
Now, using the knowledge I've gained so far, I can scrape most static web pages. However, we all know the era of static pages is over; JavaScript reigns supreme on even mediocre websites now.
I would like someone to point me in the right direction here. I want to learn a way to load JavaScript-heavy web pages, wait for all the content to render, and then feed the resulting HTML into BeautifulSoup. urllib2 can't do that. I would also like to be able to fill in forms and navigate via button clicks.
The websites I'm interested in mostly consist of long lists of results that load as you scroll down. Scrolling through them all and then saving the page doesn't seem to help (I don't know why that is). I'm on Windows 7 and have Python 2.7.5 installed.
I've been told that headless browsers such as Zombie or Ghost would help me, but I really don't know much about them. I've tried libraries such as mechanize, but they don't really do what I need, i.e. loading the results, fetching the rendered page, and feeding it into BS4.
Bearing in mind my minimal knowledge of Python, could anyone help me out here?
Thanks
Large Collection of Libraries: Python has a huge collection of libraries, such as NumPy, Matplotlib, and Pandas, which provide methods and services for many purposes. This makes it well suited to web scraping and to further manipulation of the extracted data.
A headless browser is a web browser with no user interface (UI) whatsoever. Instead, it follows instructions defined by software developers in various programming languages. Headless browsers are mostly used for running automated quality-assurance tests or for scraping websites.
If you send repeated requests from the same IP address, website owners can detect your footprint by checking their server log files and may block your scraper. To avoid this, you can use rotating proxies: a rotating proxy server allocates a new IP address for each request, drawn from a pool of proxies.
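As an illustration (not from the original answer), the rotation itself can be sketched in pure Python with `itertools.cycle`. The proxy addresses below are made-up placeholders, and `next_proxies` is a hypothetical helper; the returned dict follows the mapping format that HTTP libraries such as requests accept for a per-request proxy:

```python
import itertools

# Hypothetical proxy pool; replace with real proxy addresses.
PROXY_POOL = [
    'http://10.0.0.1:8080',
    'http://10.0.0.2:8080',
    'http://10.0.0.3:8080',
]

# cycle() hands out proxies round-robin, so consecutive requests
# leave from different IP addresses.
_rotation = itertools.cycle(PROXY_POOL)

def next_proxies():
    """Return a proxy mapping for the next request."""
    proxy = next(_rotation)
    return {'http': proxy, 'https': proxy}

# Each call rotates to the next address in the pool:
first = next_proxies()
second = next_proxies()
```

A real setup would also drop proxies from the pool when they stop responding, but the round-robin above is the core idea.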
Selenium WebDriver with PhantomJS can do headless automated browsing of JavaScript-driven web pages. Once installed, it can be used like this:
import contextlib
import selenium.webdriver as webdriver
import bs4 as bs

# path to the phantomjs binary (here assumed to be on your PATH)
phantomjs = 'phantomjs'
url = ...

# contextlib.closing ensures the browser process is shut down afterwards
with contextlib.closing(webdriver.PhantomJS(phantomjs)) as driver:
    driver.get(url)               # PhantomJS executes the page's JavaScript
    content = driver.page_source  # the HTML after the scripts have run
soup = bs.BeautifulSoup(content, 'html.parser')
On Ubuntu, they can be installed with
sudo pip install -U selenium
then link or move the phantomjs binary to a directory on your PATH:
% cd phantomjs-1.9.0-linux-i686/bin/
% ln phantomjs ~/bin
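For the infinite-scroll pages mentioned in the question, the driver can scroll the page before `page_source` is read. This is a sketch, not part of the original answer: `scroll_until_stable` is a hypothetical helper that keeps scrolling to the bottom until the document height stops growing, and it only relies on the standard WebDriver `execute_script` method:

```python
import time

def scroll_until_stable(driver, pause=1.0, max_rounds=20):
    """Scroll to the bottom repeatedly until the page height stops growing."""
    last_height = driver.execute_script('return document.body.scrollHeight')
    for _ in range(max_rounds):
        driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
        time.sleep(pause)  # give the page time to load more results
        new_height = driver.execute_script('return document.body.scrollHeight')
        if new_height == last_height:  # nothing new loaded: we are done
            break
        last_height = new_height
```

Call it between `driver.get(url)` and reading `driver.page_source`, so BeautifulSoup sees the fully loaded list rather than just the first batch of results.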