Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

web scraping dynamic content with python

I'd like to use Python to scrape the contents of the "Were you looking for these authors:" box on web pages like this one: http://academic.research.microsoft.com/Search?query=lander

Unfortunately the contents of the box get loaded dynamically by JavaScript. Usually in this situation I can read the Javascript to figure out what's going on, or I can use an browser extension like Firebug to figure out where the dynamic content is coming from. No such luck this time...the Javascript is pretty convoluted and Firebug doesn't give many clues about how to get at the content.

Are there any tricks that will make this task easy?

like image 600
Jeff Avatar asked Jul 12 '13 06:07

Jeff


People also ask

Can you scrape dynamic content from a website?

There are two approaches to scraping a dynamic webpage: Scrape the content directly from the JavaScript. Scrape the website as we view it in our browser — using Python packages capable of executing the JavaScript.

Can BeautifulSoup scraping dynamic content?

It is ideally not possible because BeautifulSoup is just an HTML parser. So in those scenarios it is better to use Selenium to pull dynamic content. Yes if we cant see tables in the HTML body then those are dynamically generated through scripts.


1 Answers

Instead of trying to reverse engineer it, you can use ghost.py to directly interact with JavaScript on the page.

If you run the following query in a chrome console, you'll see it returns everything you want.

document.getElementsByClassName('inline-text-org');

Returns

[<div class=​"inline-text-org" title=​"University of Manchester">​University of Manchester​</div>, 
 <div class=​"inline-text-org" title=​"University of California Irvine">​University of California ...​</div>​
  etc...

You can run JavaScript through python in a real life DOM using ghost.py.

This is really cool:

from ghost import Ghost
ghost = Ghost()
page, resources = ghost.open('http://academic.research.microsoft.com/Search?query=lander')
result, resources = ghost.evaluate(
    "document.getElementsByClassName('inline-text-org');")
like image 105
Nick C. Avatar answered Oct 10 '22 17:10

Nick C.