web scraping dynamic content with python

I'd like to use Python to scrape the contents of the "Were you looking for these authors:" box on web pages like this one: http://academic.research.microsoft.com/Search?query=lander

Unfortunately, the contents of the box are loaded dynamically by JavaScript. Usually in this situation I can read the JavaScript to figure out what's going on, or I can use a browser extension like Firebug to figure out where the dynamic content is coming from. No such luck this time: the JavaScript is pretty convoluted, and Firebug doesn't give many clues about how to get at the content.

Are there any tricks that will make this task easy?

600

asked Jul 12 '13 06:07

Jeff

1 Answer

Instead of trying to reverse engineer it, you can use ghost.py to directly interact with JavaScript on the page.

If you run the following query in the Chrome console, you'll see it returns everything you want.

document.getElementsByClassName('inline-text-org');

This returns:

[<div class="inline-text-org" title="University of Manchester">University of Manchester</div>, 
 <div class="inline-text-org" title="University of California Irvine">University of California ...</div>
  etc...

With ghost.py you can run that same JavaScript from Python against a real, fully rendered DOM.

This is really cool:

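from ghost import Ghost
# Start a headless browser and load the page so that its JavaScript runs.
ghost = Ghost()
page, resources = ghost.open('http://academic.research.microsoft.com/Search?query=lander')
# Evaluate the same query used in the browser console against the rendered DOM.
result, resources = ghost.evaluate(
    "document.getElementsByClassName('inline-text-org');")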
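If you end up with the rendered HTML as a string rather than live DOM nodes, here is a minimal sketch of pulling the organization names out with only the standard library. It assumes the markup looks like the console output shown earlier, and it reads the title attribute because the visible text can be truncated ("University of California ..."). The names OrgNameParser, extract_org_names, and rendered_html are made up for this illustration and are not part of ghost.py.

```python
from html.parser import HTMLParser


class OrgNameParser(HTMLParser):
    """Collect the title attribute of every element carrying a given class."""

    def __init__(self, class_name="inline-text-org"):
        super().__init__()
        self.class_name = class_name
        self.names = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        # An element may carry several classes, so split before matching.
        classes = (attr_map.get("class") or "").split()
        if self.class_name in classes and attr_map.get("title"):
            self.names.append(attr_map["title"])


def extract_org_names(rendered_html):
    """Return the full organization names found in already-rendered HTML."""
    parser = OrgNameParser()
    parser.feed(rendered_html)
    parser.close()
    return parser.names


# Sample markup copied from the console output in the answer.
sample = (
    '<div class="inline-text-org" title="University of Manchester">'
    'University of Manchester</div>'
    '<div class="inline-text-org" title="University of California Irvine">'
    'University of California ...</div>'
)
print(extract_org_names(sample))
```

This only helps once the JavaScript has already run; feeding it the raw page source from a plain HTTP request would find nothing, since the box is filled in dynamically.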

105

answered Oct 10 '22 17:10

Nick C.

Tags:

python

web-scraping

screen-scraping
