
Reading dynamically generated web pages using python

I am trying to scrape a website using Python and Beautiful Soup. On some sites, image links that are visible in the browser do not appear in the page source. However, Chrome Inspect and Fiddler both show the corresponding markup. What I see in the source code is:

<div id="cntnt"></div> 

But in Chrome Inspect, I can see a whole bunch of HTML/CSS code generated inside this div. Is there a way to load the generated content in Python as well? I am using the regular urllib in Python, and I can get the source, but without the generated part.

I am not a web developer, so I cannot describe the behaviour in better terms. Please feel free to ask for clarification if my question seems vague!

Ajay Nair asked Dec 19 '12 20:12





1 Answer

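You need a JavaScript engine to parse and run the JavaScript code inside the page, because that code is what generates the content you see in Chrome Inspect but not in the raw source. There are a number of headless browsers that can help you: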

http://code.google.com/p/spynner/

http://phantomjs.org/

http://zombie.labnotes.org/

http://github.com/ryanpetrello/python-zombie

http://jeanphix.me/Ghost.py/

http://webscraping.com/blog/Scraping-JavaScript-webpages-with-webkit/
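
To make the difference concrete, here is a minimal sketch that compares the raw source with the JavaScript-rendered page. It uses Selenium with headless Chrome, which is not one of the tools linked above; it is swapped in here only as one commonly available way to drive a real browser from Python, and it assumes Selenium 4 and a local Chrome installation. The function names (fetch_static, fetch_rendered, has_content) are hypothetical helpers written for this illustration, and the element id cntnt is taken from the question.

```python
import urllib.request
from html.parser import HTMLParser


def fetch_static(url):
    """Return the HTML exactly as the server sends it (no JavaScript is run)."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def fetch_rendered(url, element_id, timeout=10):
    """Return the HTML after a headless browser has run the page's JavaScript.

    Assumption: Selenium 4 and Chrome are installed. The imports are local so
    the other helpers in this sketch still work without Selenium.
    """
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Wait until the target element has at least one child element;
        # raises a TimeoutException if that never happens within `timeout`.
        WebDriverWait(driver, timeout).until(
            lambda d: d.find_elements(By.CSS_SELECTOR, f"#{element_id} > *")
        )
        return driver.page_source
    finally:
        driver.quit()


class _ElementProbe(HTMLParser):
    """Detects whether the element with a given id has any children or text."""

    def __init__(self, element_id):
        super().__init__()
        self.element_id = element_id
        self.inside = False
        self.found = False

    def handle_starttag(self, tag, attrs):
        if self.inside:
            self.found = True  # a child element opened inside the target
        elif dict(attrs).get("id") == self.element_id:
            self.inside = True

    def handle_data(self, data):
        if self.inside and data.strip():
            self.found = True  # non-whitespace text inside the target

    def handle_endtag(self, tag):
        # With no child seen yet, the first end tag closes the target itself.
        if self.inside and not self.found:
            self.inside = False


def has_content(html, element_id):
    """Return True if the element with this id is non-empty in `html`."""
    probe = _ElementProbe(element_id)
    probe.feed(html)
    probe.close()
    return probe.found


# Hypothetical usage (the URL is a placeholder, not from the question):
#   has_content(fetch_static("http://example.com/page"), "cntnt")
#   has_content(fetch_rendered("http://example.com/page", "cntnt"), "cntnt")
```

If the first check returns False and the second returns True, the content is being generated by JavaScript, and the string returned by fetch_rendered can be passed to Beautiful Soup in the usual way.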

Andrey Nikishaev answered Oct 06 '22 01:10
