Reading dynamically generated web pages using Python

I am trying to scrape a website using Python and Beautiful Soup. On some sites, the image links are visible in the browser but cannot be seen in the page source. However, using Chrome Inspect or Fiddler, I can see the corresponding markup. What I see in the source code is:

<div id="cntnt"></div>

But in Chrome Inspect, I can see a whole bunch of HTML/CSS code generated within this div. Is there a way to load the generated content in Python as well? I am using the regular urllib in Python and I am able to get the source, but without the generated part.

I am not a web developer, so I am not able to describe the behaviour in better terms. Please feel free to ask for clarification if my question seems vague!

asked Dec 19 '12 20:12

Ajay Nair

1 Answer

You need a JavaScript engine to parse and run the JavaScript code inside the page. There are a number of headless browsers that can help you:

http://code.google.com/p/spynner/

http://phantomjs.org/

http://zombie.labnotes.org/

http://github.com/ryanpetrello/python-zombie

http://jeanphix.me/Ghost.py/

http://webscraping.com/blog/Scraping-JavaScript-webpages-with-webkit/
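As a rough illustration of the idea, here is a minimal sketch in Python 3. It does not use any of the tools listed above: it assumes Selenium with a locally installed Chrome as one possible way to drive a headless browser, and the helper names (`fetch_static_html`, `fetch_rendered_html`, `extract_image_links`) and the `#cntnt *` wait selector are made up for this example. The image links are pulled out with the standard-library HTML parser only to keep the sketch self-contained; the rendered HTML could be handed to Beautiful Soup in the same way.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class ImageLinkParser(HTMLParser):
    """Collect the src attribute of every <img> tag in a document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <img ... />.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.links.append(src)


def extract_image_links(html):
    """Return the image URLs found in an HTML string, in document order."""
    parser = ImageLinkParser()
    parser.feed(html)
    parser.close()
    return parser.links


def fetch_static_html(url):
    """Return the source as sent by the server; no scripts are run."""
    with urlopen(url) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def fetch_rendered_html(url, wait_selector="#cntnt *", timeout=10):
    """Return the DOM after the page's JavaScript has run.

    Assumption: Selenium and Chrome are installed. The default selector
    matches any element inside the div from the question.
    """
    # Imported lazily so the helpers above work without Selenium.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Block until the script has inserted at least one element.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, wait_selector))
        )
        return driver.page_source
    finally:
        driver.quit()
```

Calling `extract_image_links(fetch_static_html(url))` would be expected to come back empty for a page like the one in the question, while `extract_image_links(fetch_rendered_html(url))` should include the links that the script inserted, provided the content appears within the timeout.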

answered Oct 06 '22 01:10

Andrey Nikishaev
