Scrape HTML generated by JavaScript with Python

I need to scrape a site with Python. I get the HTML source with the urllib module, but I also need to scrape some HTML that is generated by a JavaScript function (which is included in the HTML source). On the site, pressing a button makes this function output some HTML. How can I "press" this button from Python code? Can Scrapy help me? I captured the POST request with Firebug, but when I try to pass it in the URL I get a 403 error. Any suggestions?

468

asked Jan 27 '10 16:01

hymloth

2 Answers

In Python, I think Selenium 1.0 is the way to go. It’s a library that allows you to control a real web browser from your language of choice.

You need to have the web browser in question installed on the machine your script runs on, but it looks like the most reliable way to programmatically interrogate websites that use a lot of JavaScript.
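As a rough illustration of "pressing" the button from Python, here is a minimal sketch using the WebDriver API found in current Selenium releases (not Selenium 1.0 itself). The function name, the CSS selectors, and the choice of Firefox are assumptions made for the example; they are not taken from the site in the question.

```python
def fetch_html_after_click(url, button_selector, result_selector, timeout=10):
    """Open url in a real browser, click a button, and return the rendered HTML."""
    # Imported inside the function so the module still loads on machines
    # where Selenium is not installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    # Assumption: Firefox and its driver are installed on this machine.
    driver = webdriver.Firefox()
    try:
        driver.get(url)
        # "Press" the button that triggers the JavaScript function.
        driver.find_element(By.CSS_SELECTOR, button_selector).click()
        # Wait until the JavaScript-generated element appears in the DOM.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, result_selector))
        )
        return driver.page_source
    finally:
        # Always close the browser, even if the click or the wait fails.
        driver.quit()
```

A call would look something like fetch_html_after_click('http://example.com/', '#go', '#result'), where the URL and both selectors are placeholders for whatever the real page uses.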

158

answered Sep 20 '22 09:09

Paul D. Waite

Since there is no comprehensive answer here, I'll go ahead and write one.

To scrape JS-rendered pages, we need a browser that has a JavaScript engine (i.e., one that supports JavaScript rendering).

Options like Mechanize and urllib2 will not work, since they DO NOT support JavaScript.
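To see why, here is a small sketch using only the standard library. The page markup is made up for the example: the element with id "result" is created by a script, so a tool that only reads the markup the server sent never sees it.

```python
from html.parser import HTMLParser

# Hypothetical page: <div id="result"> only exists after a browser
# runs showResult(), e.g. when the button is clicked.
RAW_HTML = """
<html><body>
  <button id="go" onclick="showResult()">Go</button>
  <script>
    function showResult() {
      var div = document.createElement('div');
      div.id = 'result';
      document.body.appendChild(div);
    }
  </script>
</body></html>
"""


class IdCollector(HTMLParser):
    """Collect the id attribute of every element in the static markup."""

    def __init__(self):
        super().__init__()
        self.ids = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "id" and value:
                self.ids.add(value)


def static_ids(html):
    """Return the ids visible to a parser that does not execute JavaScript."""
    parser = IdCollector()
    parser.feed(html)
    parser.close()
    return parser.ids


if __name__ == "__main__":
    # Only the button is found; the script-generated div is not.
    print(sorted(static_ids(RAW_HTML)))
```

The parser treats the script body as plain text, so the only id it reports is the button's. Getting at "result" requires something that actually runs the script.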

So here's what you do:

Set up PhantomJS to run with Selenium. After installing the dependencies for both of them (refer to this), you can use the following code as an example to fetch the fully rendered page.

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.PhantomJS()
driver.get('http://jokes.cc.com/')
# page_source holds the HTML after the browser has rendered the page
soupFromJokesCC = BeautifulSoup(driver.page_source, 'html.parser')
driver.save_screenshot('screen.png')  # save a screenshot to disk

driver.quit()

answered Sep 21 '22 09:09

bholagabbar

Tags: python, javascript, browser, screen-scraping
