 

Scrape HTML generated by JavaScript with Python

I need to scrape a site with Python. I obtain the source HTML with the urllib module, but I also need to scrape some HTML that is generated by a JavaScript function (which is included in the HTML source). What this function does "in" the site is that when you press a button, it outputs some HTML code. How can I "press" this button with Python code? Can Scrapy help me? I captured the POST request with Firebug, but when I try to pass it on the URL I get a 403 error. Any suggestions?

hymloth asked Jan 27 '10 16:01




2 Answers

In Python, I think Selenium 1.0 is the way to go. It’s a library that allows you to control a real web browser from your language of choice.

You need to have the web browser in question installed on the machine your script runs on, but it looks like the most reliable way to programmatically interrogate websites that use a lot of JavaScript.
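For what it's worth, here is a minimal sketch of "pressing" a button from Python with Selenium. It uses the current WebDriver API (Selenium 4 with headless Chrome), not the Selenium 1.0 API, and that choice is an assumption on my part. The function name click_and_get_html and the selectors passed to it are hypothetical placeholders; you would substitute whatever matches the real page.

```python
def click_and_get_html(url, button_selector, result_selector, timeout=10):
    """Open url, click a button, wait for the generated element, return the page HTML."""
    # Imported inside the function so the module loads even without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")  # run Chrome without opening a window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        wait = WebDriverWait(driver, timeout)
        # Wait until the button can be clicked, then click it like a user would.
        button = wait.until(
            EC.element_to_be_clickable((By.CSS_SELECTOR, button_selector))
        )
        button.click()
        # Wait for the element that the page's JavaScript is expected to create.
        wait.until(
            EC.presence_of_element_located((By.CSS_SELECTOR, result_selector))
        )
        return driver.page_source
    finally:
        driver.quit()  # always close the browser, even if a wait times out
```

You would call it along the lines of click_and_get_html(page_url, "#some-button", "#generated-content"), where both CSS selectors are made up here. The explicit waits matter because the generated HTML may not exist yet at the moment the click returns.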

Paul D. Waite answered Sep 20 '22 09:09



Since there is no comprehensive answer here, I'll go ahead and write one.

To scrape JavaScript-rendered pages, we need a browser that has a JavaScript engine (i.e., one that supports JavaScript rendering).

Options like Mechanize and urllib2 will not work, since they do not execute JavaScript.
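To make that concrete, here is a small sketch using only the standard library. The page in RAW_HTML is invented for illustration: it stands in for what a plain HTTP fetch would return from a site whose button fills a div via JavaScript. Parsing it shows that the script source is present but the generated content is not, because nothing has executed the script.

```python
from html.parser import HTMLParser

# Hypothetical server response: the <div> is empty until a browser runs showResults().
RAW_HTML = """
<html><body>
<button onclick="showResults()">Show</button>
<div id="results"></div>
<script>
function showResults() {
  document.getElementById("results").innerHTML = "<p>generated row</p>";
}
</script>
</body></html>
"""


class TextInDiv(HTMLParser):
    """Collect the text found inside <div id="results">."""

    def __init__(self):
        super().__init__()
        self.inside = False
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "div" and dict(attrs).get("id") == "results":
            self.inside = True

    def handle_endtag(self, tag):
        if tag == "div":
            self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.text.append(data)


def results_text(html):
    """Return the text inside the results div of the given HTML string."""
    parser = TextInDiv()
    parser.feed(html)
    return "".join(parser.text).strip()


static_text = results_text(RAW_HTML)
print(repr(static_text))             # the div is empty in the static HTML
print("generated row" in RAW_HTML)   # the text exists only inside the script source
```

The same thing happens with any parser that works on the raw response, which is why a tool that actually runs the script is needed.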

So here's what you do:

Set up PhantomJS to run with Selenium. After installing the dependencies for both of them (refer to this), you can use the following code as an example to fetch the fully rendered page.

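from bs4 import BeautifulSoup
from selenium import webdriver

# webdriver.PhantomJS() exists in Selenium 3.x and earlier; it was removed in Selenium 4
driver = webdriver.PhantomJS()
driver.get('http://jokes.cc.com/')
# page_source returns the DOM as the browser currently has it, after scripts have run
soupFromJokesCC = BeautifulSoup(driver.page_source, 'html.parser')
driver.save_screenshot('screen.png')  # save a screenshot to disk

driver.quit()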
bholagabbar answered Sep 21 '22 09:09
