
Python Scraping JavaScript using Selenium and Beautiful Soup

I'm trying to scrape a JavaScript-enabled page using BeautifulSoup and Selenium. I have the following code so far, but it somehow doesn't pick up the JavaScript-generated content (and returns a null value). In this case I'm trying to scrape the Facebook comments at the bottom of the page. (Inspect Element shows their class as postText.)
Thanks for the help!

from selenium import webdriver  
from selenium.common.exceptions import NoSuchElementException  
from selenium.webdriver.common.keys import Keys  
import BeautifulSoup

browser = webdriver.Firefox()  
browser.get('http://techcrunch.com/2012/05/15/facebook-lightbox/')  
html_source = browser.page_source  
browser.quit()

soup = BeautifulSoup.BeautifulSoup(html_source)  
comments = soup("div", {"class":"postText"})  
print comments
Jay Setti asked Jan 25 '13 20:01


People also ask

Can I use Selenium and BeautifulSoup together?

The combination of Beautiful Soup and Selenium handles dynamic scraping. Selenium automates web browser interaction from Python, so content rendered by JavaScript can be made available by automating button clicks with Selenium and then extracted with Beautiful Soup.

How do you use BeautifulSoup in Python with Selenium?

Install both packages with pip from the terminal, then start by importing the libraries you need. A typical setup uses re, the regex module, to pick out links from the Beautiful Soup results, plus the webdriver submodule from selenium and the Service class from selenium's Chrome webdriver package to run the browser.

Can Selenium scrape JavaScript?

The Selenium browser driver is typically used to scrape data from dynamic websites that use JavaScript (although it can scrape data from static websites too). The use of JavaScript can vary from simple form events to single page apps that download all their content after loading.

Can you use BeautifulSoup for JavaScript?

BeautifulSoup does not fetch pages itself; it parses HTML you hand it, usually the page source downloaded with requests or urllib. That source is the markup as the server sent it, before any script has run. Rendering JavaScript needs a browser engine and some time for the scripts to execute, and BeautifulSoup provides neither. Hence BeautifulSoup on its own cannot scrape JavaScript-generated content.
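
A small sketch of that limitation, assuming the bs4 package is installed. The HTML string is a made-up page, not taken from any real site: its script would create a postText element in a browser, but BeautifulSoup only sees the script as text.

```python
from bs4 import BeautifulSoup

# Hypothetical page: the comment <div> only exists after the script has run.
RAW_HTML = """
<html><body>
  <div id="comments"></div>
  <script>
    document.getElementById("comments").innerHTML =
      '<div class="postText">First!</div>';
  </script>
</body></html>
"""

soup = BeautifulSoup(RAW_HTML, "html.parser")

# The script body is kept as plain text and never executed, so the parsed
# tree contains no element with class "postText".
static_matches = soup.find_all("div", class_="postText")
print(len(static_matches))
```

The empty container with id "comments" is still found, which is why a scraper can appear to work while returning no data.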


1 Answer

There are some mistakes in your code that are fixed below: BeautifulSoup is imported from the bs4 package, and the parser is passed explicitly. However, the class "postText" must be generated elsewhere, since it does not appear in the page source this code retrieves. My revised version of your code was tested and works on multiple websites.

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup  # BeautifulSoup 4 is imported from the bs4 package

browser = webdriver.Firefox()
browser.get('http://techcrunch.com/2012/05/15/facebook-lightbox/')
html_source = browser.page_source
browser.quit()

# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(html_source, 'html.parser')
# class "postText" is not defined in the source code
comments = soup.find_all('div', {'class': 'postText'})
print(comments)
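
One possible reason the element is missing is that page_source is read as soon as get() returns, before the comment widget has rendered, and embedded comment widgets are often served inside an iframe. The following is a minimal sketch under those assumptions, not a verified fix for this page: the function names are hypothetical, and treating the first iframe as the comments frame is a guess you would need to check in the inspector.

```python
from bs4 import BeautifulSoup


def extract_comments(html_source, css_class="postText"):
    """Return the text of every <div> carrying css_class in html_source."""
    soup = BeautifulSoup(html_source, "html.parser")
    return [div.get_text(strip=True)
            for div in soup.find_all("div", class_=css_class)]


def fetch_rendered_comments(url, css_class="postText", timeout=15):
    """Load url in Firefox, wait for the comments to render, then parse them."""
    # Imported here so extract_comments can be used without a browser setup.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    browser = webdriver.Firefox()
    try:
        browser.get(url)
        wait = WebDriverWait(browser, timeout)
        # Assumption: the comments live in the first <iframe> on the page.
        wait.until(EC.frame_to_be_available_and_switch_to_it(
            (By.TAG_NAME, "iframe")))
        # Block until at least one matching element exists in that frame.
        wait.until(EC.presence_of_element_located((By.CLASS_NAME, css_class)))
        # page_source is read only after the wait has succeeded.
        return extract_comments(browser.page_source, css_class)
    finally:
        # Close the browser even if a wait times out.
        browser.quit()
```

Calling fetch_rendered_comments with the article URL would return a list of comment strings. If the frame or the element never appears within the timeout, WebDriverWait raises TimeoutException, which at least tells you the assumption about the iframe or the class name is wrong.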
user3186527 answered Sep 20 '22 19:09