 

Selenium download full HTML page

I am learning to use Python Selenium and BeautifulSoup for web scraping. Currently, I am trying to scrape the hot searches on Google Trends: http://www.google.com/trends/hottrends#pn=p5

This is my current code. However, I realized the full HTML is not downloaded and I only have content from the most recent few dates. What can I do to rectify this problem?

from selenium import webdriver
from bs4 import BeautifulSoup

googleURL = "http://www.google.com/trends/hottrends#pn=p5"

# Open the page in Firefox and grab the rendered HTML
browser = webdriver.Firefox()
browser.get(googleURL)
content = browser.page_source

# Parse with an explicit parser and print the result
soup = BeautifulSoup(content, "html.parser")
print(soup)
asked May 17 '13 07:05 by user2392965

People also ask

How do I save a webpage using Selenium?

To save a page, first obtain the page source behind the webpage with Selenium's page_source property. Then open a file with the codecs.open method in write mode ("w") with utf-8 encoding, and write the source to it.
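For example, a minimal sketch of that approach (the output filename page.html is just an illustration):

import codecs
from selenium import webdriver

browser = webdriver.Firefox()
browser.get("http://www.google.com/trends/hottrends#pn=p5")

# Write the rendered page source to a local UTF-8 encoded file
with codecs.open("page.html", "w", encoding="utf-8") as f:
    f.write(browser.page_source)

browser.quit()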

Does Selenium work with HTML?

Selenium is a browser automation tool with Python bindings. You can use it to grab the HTML code that webpages are made of: HyperText Markup Language (HTML).


1 Answer

Users add more content to the page (from previous dates) by clicking the <div onclick="control.moreData()" id="moreLink">More...</div> element at the bottom of the page.

So to get your desired content, you could use Selenium to click the id="moreLink" element or execute some JavaScript to call control.moreData(); in a loop.

For example, if you want to get all content as far back as Friday, February 15, 2013 (it looks like a date string in this format exists for every block of loaded content), your Python might look something like this:

import time

content = browser.page_source
# Keep requesting more data until the target date shows up in the page source
while "Friday, February 15, 2013" not in content:
    browser.execute_script("control.moreData();")
    time.sleep(1)  # give the newly requested content a moment to load
    content = browser.page_source
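If you would rather click the More... element than call the JavaScript yourself, a sketch of that alternative might look like this (it assumes the id="moreLink" element remains present while there is more content to load):

import time

content = browser.page_source
# Click "More..." until the target date appears in the page source
while "Friday, February 15, 2013" not in content:
    browser.find_element_by_id("moreLink").click()
    time.sleep(1)  # give the newly loaded content a moment to arrive
    content = browser.page_source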

EDIT:

If you disable JavaScript in your browser and reload the page, you will see that there is no "trends" content at all. What that tells me is that those items are loaded dynamically; they are not part of the HTML document which is downloaded when you open the page. Selenium's .get() waits for the HTML document to load, but not for all the JavaScript to finish. There's no telling whether async JS will complete before or after any other event; it completes when it's ready, and that could be different every time. That would explain why you might sometimes get all, some, or none of that content when you call browser.page_source: it depends on how far the async JS happens to have gotten at that moment.

So, after opening the page, you might try waiting a few seconds before getting the source, giving the JS that loads the content time to complete.

import time

browser.get(googleURL)
time.sleep(3)  # give the async JS a few seconds to load the content
content = browser.page_source
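A fixed sleep is a bit of a guess, though: too short and the content still is not there, too long and you wait for nothing. If there is an element you know only appears once the dynamic content has loaded, an explicit wait is usually more dependable. Here is a sketch using Selenium's WebDriverWait, assuming (purely for illustration) that the id="moreLink" element is rendered along with the trends content and that 20 seconds is a generous enough timeout:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

browser.get(googleURL)

# Block until the "More..." link is present in the DOM, or raise TimeoutException
WebDriverWait(browser, 20).until(
    EC.presence_of_element_located((By.ID, "moreLink"))
)
content = browser.page_source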
answered Sep 30 '22 11:09 by Dingredient