
Python - Easy way to scrape Google, download top N hits (entire .html documents) for given search?

Is there an easy way to scrape Google and write the text (just the text) of the top N (say, 1000) .html (or whatever) documents for a given search?

As an example, imagine searching for the phrase "big bad wolf" and downloading just the text from the top 1000 hits -- i.e., actually downloading the text from those 1000 web pages (but just those pages, not the entire site).

I'm assuming this would use the urllib2 library? I use Python 3.1 if that helps.

Georgina asked Mar 16 '11 05:03



1 Answer

Check out BeautifulSoup for scraping the content out of web pages. It is designed to tolerate malformed HTML, which helps because not every result page is well formed. You should be able to:

  • Request http://www.google.ca/search?q=QUERY_HERE
  • Extract and follow result links using BeautifulSoup (result links appear to be marked with class="r")
  • Extract text from result pages using BeautifulSoup
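
The three steps above can be sketched roughly as follows. This is a minimal illustration, not a tested scraper: it assumes Python 3 with the third-party beautifulsoup4 package installed, uses urllib.request (the Python 3 counterpart of urllib2), and takes the class="r" marker from the observation above, which may not match the markup Google serves today. The function names, the limit parameter, and the User-Agent string are made up for the example.

```python
import urllib.parse
import urllib.request

from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4


def fetch(url, timeout=10):
    """Download a URL and return its body decoded as text."""
    # Assumption: a browser-like User-Agent; some servers reject urllib's default.
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


def extract_result_links(html, result_class="r"):
    """Return hrefs of links that carry, or sit inside an element with, result_class."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for node in soup.find_all(class_=result_class):
        # The class may be on the <a> itself or on a wrapper such as <h3>.
        anchor = node if node.name == "a" else node.find("a", href=True)
        if anchor is not None and anchor.get("href"):
            links.append(anchor["href"])
    return links


def extract_text(html):
    """Return the visible text of a page, without script or style contents."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style"]):
        tag.decompose()
    return soup.get_text(separator=" ", strip=True)


def scrape(query, limit=10):
    """Fetch one results page for query and return text from up to limit hits."""
    url = "http://www.google.ca/search?" + urllib.parse.urlencode({"q": query})
    texts = []
    for link in extract_result_links(fetch(url))[:limit]:
        try:
            texts.append(extract_text(fetch(link)))
        except (OSError, ValueError):
            # Skip pages that fail to download or links that are not absolute URLs.
            continue
    return texts
```

Calling scrape("big bad wolf", limit=10) would only cover the first results page; reaching 1000 hits would require paging through results, which this sketch does not attempt.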
Cody answered Oct 05 '22 07:10
