
Scraping and parsing Google search results using Python


I previously asked a question about the general idea of crawling and saving webpages. Part of that original question was: how to crawl and save a lot of "About" pages from the Internet.

With some further research, I found some options to go ahead with, for both scraping and parsing (listed at the bottom).

Today, I ran into a Ruby discussion about how to scrape Google search results. This offers a great alternative for my problem, since it saves all the effort of the crawling part.

The new question is: in Python, how do I scrape Google search results for a given keyword, in this case "About", and finally get the links for further parsing? What are the best methods and libraries to go with, measured by ease of learning and ease of implementation?

P.S. One website implements exactly the same thing, but it is closed-source and asks for money for more results. I'd prefer to do it myself if nothing open-source is available, and learn more Python in the meanwhile.

Oh, and by the way, advice on parsing the links out of the search results would be nice too. Again: easy to learn and easy to implement. I just started learning Python. :P


Final update: problem solved. The code below uses xgoogle; please read the note in the section below to get xgoogle working.

import time, random
from xgoogle.search import GoogleSearch, SearchError

f = open('a.txt', 'wb')

for i in range(0, 2):
    wt = random.uniform(2, 5)
    gs = GoogleSearch("about")
    gs.results_per_page = 10
    gs.page = i
    results = gs.get_results()
    # Try not to annoy Google, with a random short wait
    time.sleep(wt)
    print 'This is the %dth iteration and waited %f seconds' % (i, wt)
    for res in results:
        f.write(res.url.encode("utf8"))
        f.write("\n")

print "Done"
f.close()

Note on xgoogle (answered below by Mike Pennington): the latest version from its GitHub repository no longer works out of the box, probably due to changes in Google's search results. Two replies (a b) on the tool's home page give a fix, and with that tweak applied it currently still works. But it may stop working again some day if Google changes or blocks things.
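If xgoogle breaks again, one fallback is to fetch the results page yourself and parse it. Below is a rough, hedged sketch using the requests and BeautifulSoup libraries; the URL parameters and the "/url?q=" link pattern are assumptions based on how Google's markup has looked at various times (it changes frequently, and scraping it may violate Google's terms), so treat this as an illustration rather than a drop-in replacement.

import time, random
import requests
from bs4 import BeautifulSoup

def google_links(query, page=0):
    # NOTE: parameter names and link format are assumptions; adjust as needed.
    params = {'q': query, 'start': page * 10}
    headers = {'User-Agent': 'Mozilla/5.0'}  # Google tends to block the default UA
    resp = requests.get('https://www.google.com/search',
                        params=params, headers=headers)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, 'html.parser')
    links = []
    for a in soup.find_all('a', href=True):
        href = a['href']
        # Result links are often wrapped as /url?q=<target>&...
        if href.startswith('/url?q='):
            links.append(href[len('/url?q='):].split('&')[0])
    return links

if __name__ == '__main__':
    for url in google_links('about'):
        print(url)
    time.sleep(random.uniform(2, 5))  # be polite between requests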


Resources known so far:

  • For scraping, Scrapy seems to be a popular choice, and a webapp called ScraperWiki is very interesting; there is another project that extracts its library for offline/local usage. Mechanize was brought up several times in different discussions too.

  • For parsing HTML, BeautifulSoup seems to be one of the most popular choices. lxml too, of course (see the sketch after this list).
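As a concrete illustration of the parsing side, here is a minimal sketch of pulling links out of a page with lxml; the file name 'results.html' is a hypothetical saved search-results page, not anything from the original question.

from lxml import html

# Parse a hypothetical saved results page and print every link target.
tree = html.parse('results.html')
for href in tree.xpath('//a/@href'):
    print(href)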

asked Oct 12 '11 by Flake



2 Answers

You may find xgoogle useful... much of what you seem to be asking for is there...

answered Oct 20 '22 by Mike Pennington


There is the twill library for emulating a browser. I used it when I needed to log in with a Google email account. While it's a great tool with a great idea, it's pretty old and seems to lack support nowadays (the latest version was released in 2007). It might be useful if you want to retrieve results that require cookie handling or authentication; twill is likely one of the best choices for those purposes. BTW, it's based on mechanize.
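If you go the twill route, its command-style API looks roughly like the sketch below. The login URL and form/field names are hypothetical placeholders (Google's actual login form has changed many times), so this only shows the shape of the API, not a working login.

from twill.commands import go, fv, submit, show

go('https://accounts.example.com/login')  # hypothetical login page
fv('1', 'email', 'user@example.com')      # form 1, field 'email' (assumed names)
fv('1', 'password', 'secret')
submit()
show()  # dump the HTML of the page we ended up on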

As for parsing, you are right: BeautifulSoup and Scrapy are great. One of the cool things about BeautifulSoup is that it can handle invalid HTML (unlike Genshi, for example).
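To illustrate that point, here is a small sketch (using the bs4 package; the broken markup is a made-up example) showing BeautifulSoup coping with unclosed tags:

from bs4 import BeautifulSoup

# Malformed HTML: unclosed <p> and <a> tags, no closing </html>.
broken = "<html><body><p>first<p>second <a href='http://example.com'>link</body>"
soup = BeautifulSoup(broken, 'html.parser')

print(soup.find('a')['href'])                       # -> http://example.com
print([p.get_text() for p in soup.find_all('p')])   # both paragraphs recovered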

answered Oct 20 '22 by ikostia