I previously asked a question about the general idea of crawling and saving web pages. Part of the original question was: how do I crawl and save a lot of "About" pages from the Internet?
With some further research, I found some options to go ahead with, for both scraping and parsing (listed at the bottom).
Today I ran into a Ruby discussion about how to scrape Google search results. It suggests a great alternative for my problem that saves all the effort on the crawling part.
The new question is: in Python, how do I scrape Google search results for a given keyword (in this case "About") and get the links for further parsing? What are the best methods and libraries to go with, in terms of being easy to learn and easy to implement?
P.S. This website implements exactly the same thing, but it is closed source and asks for money for more results. I'd prefer to do it myself if nothing open source is available, and learn more Python in the meantime.
Oh, and by the way, advice on parsing the links out of the search results would be nice too, if any. Again: easy to learn and easy to implement. I just started learning Python. :P
Final update: problem solved. The code below uses xgoogle; please read the note in the section below to get xgoogle working.
import time, random
from xgoogle.search import GoogleSearch, SearchError

f = open('a.txt', 'wb')
for i in range(0, 2):
    wt = random.uniform(2, 5)
    gs = GoogleSearch("about")
    gs.results_per_page = 10
    gs.page = i
    results = gs.get_results()
    # Try not to annoy Google, with a random short wait
    time.sleep(wt)
    print 'This is the %dth iteration and waited %f seconds' % (i, wt)
    for res in results:
        f.write(res.url.encode("utf8"))
        f.write("\n")

print "Done"
f.close()
Note on xgoogle (answered below by Mike Pennington): the latest version from its GitHub does not work by default, probably due to changes in Google's search results. These two replies (a, b) on the tool's home page give a solution, and it currently still works with that tweak. But some day it may stop working again because of Google's changes or blocking.
Resources known so far:
For scraping, Scrapy seems to be a popular choice, and a web app called ScraperWiki is very interesting; there is another project that extracts its library for offline/local use. Mechanize was brought up quite a few times in different discussions as well.
For parsing HTML, BeautifulSoup seems to be one of the most popular choices. Of course, lxml too.
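If you end up with raw result pages saved on disk, a minimal BeautifulSoup sketch for pulling the links out might look like the following. This is only a sketch: the file name results.html is a placeholder, and Google's markup changes often, so in practice you would narrow the match beyond plain anchor tags.

# A minimal sketch: pull every link out of a saved search-results page.
# Assumes BeautifulSoup 4 (bs4) and a local file called results.html.
from bs4 import BeautifulSoup

html = open('results.html').read()
soup = BeautifulSoup(html, 'html.parser')

for a in soup.find_all('a', href=True):
    print(a['href'])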
There are two ways to scrape and dissect Google search results: the hard way and the easy way. The hard way involves writing code to use Selenium or a similar framework to drive a headless browser instance.
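As a rough, hedged sketch of that hard way, assuming Selenium 4 with a Chrome driver installed: the generic 'a' selector below is deliberately broad and would need narrowing to Google's current result-link markup.

# Drive a headless browser and read links from the rendered results page.
from selenium import webdriver
from selenium.webdriver.common.by import By

options = webdriver.ChromeOptions()
options.add_argument('--headless')
driver = webdriver.Chrome(options=options)

driver.get('https://www.google.com/search?q=about')
for a in driver.find_elements(By.CSS_SELECTOR, 'a'):
    href = a.get_attribute('href')
    if href:
        print(href)

driver.quit()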
You may find xgoogle useful... much of what you seem to be asking for is there...
There is the twill lib for emulating a browser. I used it when I needed to log in with a Google email account. While it's a great tool with a great idea, it's pretty old and seems to lack support nowadays (the latest version was released in 2007). It might be useful if you want to retrieve results that require cookie handling or authentication; twill is likely one of the best choices for that purpose. By the way, it's based on mechanize.
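Since twill builds on mechanize, a hedged mechanize sketch gives the flavour of that approach. The URL, form index, and field names below are placeholders for whatever login form you actually face, not a real Google login flow.

# Browser emulation with mechanize (which twill builds on).
# mechanize keeps cookies across requests for you.
import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)        # ignore robots.txt for this sketch
br.open('https://example.com/login')
br.select_form(nr=0)               # pick the first form on the page
br['username'] = 'me@example.com'  # hypothetical field names
br['password'] = 'secret'
response = br.submit()
print(response.read()[:200])       # cookies from the login persist in br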
As for parsing, you are right, BeautifulSoup and Scrapy are great. One of the cool things behind BeautifulSoup is that it can handle invalid HTML (unlike Genshi, for example).
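A quick illustration of that point, assuming BeautifulSoup 4: deliberately broken markup still parses, and the tags remain findable instead of causing an error.

# BeautifulSoup tolerates broken markup: unclosed <b> and <li> tags
# below are absorbed into a usable parse tree rather than failing.
from bs4 import BeautifulSoup

broken = "<html><body><p>intro<b>bold text<li>first<li>second</html>"
soup = BeautifulSoup(broken, 'html.parser')

print(soup.prettify())           # shows the parse tree with tags closed
print(len(soup.find_all('li')))  # both <li> tags are still found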