I previously asked a question about the general idea of crawling and saving web pages. Part of the original question was: how do I crawl and save a lot of "About" pages from the Internet?
With some further research, I found some options to go ahead with, for both scraping and parsing (listed at the bottom).
Today I ran into a Ruby discussion about how to scrape Google search results. It suggests a great alternative for my problem that saves all the effort on the crawling part.
The new question is: in Python, how do I scrape Google search results for a given keyword (in this case "About") and get the links for further parsing? What are the best methods and libraries to go with, in terms of being easy to learn and easy to implement?
P.S. This website implements exactly the same thing, but it is closed source and asks for money for more results. I'd prefer to do it myself if nothing open source is available, and learn more Python in the meantime.
Oh, and by the way, advice on parsing the links out of the search results would be nice too, if any. Again: easy to learn and easy to implement. I just started learning Python. :P
Final update: problem solved. The code below uses xgoogle; please read the note in the section below to get xgoogle working.
import time, random
from xgoogle.search import GoogleSearch, SearchError

f = open('a.txt', 'wb')
for i in range(0, 2):
    wt = random.uniform(2, 5)
    gs = GoogleSearch("about")
    gs.results_per_page = 10
    gs.page = i
    results = gs.get_results()
    # Try not to annoy Google, with a random short wait
    time.sleep(wt)
    print 'This is the %dth iteration and waited %f seconds' % (i, wt)
    for res in results:
        f.write(res.url.encode("utf8"))
        f.write("\n")

print "Done"
f.close()
Note on xgoogle (answered below by Mike Pennington): the latest version from its GitHub does not work by default, probably due to changes in Google's search results. These two replies (a, b) on the tool's home page give a solution, and it currently still works with that tweak. But some day it may stop working again because of Google's changes or blocking.
Resources known so far:
For scraping, Scrapy seems to be a popular choice, and a web app called ScraperWiki is very interesting; there is another project that extracts its library for offline/local use. Mechanize was brought up quite a few times in different discussions as well.
For parsing HTML, BeautifulSoup seems to be one of the most popular choices. Of course, lxml too.
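If you end up with raw result pages saved on disk, a minimal BeautifulSoup sketch for pulling the links out might look like the following. This is only a sketch: the file name results.html is a placeholder, and Google's markup changes often, so in practice you would narrow the match beyond plain anchor tags.

# A minimal sketch: pull every link out of a saved search-results page.
# Assumes BeautifulSoup 4 (bs4) and a local file called results.html.
from bs4 import BeautifulSoup

html = open('results.html').read()
soup = BeautifulSoup(html, 'html.parser')

for a in soup.find_all('a', href=True):
    print(a['href'])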
There are two ways to scrape and dissect Google search results: the hard way and the easy way. The hard way involves writing code to use Selenium or a similar framework to drive a headless browser instance.
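As a rough, hedged sketch of that hard way, assuming Selenium 4 with a Chrome driver installed: the generic 'a' selector below is deliberately broad and would need narrowing to Google's current result-link markup.

# Drive a headless browser and read links from the rendered results page.
from selenium import webdriver
from selenium.webdriver.common.by import By

options = webdriver.ChromeOptions()
options.add_argument('--headless')
driver = webdriver.Chrome(options=options)

driver.get('https://www.google.com/search?q=about')
for a in driver.find_elements(By.CSS_SELECTOR, 'a'):
    href = a.get_attribute('href')
    if href:
        print(href)

driver.quit()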
You may find xgoogle useful... much of what you seem to be asking for is there...
There is the twill lib for emulating a browser. I used it when I needed to log in with a Google email account. While it's a great tool with a great idea, it's pretty old and seems to lack support nowadays (the latest version was released in 2007). It might be useful if you want to retrieve results that require cookie handling or authentication; twill is likely one of the best choices for that purpose. By the way, it's based on mechanize.
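Since twill builds on mechanize, a hedged mechanize sketch gives the flavour of that approach. The URL, form index, and field names below are placeholders for whatever login form you actually face, not a real Google login flow.

# Browser emulation with mechanize (which twill builds on).
# mechanize keeps cookies across requests for you.
import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)        # ignore robots.txt for this sketch
br.open('https://example.com/login')
br.select_form(nr=0)               # pick the first form on the page
br['username'] = 'me@example.com'  # hypothetical field names
br['password'] = 'secret'
response = br.submit()
print(response.read()[:200])       # cookies from the login persist in br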
As for parsing, you are right, BeautifulSoup and Scrapy are great. One of the cool things behind BeautifulSoup is that it can handle invalid HTML (unlike Genshi, for example).
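A quick illustration of that point, assuming BeautifulSoup 4: deliberately broken markup still parses, and the tags remain findable instead of causing an error.

# BeautifulSoup tolerates broken markup: unclosed <b> and <li> tags
# below are absorbed into a usable parse tree rather than failing.
from bs4 import BeautifulSoup

broken = "<html><body><p>intro<b>bold text<li>first<li>second</html>"
soup = BeautifulSoup(broken, 'html.parser')

print(soup.prettify())           # shows the parse tree with tags closed
print(len(soup.find_all('li')))  # both <li> tags are still found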