
Google Search Web Scraping with Python

I've been learning a lot of Python lately for some projects at work.

Currently I need to scrape Google search results. I found several sites that demonstrate how to use the Google AJAX Search API, but after trying it, it appears to no longer be supported. Any suggestions?

I've been searching for quite a while but can't find any solution that currently works.

pbell asked Jul 27 '16 17:07



1 Answer

You can always scrape Google's results page directly. Request the URL https://google.com/search?q=<Query>, which returns the top 10 search results.
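If the query contains spaces or special characters, it has to be URL-encoded before it goes into q. A minimal sketch using only the standard library; the helper name build_search_url is made up for illustration:

```python
from urllib.parse import urlencode


def build_search_url(query: str) -> str:
    """Return a Google search URL with the query safely URL-encoded.

    build_search_url is a hypothetical helper, not part of any library.
    """
    # urlencode escapes spaces, '&', '+', etc. so they survive in the URL.
    return "https://www.google.com/search?" + urlencode({"q": query})


print(build_search_url("web scraping python"))
```

The resulting string can be passed straight to whatever HTTP client you use.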

Then you can parse the page with lxml, for example. Depending on what you use, you can query the resulting node tree either with a CSS selector (.r a) or with an XPath selector (//h3[@class="r"]/a).
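To see what the XPath variant selects, here is a rough sketch on a tiny hand-written sample. Two assumptions: the sample markup is invented (the real results page is much messier and its class names may differ), and it uses the standard library's xml.etree.ElementTree in place of lxml, which only works because the sample is well-formed XML. ElementTree needs the path written relative (.//) but otherwise accepts the same expression.

```python
import xml.etree.ElementTree as ET

# Invented sample markup, loosely shaped like the h3.r > a structure above.
SAMPLE = """
<div>
  <h3 class="r"><a href="/url?q=https://stackoverflow.com/&amp;sa=U">Stack Overflow</a></h3>
  <h3 class="r"><a href="https://example.com/">Example</a></h3>
  <h3 class="other"><a href="https://ignored.example/">Ignored</a></h3>
</div>
"""

root = ET.fromstring(SAMPLE.strip())

# Pick every <a> that is a direct child of an <h3 class="r">.
links = [a.get("href") for a in root.findall(".//h3[@class='r']/a")]

for href in links:
    print(href)
```

With lxml on a real page, the equivalent call would be page.xpath('//h3[@class="r"]/a').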

In some cases the result URL points back to Google as a redirect. Usually it contains a query parameter q, which holds the actual target URL.
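Unwrapping such a redirect link can be sketched like this, assuming the link has the /url?q=<target> shape described above. The function name unwrap_redirect is hypothetical:

```python
from urllib.parse import urlparse, parse_qs


def unwrap_redirect(href: str) -> str:
    """Return the real target of a Google redirect link, or href unchanged."""
    if href.startswith("/url?"):
        # parse_qs maps each parameter name to a list of decoded values.
        targets = parse_qs(urlparse(href).query).get("q")
        if targets:
            return targets[0]
    # Not a redirect link (or no q parameter): leave it as it is.
    return href


print(unwrap_redirect("/url?q=https://stackoverflow.com/&sa=U"))
print(unwrap_redirect("https://example.com/"))
```

Returning the input unchanged for non-redirect links means the caller can run every href through the same function.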

Example code using lxml and requests:

from urllib.parse import urlparse, parse_qs

from lxml.html import fromstring
from requests import get

# Fetch the results page for the query "StackOverflow".
raw = get("https://www.google.com/search?q=StackOverflow").text
page = fromstring(raw)

# Note: cssselect() requires the separate "cssselect" package.
for result in page.cssselect(".r a"):
    url = result.get("href")
    # Redirect links look like /url?q=<target>&...; unwrap them.
    if url.startswith("/url?"):
        url = parse_qs(urlparse(url).query)["q"][0]
    print(url)

A note on Google banning your IP: in my experience, Google only blocks you if you start spamming it with search requests. It responds with a 503 if it thinks you are a bot.
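One way to stay polite when a 503 shows up is to wait and retry with growing delays. This is only a sketch under my own assumptions: fetch_with_backoff, its parameters, and the delay values are made up, and fetch is any callable you supply that returns a (status_code, body) pair.

```python
import time
from typing import Callable, Tuple


def fetch_with_backoff(
    fetch: Callable[[], Tuple[int, str]],
    max_attempts: int = 4,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch() until it stops returning 503, doubling the wait each time."""
    for attempt in range(max_attempts):
        status, body = fetch()
        if status != 503:
            # Anything other than 503 is handed back to the caller as-is.
            return body
        if attempt < max_attempts - 1:
            # Waits of base_delay, 2*base_delay, 4*base_delay, ...
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError("still receiving 503 after %d attempts" % max_attempts)


# Demo with a fake fetch that is rejected twice, then succeeds.
responses = iter([(503, ""), (503, ""), (200, "<html>ok</html>")])
delays = []  # records the requested waits instead of actually sleeping
body = fetch_with_backoff(lambda: next(responses), sleep=delays.append)
print(body, delays)
```

With requests, fetch could be a small function that calls get(url) and returns (response.status_code, response.text).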

StuxCrystal answered Oct 04 '22 03:10