
Parsing Google Scholar results with Python and BeautifulSoup

Given a typical keyword search in Google Scholar (see screenshot), I want to get a dictionary containing the title and url of each publication appearing on the page (e.g. results = {'title': 'Cytosolic calcium regulates ion channels in the plasma membrane of Vicia faba guard cells', 'url': 'https://www.nature.com/articles/338427a0'}).

[screenshot of a Google Scholar results page]

To retrieve the results page from Google Scholar, I am using the following code:

# Python 2; in Python 3, FancyURLopener moved to urllib.request and is deprecated
from urllib import FancyURLopener, quote_plus
from bs4 import BeautifulSoup

class AppURLOpener(FancyURLopener):
    version = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.152 Safari/537.36'

openurl = AppURLOpener().open
query = "Vicia faba"
url = 'https://scholar.google.com/scholar?q=' + quote_plus(query) + '&ie=UTF-8&oe=UTF-8&hl=en&btnG=Search'
#print url
content = openurl(url).read()
page = BeautifulSoup(content, 'lxml')
print page

This code correctly returns the results page, in (very ugly) HTML format. However, I have not been able to progress beyond this point, as I could not figure out how to use BeautifulSoup (with which I am not very familiar) to parse the results page and retrieve the data.

Notice that the issue is with parsing and extracting data from the results page, not with Google Scholar itself, since the results page is correctly retrieved by the above code.

Could anyone please give a few hints? Thanks in advance!

asked May 27 '18 by maurobio




2 Answers

Inspecting the page content shows that search results are wrapped in an h3 tag with the attribute class="gs_rt". You can use BeautifulSoup to pull out just those tags, then get the title and URL from the <a> tag inside each entry. Write each title/URL pair to a dict, and store the dicts in a list:

import requests
from bs4 import BeautifulSoup

query = "Vicia%20faba"
url = 'https://scholar.google.com/scholar?q=' + query + '&ie=UTF-8&oe=UTF-8&hl=en&btnG=Search'

content = requests.get(url).text
page = BeautifulSoup(content, 'lxml')
results = []
# Each result's title and link live in the <a> inside <h3 class="gs_rt">
for entry in page.find_all("h3", attrs={"class": "gs_rt"}):
    results.append({"title": entry.a.text, "url": entry.a['href']})

Output:

[{'title': 'Cytosolic calcium regulates ion channels in the plasma membrane of Vicia faba guard cells',
  'url': 'https://www.nature.com/articles/338427a0'},
 {'title': 'Hydrogen peroxide is involved in abscisic acid-induced stomatal closure in Vicia faba',
  'url': 'http://www.plantphysiol.org/content/126/4/1438.short'},
 ...]
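
One caveat: some results (e.g. [CITATION]-only entries) have no <a> tag inside the h3, so entry.a can be None. A defensive variant of the loop could look like this:

results = []
for entry in page.find_all("h3", attrs={"class": "gs_rt"}):
    # [CITATION]-only entries have no <a> tag, so entry.a is None for them
    if entry.a is not None:
        results.append({"title": entry.a.text, "url": entry.a["href"]})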

Note: I used requests instead of urllib, as my urllib wouldn't load FancyURLopener. But the BeautifulSoup syntax should be the same, regardless of how you get the page content.
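
For reference, a rough Python 3 equivalent of the question's fetch would use urllib.request with a browser-like User-Agent (a minimal sketch, since FancyURLopener is gone from Python 3's top-level urllib):

from urllib.parse import quote_plus
from urllib.request import Request, urlopen

from bs4 import BeautifulSoup

query = "Vicia faba"
url = ('https://scholar.google.com/scholar?q=' + quote_plus(query)
       + '&ie=UTF-8&oe=UTF-8&hl=en&btnG=Search')
# Google tends to block urllib's default User-Agent, so send a browser-like one
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
content = urlopen(req).read()
page = BeautifulSoup(content, 'lxml')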

answered Oct 05 '22 by andrew_reece


As of the time of writing, andrew_reece's answer no longer works. Even though the h3 tag with the correct class is present in the source code, the request will still fail, e.g. with a CAPTCHA, because Google detects the script as automated. Print the response to see the message.

I got this after sending too many requests:

The block will expire shortly after those requests stop.
Sometimes you may be asked to solve the CAPTCHA
if you are using advanced terms that robots are known to use, 
or sending requests very quickly.

The first thing you can do is add proxies to your request:

# https://docs.python-requests.org/en/master/user/advanced/#proxies
import os
import requests

proxies = {
    'http': os.getenv('HTTP_PROXY')  # or hard-code your proxy here instead of using os.getenv()
}

The request code will then look like this, with headers carrying a browser-like User-Agent:

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.152 Safari/537.36'
}
html = requests.get('google scholar link', headers=headers, proxies=proxies).text

Or you can make it work without proxies by using requests-HTML, Selenium, or Pyppeteer to render the page.

Code:

# If you get an empty array, it means you've hit a CAPTCHA.

from requests_html import HTMLSession
import json

session = HTMLSession()
response = session.get('https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=vicia+faba&btnG=')

# https://requests-html.kennethreitz.org/#javascript-support
response.html.render()

results = []

# .gs_ri is the container where the data we need is located
for result in response.html.find('.gs_ri'):
    title = result.find('.gs_rt', first=True).text
    # print(title)

    # absolute_links is a set of URLs; take one with next(iter(...))
    # (remove next()/iter() to see the raw set)
    url = next(iter(result.absolute_links))
    # print(url)

    results.append({
        'title': title,
        'url': url,
    })

print(json.dumps(results, indent=2, ensure_ascii=False))

Part of the output:

[
  {
    "title": "Faba bean (Vicia faba L.)",
    "url": "https://www.sciencedirect.com/science/article/pii/S0378429097000257"
  },
  {
    "title": "Nutritional value of faba bean (Vicia faba L.) seeds for feed and food",
    "url": "https://scholar.google.com/scholar?cluster=956029896799880103&hl=en&as_sdt=0,5"
  }
]
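
Since Selenium was mentioned as an alternative, here is a minimal Selenium sketch using the same selectors (assuming Selenium 4 with a Chrome driver available; it is not guaranteed to avoid CAPTCHAs either):

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('https://scholar.google.com/scholar?hl=en&q=vicia+faba')

results = []
for entry in driver.find_elements(By.CSS_SELECTOR, '.gs_ri'):
    # The title link sits inside <h3 class="gs_rt">; citation-only
    # entries have no link, so look it up defensively
    links = entry.find_elements(By.CSS_SELECTOR, '.gs_rt a')
    if links:
        results.append({'title': links[0].text,
                        'url': links[0].get_attribute('href')})

driver.quit()
print(results)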

Essentially, you can do the same with the Google Scholar API from SerpApi, but without rendering the page or using browser automation such as Selenium. You get instant JSON output, which is faster than Selenium or requests-html, with no need to figure out how to bypass Google's blocking.

It's a paid API with a trial of 5,000 searches. A completely free trial is currently under development.

Code to integrate:

from serpapi import GoogleSearch
import json

params = {
  "api_key": "YOUR_API_KEY",
  "engine": "google_scholar",
  "q": "vicia faba",
  "hl": "en"
}

search = GoogleSearch(params)
results = search.get_dict()

results_data = []

for result in results['organic_results']:
    title = result['title']
    url = result['link']

    results_data.append({
        'title': title,
        'url': url,
    })
    
print(json.dumps(results_data, indent=2, ensure_ascii=False))

Part of the output:

[
  {
    "title": "Faba bean (Vicia faba L.)",
    "url": "https://www.sciencedirect.com/science/article/pii/S0378429097000257"
  },
  {
    "title": "Nutritional value of faba bean (Vicia faba L.) seeds for feed and food",
    "url": "https://www.sciencedirect.com/science/article/pii/S0378429009002512"
  },
]

Disclaimer: I work for SerpApi.

answered Oct 05 '22 by Dmitriy Zub