
Parsing Google Scholar results with Python and BeautifulSoup

Given a typical keyword search in Google Scholar (see screenshot), I want to get a dictionary containing the title and url of each publication appearing on the page (e.g. results = {'title': 'Cytosolic calcium regulates ion channels in the plasma membrane of Vicia faba guard cells', 'url': 'https://www.nature.com/articles/338427a0'}).

[screenshot of a Google Scholar results page]

To retrieve the results page from Google Scholar, I am using the following code:

# Python 2; in Python 3, FancyURLopener moved to urllib.request and is deprecated
from urllib import FancyURLopener, quote_plus
from bs4 import BeautifulSoup

class AppURLOpener(FancyURLopener):
    version = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.152 Safari/537.36'

openurl = AppURLOpener().open
query = "Vicia faba"
url = 'https://scholar.google.com/scholar?q=' + quote_plus(query) + '&ie=UTF-8&oe=UTF-8&hl=en&btnG=Search'
#print url
content = openurl(url).read()
page = BeautifulSoup(content, 'lxml')
print page

This code correctly returns the results page, in (very ugly) HTML format. However, I have not been able to progress beyond this point, as I could not figure out how to use BeautifulSoup (with which I am not very familiar) to parse the results page and retrieve the data.

Notice that the issue is with parsing and extracting data from the results page, not with Google Scholar itself, since the results page is correctly retrieved by the above code.

Could anyone please give a few hints? Thanks in advance!

asked May 27 '18 by maurobio




2 Answers

Inspecting the page content shows that search results are wrapped in an h3 tag with the attribute class="gs_rt". You can use BeautifulSoup to pull out just those tags, then get the title and URL from the <a> tag inside each entry. Write each title/URL pair to a dict, and store the dicts in a list:

import requests
from bs4 import BeautifulSoup

query = "Vicia%20faba"
url = 'https://scholar.google.com/scholar?q=' + query + '&ie=UTF-8&oe=UTF-8&hl=en&btnG=Search'

content = requests.get(url).text
page = BeautifulSoup(content, 'lxml')
results = []
# Each result's title and link live in the <a> inside <h3 class="gs_rt">
for entry in page.find_all("h3", attrs={"class": "gs_rt"}):
    results.append({"title": entry.a.text, "url": entry.a['href']})

Output:

[{'title': 'Cytosolic calcium regulates ion channels in the plasma membrane of Vicia faba guard cells',
  'url': 'https://www.nature.com/articles/338427a0'},
 {'title': 'Hydrogen peroxide is involved in abscisic acid-induced stomatal closure in Vicia faba',
  'url': 'http://www.plantphysiol.org/content/126/4/1438.short'},
 ...]
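
One caveat: some results (e.g. [CITATION]-only entries) have no <a> tag inside the h3, so entry.a can be None. A defensive variant of the loop could look like this:

results = []
for entry in page.find_all("h3", attrs={"class": "gs_rt"}):
    # [CITATION]-only entries have no <a> tag, so entry.a is None for them
    if entry.a is not None:
        results.append({"title": entry.a.text, "url": entry.a["href"]})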

Note: I used requests instead of urllib, as my urllib wouldn't load FancyURLopener. But the BeautifulSoup syntax should be the same, regardless of how you get the page content.
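
For reference, a rough Python 3 equivalent of the question's fetch would use urllib.request with a browser-like User-Agent (a minimal sketch, since FancyURLopener is gone from Python 3's top-level urllib):

from urllib.parse import quote_plus
from urllib.request import Request, urlopen

from bs4 import BeautifulSoup

query = "Vicia faba"
url = ('https://scholar.google.com/scholar?q=' + quote_plus(query)
       + '&ie=UTF-8&oe=UTF-8&hl=en&btnG=Search')
# Google tends to block urllib's default User-Agent, so send a browser-like one
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
content = urlopen(req).read()
page = BeautifulSoup(content, 'lxml')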

answered Oct 05 '22 by andrew_reece


As of the time of writing, andrew_reece's answer no longer works. Even though the h3 tag with the correct class is present in the source code, the request will still fail, e.g. with a CAPTCHA, because Google detects the script as automated. Print the response to see the message.

I got this after sending too many requests:

The block will expire shortly after those requests stop.
Sometimes you may be asked to solve the CAPTCHA
if you are using advanced terms that robots are known to use, 
or sending requests very quickly.

The first thing you can do is add proxies to your request:

# https://docs.python-requests.org/en/master/user/advanced/#proxies
import os
import requests

proxies = {
    'http': os.getenv('HTTP_PROXY')  # or hard-code your proxy here instead of using os.getenv()
}

The request code will then look like this, with headers carrying a browser-like User-Agent:

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.152 Safari/537.36'
}
html = requests.get('google scholar link', headers=headers, proxies=proxies).text

Or you can make it work without proxies by using requests-HTML, Selenium, or Pyppeteer to render the page.

Code:

# If you get an empty array, it means you've hit a CAPTCHA.

from requests_html import HTMLSession
import json

session = HTMLSession()
response = session.get('https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=vicia+faba&btnG=')

# https://requests-html.kennethreitz.org/#javascript-support
response.html.render()

results = []

# .gs_ri is the container where the data we need is located
for result in response.html.find('.gs_ri'):
    title = result.find('.gs_rt', first=True).text
    # print(title)

    # absolute_links is a set of URLs; take one with next(iter(...))
    # (remove next()/iter() to see the raw set)
    url = next(iter(result.absolute_links))
    # print(url)

    results.append({
        'title': title,
        'url': url,
    })

print(json.dumps(results, indent=2, ensure_ascii=False))

Part of the output:

[
  {
    "title": "Faba bean (Vicia faba L.)",
    "url": "https://www.sciencedirect.com/science/article/pii/S0378429097000257"
  },
  {
    "title": "Nutritional value of faba bean (Vicia faba L.) seeds for feed and food",
    "url": "https://scholar.google.com/scholar?cluster=956029896799880103&hl=en&as_sdt=0,5"
  }
]
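
Since Selenium was mentioned as an alternative, here is a minimal Selenium sketch using the same selectors (assuming Selenium 4 with a Chrome driver available; it is not guaranteed to avoid CAPTCHAs either):

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('https://scholar.google.com/scholar?hl=en&q=vicia+faba')

results = []
for entry in driver.find_elements(By.CSS_SELECTOR, '.gs_ri'):
    # The title link sits inside <h3 class="gs_rt">; citation-only
    # entries have no link, so look it up defensively
    links = entry.find_elements(By.CSS_SELECTOR, '.gs_rt a')
    if links:
        results.append({'title': links[0].text,
                        'url': links[0].get_attribute('href')})

driver.quit()
print(results)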

Essentially, you can do the same with the Google Scholar API from SerpApi, but without rendering the page or using browser automation such as Selenium. You get instant JSON output, which is faster than Selenium or requests-html, with no need to figure out how to bypass Google's blocking.

It's a paid API with a trial of 5,000 searches. A completely free trial is currently under development.

Code to integrate:

from serpapi import GoogleSearch
import json

params = {
  "api_key": "YOUR_API_KEY",
  "engine": "google_scholar",
  "q": "vicia faba",
  "hl": "en"
}

search = GoogleSearch(params)
results = search.get_dict()

results_data = []

for result in results['organic_results']:
    title = result['title']
    url = result['link']

    results_data.append({
        'title': title,
        'url': url,
    })
    
print(json.dumps(results_data, indent=2, ensure_ascii=False))

Part of the output:

[
  {
    "title": "Faba bean (Vicia faba L.)",
    "url": "https://www.sciencedirect.com/science/article/pii/S0378429097000257"
  },
  {
    "title": "Nutritional value of faba bean (Vicia faba L.) seeds for feed and food",
    "url": "https://www.sciencedirect.com/science/article/pii/S0378429009002512"
  },
]

Disclaimer: I work for SerpApi.

answered Oct 05 '22 by Dmitriy Zub