Given a typical keyword search in Google Scholar (see screenshot), I want to get a dictionary containing the title and URL of each publication appearing on the page, e.g. results = {'title': 'Cytosolic calcium regulates ion channels in the plasma membrane of Vicia faba guard cells', 'url': 'https://www.nature.com/articles/338427a0'}.
To retrieve the results page from Google Scholar, I am using the following code:
from urllib import FancyURLopener, quote_plus
from bs4 import BeautifulSoup

class AppURLOpener(FancyURLopener):
    version = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.152 Safari/537.36'

openurl = AppURLOpener().open

query = "Vicia faba"
url = 'https://scholar.google.com/scholar?q=' + quote_plus(query) + '&ie=UTF-8&oe=UTF-8&hl=en&btnG=Search'
#print url

content = openurl(url).read()
page = BeautifulSoup(content, 'lxml')
print page
This code correctly returns the results page, in (very ugly) HTML format. However, I have not been able to progress beyond this point, as I could not figure out how to use BeautifulSoup (with which I am not very familiar) to parse the results page and retrieve the data.
Note that the issue is with parsing and extracting data from the results page, not with Google Scholar itself, since the results page is correctly retrieved by the code above.
Could anyone please give a few hints? Thanks in advance!
Inspecting the page content shows that each search result is wrapped in an h3 tag with attribute class="gs_rt". You can use BeautifulSoup to pull out just those tags, then get the title and URL from the <a> tag inside each entry. Write each title/URL pair to a dict, and store them in a list of dicts:
import requests
from bs4 import BeautifulSoup

query = "Vicia%20faba"
url = 'https://scholar.google.com/scholar?q=' + query + '&ie=UTF-8&oe=UTF-8&hl=en&btnG=Search'

content = requests.get(url).text
page = BeautifulSoup(content, 'lxml')

results = []
for entry in page.find_all("h3", attrs={"class": "gs_rt"}):
    results.append({"title": entry.a.text, "url": entry.a['href']})
Output:
[{'title': 'Cytosolic calcium regulates ion channels in the plasma membrane of Vicia faba guard cells',
'url': 'https://www.nature.com/articles/338427a0'},
{'title': 'Hydrogen peroxide is involved in abscisic acid-induced stomatal closure in Vicia faba',
'url': 'http://www.plantphysiol.org/content/126/4/1438.short'},
...]
Note: I used requests instead of urllib, as my urllib wouldn't load FancyURLopener. But the BeautifulSoup syntax should be the same regardless of how you get the page content.
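For reference, a rough Python 3 standard-library equivalent of the original fetch, in case you want to avoid an extra dependency (a sketch: urllib.request with a custom User-Agent stands in for FancyURLopener; the UA string is just an example):

from urllib.parse import quote_plus
from urllib.request import Request, urlopen

from bs4 import BeautifulSoup

query = "Vicia faba"
url = ('https://scholar.google.com/scholar?q=' + quote_plus(query)
       + '&ie=UTF-8&oe=UTF-8&hl=en&btnG=Search')

# urllib.request lets you set the User-Agent directly on the Request object
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
content = urlopen(req).read()
page = BeautifulSoup(content, 'lxml')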
The answer from andrew_reece isn't working at the moment of writing this: even though the h3 tag with the correct class is present in the page source, the request will still fail, e.g. with a CAPTCHA, because Google has detected your script as automated. Print the response to see the message.
I got this message after sending too many requests:
"The block will expire shortly after those requests stop. Sometimes you may be asked to solve the CAPTCHA if you are using advanced terms that robots are known to use, or sending requests very quickly."
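A quick way to check whether you've hit this block is to look at the raw response before parsing (a sketch; the exact wording and status code of Google's block page may vary):

import requests

response = requests.get('https://scholar.google.com/scholar?hl=en&q=vicia+faba')

# A blocked request typically returns a non-200 status (often 429) and a body
# that mentions a CAPTCHA / "not a robot" instead of search results.
print(response.status_code)
print(response.text[:500])  # print the start of the body to see the block message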
The first thing you can do is to add proxies to your request:

# https://docs.python-requests.org/en/master/user/advanced/#proxies
import os

proxies = {
    'http': os.getenv('HTTP_PROXY')  # Or just type your proxy here without os.getenv()
}
The request code will then look like this:
html = requests.get('google scholar link', headers=headers, proxies=proxies).text
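Note that headers isn't defined in the snippet above; it would normally just carry a browser-like user-agent. A minimal sketch tying the pieces together (the user-agent string and the HTTP_PROXY environment variable are only placeholders):

import os
import requests

headers = {
    # any realistic browser user-agent; this one is just an example
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36'
}
proxies = {
    'http': os.getenv('HTTP_PROXY')
}

html = requests.get('https://scholar.google.com/scholar?hl=en&q=vicia+faba',
                    headers=headers, proxies=proxies).text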
Alternatively, you can make it work without proxies by using requests-HTML, selenium, or pyppeteer and just rendering the page (a selenium sketch follows the requests-html example below).
Code:
# If you get an empty list, it means you got a CAPTCHA.
from requests_html import HTMLSession
import json

session = HTMLSession()
response = session.get('https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=vicia+faba&btnG=')

# https://requests-html.kennethreitz.org/#javascript-support
response.html.render()

results = []

# Container where the data we need is located
for result in response.html.find('.gs_ri'):
    title = result.find('.gs_rt', first=True).text
    # print(title)

    # picking one URL out of the set of absolute links (see what happens without next()/iter())
    url = next(iter(result.absolute_links))
    # print(url)

    results.append({
        'title': title,
        'url': url,
    })

print(json.dumps(results, indent=2, ensure_ascii=False))
Part of the output:
[
  {
    "title": "Faba bean (Vicia faba L.)",
    "url": "https://www.sciencedirect.com/science/article/pii/S0378429097000257"
  },
  {
    "title": "Nutritional value of faba bean (Vicia faba L.) seeds for feed and food",
    "url": "https://scholar.google.com/scholar?cluster=956029896799880103&hl=en&as_sdt=0,5"
  }
]
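For completeness, here is a roughly equivalent selenium sketch (assuming Chrome and a matching chromedriver are installed locally; the CSS selectors mirror the ones used above):

import json
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # assumes chromedriver is available on PATH
driver.get('https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=vicia+faba&btnG=')

results = []
# each result title lives in an <a> inside the h3.gs_rt container
for link in driver.find_elements(By.CSS_SELECTOR, '.gs_rt a'):
    results.append({'title': link.text, 'url': link.get_attribute('href')})

driver.quit()
print(json.dumps(results, indent=2, ensure_ascii=False))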
Essentially, you can do the same thing with the Google Scholar API from SerpApi, but without having to render the page or use browser automation such as selenium to get data from Google Scholar. You get instant JSON output, which is faster than selenium or requests-html, and you don't have to think about how to bypass Google's blocking.
It's a paid API with a trial of 5,000 searches. A completely free trial is currently under development.
Code to integrate:
from serpapi import GoogleSearch
import json

params = {
    "api_key": "YOUR_API_KEY",
    "engine": "google_scholar",
    "q": "vicia faba",
    "hl": "en"
}

search = GoogleSearch(params)
results = search.get_dict()

results_data = []

for result in results['organic_results']:
    title = result['title']
    url = result['link']

    results_data.append({
        'title': title,
        'url': url,
    })

print(json.dumps(results_data, indent=2, ensure_ascii=False))
Part of the output:
[
  {
    "title": "Faba bean (Vicia faba L.)",
    "url": "https://www.sciencedirect.com/science/article/pii/S0378429097000257"
  },
  {
    "title": "Nutritional value of faba bean (Vicia faba L.) seeds for feed and food",
    "url": "https://www.sciencedirect.com/science/article/pii/S0378429009002512"
  }
]
Disclaimer, I work for SerpApi.