I'd like to use Python to scrape Google Scholar search results. I found two different scripts to do that: one is gscholar.py and the other is scholar.py (can that one be used as a Python library?).
Now, I should maybe say that I'm totally new to Python, so sorry if I miss the obvious!
The problem is that when I use gscholar.py as explained in the README file, I get as a result: query() takes at least 2 arguments (1 given).
Even when I specify another argument (e.g. gscholar.query("my query", allresults=True)), I get: query() takes at least 2 arguments (2 given).
This puzzles me. I also tried to specify the third possible argument (outformat=4, which is the BibTeX format), but this gives me a list of function errors. A colleague advised me to import BeautifulSoup before running the query, but that doesn't change the problem either. Any suggestions on how to solve this?
I found code for R (see link) as a solution, but I quickly got blocked by Google. Maybe someone could suggest how to improve that code to avoid being blocked? Any help would be appreciated! Thanks!
It looks like scraping with Python and R runs into the problem of Google Scholar seeing your request as a robot query because the request lacks a user-agent. There is a similar question on Stack Exchange about downloading all PDFs linked from a web page, and the answer leads the user to wget on Unix and the BeautifulSoup package in Python.
Curl also seems to be a more promising direction.
I suggest not using libraries written for crawling specific websites, but instead using general-purpose HTML libraries that are well tested and well documented, such as BeautifulSoup.
To access websites while presenting browser-like information, you can use a URL opener class with a custom user agent:
from urllib import FancyURLopener  # Python 2; in Python 3 this lives in urllib.request

class MyOpener(FancyURLopener):
    # Present a regular browser user agent instead of the default Python one
    version = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.152 Safari/537.36'

openurl = MyOpener().open
And then download the required URL as follows:
openurl(url).read()
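If you are on Python 3, where FancyURLopener is deprecated, a minimal sketch of the same user-agent trick using urllib.request would look like this (the USER_AGENT constant and the openurl helper name are my own, not part of any library):

from urllib.request import Request, urlopen

# Browser-like user agent string copied from the snippet above
USER_AGENT = ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/33.0.1750.152 Safari/537.36')

def openurl(url):
    # Send the request with a browser-like User-Agent header instead of the default Python one
    return urlopen(Request(url, headers={'User-Agent': USER_AGENT}))

html = openurl(url).read()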
To retrieve Scholar results, just use a URL of the form http://scholar.google.se/scholar?hl=en&q=${query}.
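If your query contains spaces or special characters, you will probably want to URL-encode it first. A small sketch (Python 3 shown; in Python 2, quote_plus lives in urllib, and the query string here is just a placeholder):

from urllib.parse import quote_plus

query = 'my search terms'  # placeholder query
url = 'http://scholar.google.se/scholar?hl=en&q=' + quote_plus(query)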
To extract pieces of information from a retrieved HTML file, you could use this piece of code:
from bs4 import SoupStrainer, BeautifulSoup
# Parse only the div with id "gs_ab_md" (the results-count bar) instead of the whole page
page = BeautifulSoup(openurl(url).read(), 'html.parser', parse_only=SoupStrainer('div', id='gs_ab_md'))
This piece of code extracts the specific div element that contains the number of results shown on a Google Scholar search results page.
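As a minimal usage sketch (assuming the gs_ab_md markup mentioned above is still what Scholar serves), you could then pull the visible text out of that div like this:

results_div = page.find('div', id='gs_ab_md')
if results_div is not None:
    # Prints something like the "About N results (x sec)" line
    print(results_div.get_text())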