Extract Google Scholar results using Python (or R)

I'd like to use Python to scrape Google Scholar search results. I found two different scripts to do that: one is gscholar.py and the other is scholar.py (can that one be used as a Python library?).

Now, I should maybe say that I'm totally new to Python, so sorry if I miss the obvious!

The problem is that when I use gscholar.py as explained in the README file, I get as a result:

query() takes at least 2 arguments (1 given).

Even when I specify another argument (e.g. gscholar.query("my query", allresults=True)), I get

query() takes at least 2 arguments (2 given).

This puzzles me. I also tried to specify the third possible argument (outformat=4, which is the BibTeX format), but this gives me a list of function errors. A colleague advised me to import BeautifulSoup and this before running the query, but that doesn't change the problem either. Any suggestions on how to solve the problem?

I found code for R (see link) as a solution but got quickly blocked by Google. Maybe someone could suggest how to improve that code to avoid being blocked? Any help would be appreciated! Thanks!

asked Nov 02 '12 by Flow



2 Answers

It looks like scraping with both Python and R runs into the same problem: Google Scholar sees your request as a robot query because it lacks a user-agent header. There is a similar question on StackExchange about downloading all PDFs linked from a web page, and the answer leads the user to wget on Unix and the BeautifulSoup package in Python.

curl also seems to be a promising direction.
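
For illustration, here is a minimal sketch of the user-agent idea using Python 2's standard urllib2 module (the header value and query URL are just examples, not part of the original answer):

import urllib2

# Send a browser-like User-Agent so Google Scholar does not flag the
# request as a robot (the same idea as curl -A 'Mozilla/5.0 ...' <url>)
url = 'http://scholar.google.com/scholar?hl=en&q=test'
request = urllib2.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
html = urllib2.urlopen(request).read()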

answered Oct 13 '22 by y-i_guy


I suggest not using libraries written for crawling specific websites, but instead using general-purpose HTML libraries that are well tested and well documented, such as BeautifulSoup.

To access websites while presenting browser information, you could use a URL opener class with a custom user agent:

from urllib import FancyURLopener  # Python 2; in Python 3 this lives in urllib.request

class MyOpener(FancyURLopener):
    # Report a browser-like user agent instead of Python's default one
    version = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.152 Safari/537.36'

openurl = MyOpener().open

Then download the required URL as follows:

openurl(url).read()

To retrieve Scholar results, just use a URL of the form http://scholar.google.se/scholar?hl=en&q=${query}.
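
For example (a sketch using Python 2's standard urllib; the query text is just a placeholder), the query should be URL-encoded before being substituted in:

from urllib import quote_plus  # Python 2; urllib.parse.quote_plus in Python 3

query = 'my query'
url = 'http://scholar.google.se/scholar?hl=en&q=' + quote_plus(query)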

To extract pieces of information from a retrieved HTML file, you could use this piece of code:

from bs4 import SoupStrainer, BeautifulSoup

# Parse only the <div id="gs_ab_md"> element instead of the whole page
page = BeautifulSoup(openurl(url).read(), parse_only=SoupStrainer('div', id='gs_ab_md'))

This piece of code extracts only the div element that contains the number of results shown on a Google Scholar search results page.
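
As a usage sketch (the exact text depends on Google's current markup, so treat it as illustrative):

# The strained soup contains only the matched div; its text is the
# result-count line, e.g. "About 12,300 results (0.03 sec)"
print page.get_text().strip()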

answered Oct 13 '22 by Julia