I'm working on a project to analyse how journal articles are cited. I have a large file of journal article names. I intend to pass them to Google Scholar and see how many citations each has.
Here is the strategy I am following:
Use "scholar.py" from http://www.icir.org/christian/scholar.html. This is a pre-written Python script that searches Google Scholar and returns information on the first hit in CSV format (including the number of citations).
Google Scholar blocks you after a certain number of searches (I have roughly 3,000 article titles to query). I have found that most people use Tor (see "How to make urllib2 requests through Tor in Python?" and "Prevent Custom Web Crawler from being blocked") to solve this problem. Tor is a service that gives you a random IP address every few minutes.
I have scholar.py and Tor both successfully set up and working. I'm not very familiar with Python or the urllib2 library, and I wonder what modifications are needed to scholar.py so that queries are routed through Tor.
I am also amenable to suggestions for an easier (and potentially considerably different) approach for mass google scholar queries if one exists.
Thanks in advance
For me the best way to use Tor is to set up a local HTTP proxy like polipo. I like to clone the repo and compile it locally:
git clone https://github.com/jech/polipo.git
cd polipo
make all
make install
But you can use your package manager instead: brew install polipo on macOS, or apt install polipo on Ubuntu. Then write a simple config file:
echo socksParentProxy=localhost:9050 > ~/.polipo
echo diskCacheRoot='""' >> ~/.polipo
echo disableLocalInterface=true >> ~/.polipo
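After those three commands, ~/.polipo should contain the following (9050 is Tor's default SOCKS port; adjust it if your Tor daemon listens elsewhere):

```
socksParentProxy=localhost:9050
diskCacheRoot=""
disableLocalInterface=true
```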
Then run it:
polipo
See the urllib docs on how to use a proxy. Like many Unix applications, urllib honors the http_proxy environment variable:
export http_proxy="http://localhost:8123"
export https_proxy="http://localhost:8123"
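With those variables exported, any Python process started from the same shell inherits them, and urllib picks them up automatically. A minimal sketch of what that looks like from inside Python (assuming polipo is on its default port 8123; scholar.py uses Python 2's urllib2, but the idea is the same with Python 3's urllib.request):

```python
import os
import urllib.request  # scholar.py uses the Python 2 equivalent, urllib2

# Same effect as exporting http_proxy/https_proxy in the shell
os.environ["http_proxy"] = "http://localhost:8123"
os.environ["https_proxy"] = "http://localhost:8123"

# urllib consults these environment variables when choosing a proxy
proxies = urllib.request.getproxies()

# To be explicit rather than rely on the environment, install a
# ProxyHandler globally; every later urlopen() call then goes
# through polipo (and from there through Tor)
opener = urllib.request.build_opener(urllib.request.ProxyHandler(proxies))
urllib.request.install_opener(opener)
```

Since the opener is installed globally, scholar.py's own urlopen calls should go through the proxy without further changes.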
I like to use the requests library, a nicer wrapper for urllib. If you don't have it already:
pip install requests
If traffic is going through Tor, the following one-liner should print True:
python -c "import requests; print('Congratulations' in requests.get('http://check.torproject.org/').text)"
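If you'd rather not depend on environment variables, requests also accepts an explicit proxies dict per call. A sketch, again assuming polipo is listening on localhost:8123:

```python
import requests

# Point both schemes at the local polipo proxy, which forwards to Tor
TOR_PROXIES = {
    "http": "http://localhost:8123",
    "https": "http://localhost:8123",
}

def fetch(url, timeout=30):
    # Routed through polipo -> Tor; the timeout avoids hanging
    # indefinitely if the Tor circuit is slow
    return requests.get(url, proxies=TOR_PROXIES, timeout=timeout)
```

You could wrap the Scholar queries in a helper like this and add a short sleep between calls, which also reduces the chance of being blocked.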
One last thing, beware: the Tor network is not a free pass for doing silly things on the Internet, because even when using it you should not assume you are totally anonymous.