I'm working on a project to analyse how journal articles are cited. I have a large file of journal article names. I intend to pass them to Google Scholar and see how many citations each has.
Here is the strategy I am following:
Use "scholar.py" from http://www.icir.org/christian/scholar.html. This is a pre-written Python script that searches Google Scholar and returns information on the first hit in CSV format (including the number of citations).
Google Scholar blocks you after a certain number of searches (I have roughly 3,000 article titles to query). I have found that most people use Tor (see "How to make urllib2 requests through Tor in Python?" and "Prevent Custom Web Crawler from being blocked") to solve this problem. Tor is a service that gives you a random IP address every few minutes.
I have scholar.py and Tor both successfully set up and working. I'm not very familiar with Python or the urllib2 library, and I wonder what modifications are needed to scholar.py so that queries are routed through Tor.
I am also amenable to suggestions for an easier (and potentially considerably different) approach for mass google scholar queries if one exists.
Thanks in advance
For me the best way to use Tor is to set up a local HTTP proxy like polipo. I like to clone the repo and compile it locally:
git clone https://github.com/jech/polipo.git
cd polipo
make all
make install
But you can use your package manager instead: brew install polipo on macOS, or apt install polipo on Ubuntu. Then write a simple config file:
echo socksParentProxy=localhost:9050 > ~/.polipo
echo diskCacheRoot='""' >> ~/.polipo
echo disableLocalInterface=true >> ~/.polipo
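After those three commands, ~/.polipo should contain the following (9050 is Tor's default SOCKS port; adjust it if your Tor daemon listens elsewhere):

```
socksParentProxy=localhost:9050
diskCacheRoot=""
disableLocalInterface=true
```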
Then run it:
polipo
See the urllib docs on how to use a proxy. Like many Unix applications, urllib honors the http_proxy environment variable:
export http_proxy="http://localhost:8123"
export https_proxy="http://localhost:8123"
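With those variables exported, any Python process started from the same shell inherits them, and urllib picks them up automatically. A minimal sketch of what that looks like from inside Python (assuming polipo is on its default port 8123; scholar.py uses Python 2's urllib2, but the idea is the same with Python 3's urllib.request):

```python
import os
import urllib.request  # scholar.py uses the Python 2 equivalent, urllib2

# Same effect as exporting http_proxy/https_proxy in the shell
os.environ["http_proxy"] = "http://localhost:8123"
os.environ["https_proxy"] = "http://localhost:8123"

# urllib consults these environment variables when choosing a proxy
proxies = urllib.request.getproxies()

# To be explicit rather than rely on the environment, install a
# ProxyHandler globally; every later urlopen() call then goes
# through polipo (and from there through Tor)
opener = urllib.request.build_opener(urllib.request.ProxyHandler(proxies))
urllib.request.install_opener(opener)
```

Since the opener is installed globally, scholar.py's own urlopen calls should go through the proxy without further changes.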
I like to use the requests library, a nicer wrapper for urllib. If you don't have it already:
pip install requests
If traffic is going through Tor, the following one-liner should print True:
python -c "import requests; print('Congratulations' in requests.get('http://check.torproject.org/').text)"
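If you'd rather not depend on environment variables, requests also accepts an explicit proxies dict per call. A sketch, again assuming polipo is listening on localhost:8123:

```python
import requests

# Point both schemes at the local polipo proxy, which forwards to Tor
TOR_PROXIES = {
    "http": "http://localhost:8123",
    "https": "http://localhost:8123",
}

def fetch(url, timeout=30):
    # Routed through polipo -> Tor; the timeout avoids hanging
    # indefinitely if the Tor circuit is slow
    return requests.get(url, proxies=TOR_PROXIES, timeout=timeout)
```

You could wrap the Scholar queries in a helper like this and add a short sleep between calls, which also reduces the chance of being blocked.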
One last thing, beware: the Tor network is not a free pass for doing silly things on the Internet, because even when using it you should not assume you are totally anonymous.