
Downloading all PDF files from Google Scholar search results using wget

I'd like to write a simple web spider or just use wget to download pdf results from google scholar. That would actually be quite a spiffy way to get papers for research.

I have read the following pages on stackoverflow:

Crawl website using wget and limit total number of crawled links

How do web spiders differ from Wget's spider?

Downloading all PDF files from a website

How to download all files (but not HTML) from a website using wget?

The last page is probably the most inspirational of all. I did try using wget as suggested there.

My Google Scholar search result page is thus, but nothing was downloaded.

Given that my level of understanding of webspiders is minimal, what should I do to make this possible? I do realize that writing a spider is perhaps very involved and is a project I may not want to undertake. If it is possible using wget, that would be absolutely awesome.

Asked Sep 04 '12 by dearN

People also ask

How do I download all files using wget?

To download multiple files using wget, create a .txt file containing the URLs of the files you wish to download, one per line. Then run wget with the -i option followed by the name of that .txt file.
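For illustration, here is a minimal sketch of that list-file approach, driven from a short Python script (wget invoked as a subprocess, in the spirit of the answer below); the file name urls.txt and the example URLs are placeholders, not anything from the question.

import subprocess

# Placeholder URLs; in practice, list the PDFs you actually want to fetch.
urls = [
    "http://example.com/paper1.pdf",
    "http://example.com/paper2.pdf",
]

# Write one URL per line, then let wget read the list via -i.
with open("urls.txt", "w") as f:
    f.write("\n".join(urls) + "\n")

subprocess.run(["wget", "-i", "urls.txt"])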


1 Answer

wget -e robots=off -H --user-agent="Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.3) Gecko/2008092416 Firefox/3.0.3" -r -l 1 -nd -A pdf "http://scholar.google.com/scholar?q=filetype%3Apdf+liquid+films&btnG=&hl=en&as_sdt=0%2C23"

A few things to note:

  1. Use of filetype:pdf in the search query
  2. One level of recursion (-r -l 1)
  3. -A pdf to accept only PDFs
  4. -H to span hosts
  5. -e robots=off and the --user-agent override ensure the best results: Google Scholar rejects a blank user agent, and PDF repositories are likely to disallow robots.

The limitation, of course, is that this will only hit the first page of results. You could expand the depth of recursion, but it will run wild and take forever. I would recommend using a combination of something like Beautiful Soup and wget subprocesses, so that you can parse and traverse the search results strategically.
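For what it's worth, here is a rough, untested sketch of that combination: it pulls the first few result pages with the same user agent as above, uses Beautiful Soup to collect links ending in .pdf, and hands each one to a wget subprocess. The start pagination parameter and the .pdf link filter are assumptions on my part, and Google Scholar's markup and rate limiting may well defeat this.

import subprocess
import urllib.request

from bs4 import BeautifulSoup  # pip install beautifulsoup4

# Same user agent and query as the wget command above.
USER_AGENT = ("Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.3) "
              "Gecko/2008092416 Firefox/3.0.3")
BASE_URL = ("http://scholar.google.com/scholar"
            "?q=filetype%3Apdf+liquid+films&hl=en&as_sdt=0%2C23")

def pdf_links(page_url):
    # Fetch one results page and return every href that ends in .pdf.
    request = urllib.request.Request(page_url, headers={"User-Agent": USER_AGENT})
    html = urllib.request.urlopen(request).read()
    soup = BeautifulSoup(html, "html.parser")
    return [a["href"] for a in soup.find_all("a", href=True)
            if a["href"].lower().endswith(".pdf")]

# Walk the first three result pages (10 results per page) instead of letting
# wget recurse blindly, and download each PDF with a wget subprocess.
for start in range(0, 30, 10):
    for url in pdf_links(BASE_URL + "&start=" + str(start)):
        subprocess.run(["wget", "--user-agent=" + USER_AGENT, "-nd", url])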

Answered Nov 12 '22 by dongle