I'd like to write a simple web spider, or just use wget, to download PDF results from Google Scholar. That would actually be quite a spiffy way to get papers for research.
I have read the following pages on Stack Overflow:
Crawl website using wget and limit total number of crawled links
How do web spiders differ from Wget's spider?
Downloading all PDF files from a website
How to download all files (but not HTML) from a website using wget?
The last page is probably the most inspirational of all. I did try using wget as suggested there on my Google Scholar search results page, but nothing was downloaded.
Given that my level of understanding of web spiders is minimal, what should I do to make this possible? I do realize that writing a spider is perhaps very involved and is a project I may not want to undertake. If it is possible using wget, that would be absolutely awesome.
In order to download multiple files using wget, you can create a .txt file listing the URLs of the files you wish to download, one per line, and then run wget with the -i option followed by the name of that file.
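For example, assuming urls.txt contains one direct PDF link per line (the file name here is just a placeholder):

wget -i urls.txt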
wget -e robots=off -H --user-agent="Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.3) Gecko/2008092416 Firefox/3.0.3" -r -l 1 -nd -A pdf "http://scholar.google.com/scholar?q=filetype%3Apdf+liquid+films&btnG=&hl=en&as_sdt=0%2C23"
A few things to note about that command:
The search query itself includes filetype:pdf, so Scholar favours results that link directly to PDFs.
-e robots=off tells wget to ignore robots.txt, which Google Scholar uses to block crawlers.
-H lets the recursion span hosts, since the PDFs themselves are hosted on other domains, not on scholar.google.com.
--user-agent makes the request look like it comes from a browser; Google typically blocks the default wget user agent.
-r -l 1 recurses just one level deep, i.e. only the links on the results page itself.
-nd drops the host directory structure, and -A pdf keeps only files ending in .pdf.
The limitation of course is that this will only hit the first page of results. You could expand the depth of recursion, but this will run wild and take forever. I would recommend using a combination of something like Beautiful Soup and wget subprocesses, so that you can parse and traverse the search results strategically.
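As a rough illustration of that combination, here is a minimal sketch in Python, assuming requests and BeautifulSoup are installed. The query and user-agent string are copied from the wget command above, the link filter is only a guess at what Scholar's markup exposes, and Scholar may well block or CAPTCHA automated requests like this:

import subprocess
import requests
from bs4 import BeautifulSoup

# Pretend to be a browser, as with the wget command above.
headers = {"User-Agent": "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.3) "
                         "Gecko/2008092416 Firefox/3.0.3"}
params = {"q": "filetype:pdf liquid films", "hl": "en", "as_sdt": "0,23"}

resp = requests.get("http://scholar.google.com/scholar", params=params, headers=headers)
soup = BeautifulSoup(resp.text, "html.parser")

# Crude filter: keep absolute links that look like direct PDFs.
# Scholar's markup changes, so this will likely need adjusting.
pdf_links = [a["href"] for a in soup.find_all("a", href=True)
             if a["href"].startswith("http") and a["href"].lower().endswith(".pdf")]

for link in pdf_links:
    # Hand each PDF off to a wget subprocess.
    subprocess.run(["wget", link])

To traverse further result pages, you could loop over Scholar's start parameter (start=10, start=20, ...) in the params dict and decide per page which links are worth fetching, instead of letting recursive wget run wild.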