
wget downloading only PDFs from website

Tags: pdf, wget

I am trying to download all PDFs from http://www.fayette-pva.com/.

I believe the problem is that when I hover over a PDF download link, Chrome shows a URL in the bottom-left corner that has no .pdf file extension. I found and used an answer from a similar forum question, but in that case the URLs shown on hover did end in .pdf. I have tried the same code from the link below, but it doesn't pick up the PDF files.

Here is the code I have been testing with:

wget --no-directories -e robots=off -A.pdf -r -l1 \
    http://www.fayette-pva.com/sales-reports/salesreport03-feb-09feb2015/

I am using this on a single page that I know has a PDF on it.

The complete command would then be something like:

wget --no-directories -e robots=off -A.pdf -r http://www.fayette-pva.com/

Related answer: WGET problem downloading pdfs from website

I am not sure whether downloading the entire website would work, or whether it would take forever. How do I get around this and download only the PDFs?

asked Feb 18 '15 by user18101

People also ask

How do I download all files using wget?

In order to download multiple files using wget, you need to create a .txt file and insert the URLs of the files you wish to download. After inserting the URLs into the file, use the wget command with the -i option followed by the name of the .txt file containing the URLs.
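
For example, a minimal sketch (urls.txt and the example URLs are placeholders, not taken from the original question):

# urls.txt contains one URL per line, for example:
#   https://example.com/report-01.pdf
#   https://example.com/report-02.pdf
# Download every URL listed in the file:
wget -i urls.txt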

How do I download a PDF in Linux?

Download files from the Linux terminal using the wget command. wget is perhaps the most used command-line download manager for Linux and UNIX-like systems. You can download a single file, multiple files, an entire directory, or even an entire website using wget. wget is non-interactive and can easily work in the background.
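
For instance, a minimal sketch with a placeholder URL:

# Download a single file into the current directory
wget https://example.com/document.pdf

# Resume an interrupted download of the same file
wget -c https://example.com/document.pdf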

How do I download an entire website for offline wget?

To make an offline copy of a site with wget, use the following command: wget --mirror --convert-links --adjust-extension --page-requisites --no-parent https://site-to-download.com. Replace the https://site-to-download.com portion with the URL of the site you actually want to mirror.
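
For readability, here is the same mirror command laid out as you would run it (https://site-to-download.com is a placeholder):

wget --mirror --convert-links --adjust-extension \
    --page-requisites --no-parent https://site-to-download.com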


1 Answer

Yes, the problem is precisely what you stated: The URLs do not contain regular or absolute filenames, but are calls to a script/servlet/... which hands out the actual files.

The solution is to use the --content-disposition option, which tells wget to honor the Content-Disposition field in the HTTP response; that field carries the actual filename:

HTTP/1.1 200 OK
(...)
Content-Disposition: attachment; filename="SalesIndexThru09Feb2015.pdf"
(...)
Connection: close

wget has supported this option at least since version 1.11.4, which is already about 7 years old.
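
A quick way to check which version you have installed (wget prints its version on the first line):

wget --version | head -n1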

So you would do the following:

wget --no-directories --content-disposition -e robots=off -A.pdf -r \
    http://www.fayette-pva.com/
answered Nov 11 '22 by zb226