
wget downloading only PDFs from website

Tags: pdf, wget

I am trying to download all PDFs from http://www.fayette-pva.com/.

I believe the problem is that when I hover over a PDF download link, Chrome shows a URL in the bottom-left corner that has no .pdf file extension. I found and used an answer from a similar forum question, but in that case the URLs shown on hover did end in .pdf. I have tried the same code from the link below, but it doesn't pick up the PDF files.

Here is the code I have been testing with:

wget --no-directories -e robots=off -A.pdf -r -l1 \
    http://www.fayette-pva.com/sales-reports/salesreport03-feb-09feb2015/

I am using this on a single page that I know has a PDF on it.

The complete command would then be something like:

wget --no-directories -e robots=off -A.pdf -r http://www.fayette-pva.com/

Related answer: WGET problem downloading pdfs from website

I am not sure whether downloading the entire website would work, or whether it would take forever. How do I get around this and download only the PDFs?

asked Feb 18 '15 by user18101

People also ask

How do I download all files using wget?

In order to download multiple files using wget, you need to create a .txt file and insert the URLs of the files you wish to download. After inserting the URLs into the file, use the wget command with the -i option followed by the name of the .txt file containing the URLs.
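
For example, a minimal sketch (urls.txt and the example URLs are placeholders, not taken from the original question):

# urls.txt contains one URL per line, for example:
#   https://example.com/report-01.pdf
#   https://example.com/report-02.pdf
# Download every URL listed in the file:
wget -i urls.txt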

How do I download a PDF in Linux?

Download files from the Linux terminal using the wget command. wget is perhaps the most used command-line download manager for Linux and UNIX-like systems. You can download a single file, multiple files, an entire directory, or even an entire website using wget. wget is non-interactive and can easily work in the background.
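
For instance, a minimal sketch with a placeholder URL:

# Download a single file into the current directory
wget https://example.com/document.pdf

# Resume an interrupted download of the same file
wget -c https://example.com/document.pdf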

How do I download an entire website for offline wget?

To make an offline copy of a site with wget, use the following command: wget --mirror --convert-links --adjust-extension --page-requisites --no-parent https://site-to-download.com. Replace the https://site-to-download.com portion with the URL of the site you actually want to mirror.
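
For readability, here is the same mirror command laid out as you would run it (https://site-to-download.com is a placeholder):

wget --mirror --convert-links --adjust-extension \
    --page-requisites --no-parent https://site-to-download.com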


1 Answer

Yes, the problem is precisely what you stated: The URLs do not contain regular or absolute filenames, but are calls to a script/servlet/... which hands out the actual files.

The solution is to use the --content-disposition option, which tells wget to honor the Content-Disposition field in the HTTP response; that field carries the actual filename:

HTTP/1.1 200 OK
(...)
Content-Disposition: attachment; filename="SalesIndexThru09Feb2015.pdf"
(...)
Connection: close

wget has supported this option at least since version 1.11.4, which is already about 7 years old.
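
A quick way to check which version you have installed (wget prints its version on the first line):

wget --version | head -n1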

So you would do the following:

wget --no-directories --content-disposition -e robots=off -A.pdf -r \
    http://www.fayette-pva.com/
answered Nov 11 '22 by zb226