I have a wget download I'm trying to perform.
It downloads several thousand files, unless I start to restrict the file type (junk files etc). In theory restricting the file type is fine.
However, wget downloads lots of files without a file extension that, when opened manually (with Adobe, for example), turn out to be PDFs. These are the files I actually want.
Restricting the wget to filetype PDF does not download these files.
So far my syntax is wget -r --no-parent -A.pdf www.websitehere.com
Using wget -r --no-parent www.websitehere.com brings me every file type, so in theory I have everything. But this means I have thousands of junk files to remove, and then several hundred useful files of unknown file type to rename.
Any ideas on how to wget and save the files with the appropriate file extension?
Alternatively, is there a way to restrict the wget to only files without a file extension, and then a separate batch method to determine the file type and rename appropriately?
Manually testing every file to determine the appropriate application will take a lot of time.
Appreciate any help!
wget has an --adjust-extension option, which will add the correct extensions to HTML and CSS files. Other files (like PDFs) may not work, though. See the complete documentation here.
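If --adjust-extension does not cover the PDFs, the batch rename mentioned in the question could be done after the full download. Below is a minimal Python sketch, not a wget feature: it assumes a file is a PDF when its first bytes are the "%PDF-" signature, and it only touches files that have no extension at all. The function name rename_extensionless_pdfs is made up for this example.

```python
from pathlib import Path

PDF_MAGIC = b"%PDF-"  # assumption: a PDF starts with this signature at byte 0


def rename_extensionless_pdfs(root):
    """Append .pdf to extensionless files under root that start with the PDF signature.

    Returns the list of new paths. Hypothetical helper, for illustration only.
    """
    renamed = []
    # sorted() materializes the listing first, so renaming does not disturb the walk
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix:
            continue  # skip directories and files that already have an extension
        with path.open("rb") as fh:
            header = fh.read(len(PDF_MAGIC))
        if header != PDF_MAGIC:
            continue  # not recognised as a PDF; leave it alone
        target = path.with_name(path.name + ".pdf")
        if target.exists():
            continue  # never overwrite an existing file
        path.rename(target)
        renamed.append(target)
    return renamed
```

Since wget -r saves into a directory named after the host, calling rename_extensionless_pdfs("www.websitehere.com") from the download location should pick up the files. Extensionless files that are not PDFs are left untouched, so they can be reviewed or deleted separately. Some PDFs have stray bytes before the signature, and this strict check would miss those.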