How to download all files (but not HTML) from a website using wget?

People also ask

How do I download all files from wget?

When you use the -r or --recursive option, wget downloads all files and folders recursively, without any filters. If you don't want to download specific files, exclude them with the -R or --reject option, followed by a comma-separated list of file name suffixes or patterns. To exclude directories, use -X or --exclude-directories.

How do you wget multiple files?

If you want to download multiple files at once, use the -i option followed by the path to a local or external file containing a list of the URLs to be downloaded. Each URL needs to be on a separate line. If you specify - as a filename, URLs will be read from the standard input.
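As a minimal sketch of the -i option described above, the following writes a URL list and hands it to wget. The function names and the WGET variable are illustrative assumptions, not part of wget; WGET is only a hook for substituting another command when you want a dry run.

```shell
#!/bin/sh
# WGET is a hypothetical override hook: point it at another command
# to inspect the arguments instead of downloading anything.
WGET=${WGET:-wget}

# Write one URL per line to the file named by $1.
# The remaining arguments are the URLs.
write_url_list() {
    list_file=$1
    shift
    printf '%s\n' "$@" > "$list_file"
}

# Download every URL listed in the file named by $1.
fetch_from_list() {
    "$WGET" -i "$1"
}

# Download URLs read from standard input ("-" means stdin).
fetch_from_stdin() {
    "$WGET" -i -
}
```

For example, calling write_url_list urls.txt with a few URLs and then fetch_from_list urls.txt downloads each of them; piping a filtered list into fetch_from_stdin does the same without a file. The file name urls.txt is a placeholder.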

To filter for specific file extensions:

wget -A pdf,jpg -m -p -E -k -K -np http://site/path/

Or, if you prefer long option names:

wget --accept pdf,jpg --mirror --page-requisites --adjust-extension --convert-links --backup-converted --no-parent http://site/path/

This will mirror the site, but files without a jpg or pdf extension are deleted automatically after download. wget still has to fetch the HTML pages in order to follow their links.

This downloaded the entire website for me:

wget --no-clobber --convert-links --random-wait -r -p -E -e robots=off -U mozilla http://site/path/

wget -m -p -E -k -K -np http://site/path/

The man page will tell you what those options do.

wget only follows links. If no page it visits links to a file, wget does not know the file exists and will not download it. In other words, it helps if all files are linked from web pages or from directory indexes.
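Because wget discovers files only through links, it can help to list what a crawl would reach before downloading anything. This is a rough sketch, assuming wget's default English log format, in which each request is logged on a line starting with --DATE TIME-- followed by the URL. The function names are illustrative and not part of wget.

```shell
#!/bin/sh
# Print the unique URLs found in a wget log read from standard input.
# Assumes request lines look like:
#   --2024-01-01 12:00:00--  http://site/path/file.zip
extract_urls() {
    awk '/^--[0-9]/ { print $3 }' | sort -u
}

# Crawl one level below the given URL in spider mode (nothing is kept
# on disk) and list every URL wget requested. wget logs to stderr,
# so stderr is redirected into the pipe.
list_reachable() {
    wget --spider -r -l 1 -np "$1" 2>&1 | extract_urls
}
```

Any file missing from the output of list_reachable is not linked from the pages wget visited, so a recursive download starting at the same URL would not fetch it either.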

I was trying to download zip files linked from Omeka's themes page - pretty similar task. This worked for me:

wget -A zip -r -l 1 -nd http://omeka.org/add-ons/themes/

-A: only accept zip files
-r: recurse
-l 1: one level deep (i.e., only files directly linked from this page)
-nd: don't create a directory structure, just download all the files into this directory.

The answers that use -k, -K, -E and similar options probably haven't really understood the question, as those options are for rewriting HTML pages to make a local structure, renaming .php files and so on. They are not relevant here.

To literally get all files except .html etc:

wget -R html,htm,php,asp,jsp,js,py,css -r -l 1 -nd http://yoursite.com
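The reject list is just a comma-separated string, so it can be built from separate arguments. A small sketch, assuming the same options as the command above; the function names and the WGET override variable are hypothetical helpers, not part of wget.

```shell
#!/bin/sh
# WGET is a hypothetical override hook for dry runs.
WGET=${WGET:-wget}

# Join the arguments with commas: html htm php -> html,htm,php.
# The body runs in a subshell so the IFS change does not leak out.
join_commas() (
    IFS=,
    printf '%s\n' "$*"
)

# Fetch files linked one level deep from the URL in $1, rejecting the
# extensions given as the remaining arguments, into the current directory.
fetch_except() {
    url=$1
    shift
    "$WGET" -R "$(join_commas "$@")" -r -l 1 -nd "$url"
}
```

Calling fetch_except http://yoursite.com html htm php asp jsp js py css is then equivalent to the one-line command above.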

You may try:

wget --user-agent=Mozilla --content-disposition --mirror --convert-links -E -K -p http://example.com/

Also you can add:

-A pdf,ps,djvu,tex,doc,docx,xls,xlsx,gz,ppt,mp4,avi,zip,rar

to accept the specific extensions, or to reject only specific extensions:

-R html,htm,asp,php

or to exclude specific directories:

-X "search*,forum*"

If the files are blocked for robots (e.g. search engines) by robots.txt, you also have to add: -e robots=off

Try this. It always works for me:

wget --mirror -p --convert-links -P ./LOCAL-DIR WEBSITE-URL

wget -m -A '*' -pk -e robots=off www.mysite.com/

This downloads all types of files locally, rewrites the links in the HTML files to point to the local copies, and ignores the robots.txt file.

