 

How to download all files (but not HTML) from a website using wget?

People also ask

How do I download all files from wget?

When you use the -r or --recursive option with wget, it downloads all files and folders recursively, without any filters. If you don't want to download specific files or folders, you can exclude them with the -R or --reject option, followed by the file or folder names to be excluded.
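
For example, a minimal sketch of a recursive download that skips ZIP and ISO files (example.com is just a placeholder):

wget -r -R zip,iso http://example.com/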

How do you wget multiple files?

If you want to download multiple files at once, use the -i option followed by the path to a local or external file containing a list of the URLs to be downloaded. Each URL needs to be on a separate line. If you specify - as a filename, URLs will be read from the standard input.
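
A quick sketch, assuming a local file urls.txt with one URL per line:

wget -i urls.txt

or, reading the list from standard input:

cat urls.txt | wget -i -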


To filter for specific file extensions:

wget -A pdf,jpg -m -p -E -k -K -np http://site/path/

Or, if you prefer long option names:

wget --accept pdf,jpg --mirror --page-requisites --adjust-extension --convert-links --backup-converted --no-parent http://site/path/

This will mirror the site, but any files without a jpg or pdf extension will be automatically removed.


This downloaded the entire website for me:

wget --no-clobber --convert-links --random-wait -r -p -E -e robots=off -U mozilla http://site/path/

wget -m -p -E -k -K -np http://site/path/

The wget man page will tell you what those options do.

wget will only follow links; if there is no link to a file from the index page, wget will not know about its existence and hence will not download it. In other words, it helps if all the files are linked to from web pages or directory indexes.
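
For example, if the server exposes an auto-generated directory index, something along these lines (the URL is a placeholder) downloads everything listed there while skipping the generated index pages:

wget -r -np -nH -R "index.html*" http://example.com/files/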


I was trying to download zip files linked from Omeka's themes page - pretty similar task. This worked for me:

wget -A zip -r -l 1 -nd http://omeka.org/add-ons/themes/
  • -A: only accept zip files
  • -r: recurse
  • -l 1: one level deep (i.e., only files directly linked from this page)
  • -nd: don't create a directory structure, just download all the files into this directory.

The answers recommending the -k, -K, -E etc. options probably haven't really understood the question, as those are for rewriting HTML pages to create a local structure, renaming .php files and so on. Not relevant.

To literally get all files except .html etc:

wget -R html,htm,php,asp,jsp,js,py,css -r -l 1 -nd http://yoursite.com

You may try:

wget --user-agent=Mozilla --content-disposition --mirror --convert-links -E -K -p http://example.com/

You can also add:

-A pdf,ps,djvu,tex,doc,docx,xls,xlsx,gz,ppt,mp4,avi,zip,rar

to accept only those specific extensions, or to reject specific extensions:

-R html,htm,asp,php

or to exclude specific directories:

-X "search*,forum*"

If the files are blocked by robots.txt (e.g. for search engines), you also have to add: -e robots=off
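
Putting those pieces together, a sketch (the URL and extension list are placeholders to adapt):

wget --user-agent=Mozilla --content-disposition --mirror --convert-links -E -K -p -A pdf,zip,docx -X "search*,forum*" -e robots=off http://example.com/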


Try this. It always works for me

wget --mirror -p --convert-links -P ./LOCAL-DIR WEBSITE-URL

wget -m -A '*' -pk -e robots=off www.mysite.com/

This will download all types of files locally, make the HTML files point to them, and ignore the robots file.