How to get a list of available files using wget or curl?

I'd like to know if it's possible to do an ls of a URL, so I can see what *.js files are available in a website, for example. Something like:

wget --list-files -A.js stackoverflow.com

and get

ajax/libs/jquery/1.7.1/jquery.min.js
js/full.js
js/stub.js
...
nachocab asked May 13 '12 11:05


2 Answers

You can't do the equivalent of an ls unless the server provides such listings itself. You could however retrieve index.html and then check for includes, e.g. something like

wget -O - http://www.example.com | grep "type=.\?text/javascript.\?"

Note that this relies on the HTML being formatted in a certain way: in this case, each include has to sit on its own line and carry a type attribute, which HTML5 allows pages to omit. If you want to do this properly, I'd recommend parsing the HTML and extracting the JavaScript includes that way.
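
As a rough sketch of a less layout-dependent variant, the function below joins the page into one line, isolates each script tag, and prints its src attribute. The name list_script_srcs is hypothetical, and the approach assumes quoted src attributes and a grep that supports -o (GNU and BSD grep do). It is still pattern matching, not real HTML parsing, so treat it as a stopgap.

```shell
# Hypothetical helper: print the src attribute of every <script> tag read from stdin.
list_script_srcs() {
  tr '\n' ' ' |                       # join lines so tags may span several lines
    grep -oiE '<script[^>]*>' |       # emit one opening <script ...> tag per line
    sed -nE 's/.*[[:space:]]src=["'\'']([^"'\'']*)["'\''].*/\1/p'  # keep only the src value
}

# Usage, assuming the page is reachable:
# wget -q -O - http://www.example.com | list_script_srcs
```

Inline scripts without a src attribute produce no output, and attributes such as data-src are ignored because the pattern requires whitespace before src.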

Lars Kotthoff answered Oct 11 '22 08:10


Let's use this open directory (http://tug.ctan.org/macros/latex2e/required/amscls/) for our experiment. It belongs to the Comprehensive TeX Archive Network, so the files are unlikely to be malicious.

Suppose we want to list all files whose extension is pdf in its doc/ subdirectory.

The command below saves the output of wget in the file main.log. Because wget sends a request for each file and prints some information about each request, we can then grep the output to get a list of the files in the specified directory. With --spider, wget only checks that each file exists instead of saving it.

wget \
  --accept '*.pdf' \
  --reject-regex '/\?C=[A-Z];O=[A-Z]$' \
  --execute robots=off \
  --recursive \
  --level=0 \
  --no-parent \
  --spider \
  'http://tug.ctan.org/macros/latex2e/required/amscls/doc/' 2>&1 | tee main.log

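Now we can list the files whose extension is pdf by using grep. Every request line in the log starts with --, so the output also includes the requests for the directory itself.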

grep '^--' main.log
--2020-11-23 10:39:46--  http://tug.ctan.org/macros/latex2e/required/amscls/doc/
--2020-11-23 10:39:47--  http://tug.ctan.org/macros/latex2e/required/amscls/doc/
--2020-11-23 10:39:47--  http://tug.ctan.org/macros/latex2e/required/amscls/doc/amsbooka.pdf
--2020-11-23 10:39:47--  http://tug.ctan.org/macros/latex2e/required/amscls/doc/amsclass.pdf
--2020-11-23 10:39:47--  http://tug.ctan.org/macros/latex2e/required/amscls/doc/amsdtx.pdf
--2020-11-23 10:39:47--  http://tug.ctan.org/macros/latex2e/required/amscls/doc/amsmidx.pdf
--2020-11-23 10:39:48--  http://tug.ctan.org/macros/latex2e/required/amscls/doc/amsthdoc.pdf
--2020-11-23 10:39:48--  http://tug.ctan.org/macros/latex2e/required/amscls/doc/thmtest.pdf
--2020-11-23 10:39:48--  http://tug.ctan.org/macros/latex2e/required/amscls/doc/upref.pdf
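
If only the bare URLs are wanted, the request lines can be trimmed further. The following is a minimal sketch, assuming the log has the layout shown above (timestamp in the first two fields, URL in the third); the function name list_pdf_urls is hypothetical.

```shell
# Hypothetical helper: print the unique .pdf URLs requested in a wget log.
list_pdf_urls() {
  grep '^--' "$1" |        # keep only the request lines
    awk '{ print $3 }' |   # the third field is the URL
    grep '\.pdf$' |        # drop the requests for the directory itself
    sort -u                # remove duplicates
}

# Usage, assuming main.log was produced by the wget command above:
# list_pdf_urls main.log
```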

Note that we could also list all files in the directory and then run grep on the output of the command. However, that would take more time, since wget apparently sends a request for each file. With --accept, wget sends requests only for the files we are interested in.

Last but not least, the sizes of the files are also recorded in main.log, so you can look them up there.
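
One way to pull the sizes out is sketched below. It assumes that wget reports each size on a line beginning with "Length: " somewhere after the matching request line, which is how wget usually logs a response but is not shown in the output above; the function name list_sizes_from_log is hypothetical.

```shell
# Hypothetical helper: print "size<TAB>url" for each request in a wget log.
# Assumes a "Length: <bytes> ..." line follows each "--<timestamp>--  <url>" line.
list_sizes_from_log() {
  awk '
    /^--/ { url = $3 }                # request line: remember the URL
    /^Length: / && url != "" {        # size line for the remembered URL
      print $2 "\t" url
      url = ""
    }
  ' "$1"
}

# Usage, assuming main.log was produced by the wget command above:
# list_sizes_from_log main.log
```

When the server does not report a size, the second field of that line is printed as-is instead of a byte count.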

doltes answered Oct 11 '22 07:10