
Getting a list of files on a web server

Tags: http, curl, wget

All,

I would like to get a list of files off of a server with the full URL intact. For example, I would like to get all the TIFFs from here:

http://hyperquad.telascience.org/naipsource/Texas/20100801/*

I can download all the .tif files with wget, but what I am looking for is just the full URL to each file, like this:

http://hyperquad.telascience.org/naipsource/Texas/20100801/naip10_1m_2597_04_2_20100430.tif
http://hyperquad.telascience.org/naipsource/Texas/20100801/naip10_1m_2597_04_3_20100424.tif
http://hyperquad.telascience.org/naipsource/Texas/20100801/naip10_1m_2597_04_4_20100430.tif
http://hyperquad.telascience.org/naipsource/Texas/20100801/naip10_1m_2597_05_1_20100430.tif
http://hyperquad.telascience.org/naipsource/Texas/20100801/naip10_1m_2597_05_2_20100430.tif

Any thoughts on how to get all these files into a list using something like curl or wget?

Adam

asked Aug 08 '11 by aeupinhere

2 Answers

You'd need the server to be willing to give you a page with a listing on it. Normally this would be an index.html, or you can simply request the directory itself:

http://hyperquad.telascience.org/naipsource/Texas/20100801/
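Since that directory URL does serve an auto-generated index page, one rough sketch is to fetch it with curl and pull the .tif links out of the HTML. This assumes the listing uses plain relative links to the filenames, which is typical of Apache-style indexes but worth checking against the actual page:

# sketch: list .tif links by scraping the auto-index page (assumes relative hrefs)
curl -s http://hyperquad.telascience.org/naipsource/Texas/20100801/ |
  grep -o 'href="[^"]*\.tif"' |
  sed 's/^href="//; s/"$//' |
  sed 's|^|http://hyperquad.telascience.org/naipsource/Texas/20100801/|'

The grep keeps only the .tif link targets, the first sed strips the href="..." wrapper, and the last sed prepends the directory URL to turn the relative names back into full URLs.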

It looks like you're in luck in this case, so, at the risk of upsetting the webmaster, the solution would be to use wget's recursive option. Specify a maximum recursion depth of 1 to keep it constrained to that single directory.
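A rough sketch of that, using --spider so nothing actually gets downloaded (the exact flag set is a guess at what this server will tolerate):

# recurse one level, accept only .tif names, and scrape the full URLs out of wget's log
wget --spider -r -l 1 -np -A '*.tif' http://hyperquad.telascience.org/naipsource/Texas/20100801/ 2>&1 |
  grep -o 'http://[^ ]*\.tif' | sort -u

wget writes its progress log to stderr, hence the 2>&1; sort -u collapses the duplicate mentions each URL gets in that log.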

answered by Richard Corfield


I would use the lynx text-mode web browser to get the list of links, plus the grep and awk shell tools to filter the results, like this:

lynx -dump -listonly <URL> | grep http | grep <regexp> | awk '{print $2}'

..where:

  • URL - is the start URL, in your case: http://hyperquad.telascience.org/naipsource/Texas/20100801/
  • regexp - is the regular expression that selects only files that interest you, in your case: \.tif$


Complete example command line to get links to the TIF files on this SO page:

lynx -dump -listonly http://stackoverflow.com/questions/6989681/getting-a-list-of-files-on-a-web-server | grep http | grep '\.tif$' | awk '{print $2}'

..now returns:

http://hyperquad.telascience.org/naipsource/Texas/20100801/naip10_1m_2597_04_2_20100430.tif
http://hyperquad.telascience.org/naipsource/Texas/20100801/naip10_1m_2597_04_4_20100430.tif
http://hyperquad.telascience.org/naipsource/Texas/20100801/naip10_1m_2597_05_2_20100430.tif
answered by Greg Dubicki