
wget: obtaining files matching regex

According to the wget man page, --accept-regex is the option to use when I need to selectively transfer files whose names match a certain regular expression. However, I am not sure how to use --accept-regex.

Suppose I want to obtain the files diffs-000107.tar.gz, diffs-000114.tar.gz, diffs-000121.tar.gz and diffs-000128.tar.gz in the IMDB data directory ftp://ftp.fu-berlin.de/pub/misc/movies/database/diffs/. The regex "diffs\-0001[0-9]{2}\.tar\.gz" seems adequate to describe the file names.

However, when executing the following wget command

wget -r --accept-regex='diffs\-0001[0-9]{2}\.tar\.gz' ftp://ftp.fu-berlin.de/pub/misc/movies/database/diffs/

wget indiscriminately acquires all files in the ftp://ftp.fu-berlin.de/pub/misc/movies/database/diffs/ directory.

Could anyone tell me what I have done wrong?

Zhongjun 'Mark' Jin asked Apr 21 '17 05:04



1 Answer

Be careful: --accept-regex is matched against the complete URL, whereas our target is a set of specific file names. So we will use -A instead.

For example,

wget -r -np -nH -A "IMG[012][0-9].jpg" http://x.com/y/z/ 

will download all the files from IMG00.jpg to IMG29.jpg from the URL.

Note that a matching pattern for -A contains shell-like wildcards, e.g. ‘books*’ or ‘zelazny*196[0-9]*’, rather than a regular expression.
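
Applied to the files from the question, the same idea might look like the sketch below. It is only an illustration: matches_glob is a hypothetical helper (not part of wget) that uses the shell's own case matching to show how a wildcard pattern of the kind -A expects treats a few file names, and the pattern diffs-0001[0-9][0-9].tar.gz is my assumed glob equivalent of the regex in the question. The actual wget call is left commented out because it needs network access, and I have not run it against that server.

```shell
# Hypothetical helper: succeeds if the file name in $1 matches the
# shell-style wildcard pattern in $2 (the kind of pattern -A takes).
matches_glob() {
  case "$1" in
    $2) return 0 ;;
    *)  return 1 ;;
  esac
}

# Assumed glob equivalent of the regex diffs\-0001[0-9]{2}\.tar\.gz
pattern='diffs-0001[0-9][0-9].tar.gz'

for name in diffs-000107.tar.gz diffs-000128.tar.gz diffs-000207.tar.gz; do
  if matches_glob "$name" "$pattern"; then
    echo "accept $name"
  else
    echo "reject $name"
  fi
done

# The corresponding download (commented out, requires network access):
# wget -r -np -nH -A "$pattern" ftp://ftp.fu-berlin.de/pub/misc/movies/database/diffs/
```

Wildcards have no {2} repetition, so the two digits are written out as [0-9][0-9], and the dot needs no escaping because it is a literal character in a shell pattern.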

References:
wget manual: https://www.gnu.org/software/wget/manual/wget.html
regex: https://regexone.com/

Yuchao Jiang answered Oct 21 '22 13:10