Using wget to download select directories from an FTP server

Tags: linux, unix, wget, ubuntu, ftp

I'm trying to understand how to use wget to download specific directories from several FTP sites that host economic data from the US government.

As a simple example, I know that I can download an entire directory using a command like:

wget --timestamping --recursive --no-parent ftp://ftp.bls.gov/pub/special.requests/cew/2013/county/

But I envision running more complex downloads, where I might want to limit a download to a handful of directories. So I've been looking at the --include-directories (-I) option, but I don't really understand how it works. Specifically, why doesn't this work:

wget --timestamping --recursive -I /pub/special.requests/cew/2013/county/ ftp://ftp.bls.gov/pub/special.requests/cew/

The following does work, in the sense that it downloads files, but it downloads way more than I need (everything in the 2013 directory, vs just the county subdirectory):

wget --timestamping --recursive -I /pub/special.requests/cew/2013/ ftp://ftp.bls.gov/pub/special.requests/cew/

I can't tell if I'm misunderstanding something about wget or if my issue is with something more fundamental about how FTP servers are structured.

Thanks for the help!


asked Dec 23 '13 21:12

Al R.

1 Answer

Based on the wget manual (linked at the end of this answer), it seems that the filtering functions of wget are very limited.

With the --recursive option, wget follows every link (on an FTP server, every entry in a directory listing) that passes the active filters, such as --no-parent, -I, -X, -A and -R.
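To make the directory filter concrete, here is a minimal sketch of how an -I check can be modeled. It is a simplified model of the behavior described in the manual, not wget's actual code, and the function names strip_slash and is_included are made up for illustration.

```shell
#!/bin/sh
# Simplified model (an assumption, not wget's real implementation):
# a directory is followed only if it equals an included directory
# or lies somewhere beneath it.

# strip_slash PATH: drop one trailing slash so "/a/b/" and "/a/b" compare equal.
strip_slash() {
    case "$1" in
        /)  printf '%s\n' "/" ;;
        */) printf '%s\n' "${1%/}" ;;
        *)  printf '%s\n' "$1" ;;
    esac
}

# is_included DIR INCLUDE: succeed if DIR is INCLUDE itself or a subdirectory of it.
is_included() {
    dir=$(strip_slash "$1")
    inc=$(strip_slash "$2")
    case "$dir" in
        "$inc"|"$inc"/*) return 0 ;;
        *)               return 1 ;;
    esac
}

include=/pub/special.requests/cew/2013/county/
for d in /pub/special.requests/cew/2013 /pub/special.requests/cew/2013/county
do
    if is_included "$d" "$include"; then
        printf 'follow %s\n' "$d"
    else
        printf 'skip %s\n' "$d"
    fi
done
# prints:
#   skip /pub/special.requests/cew/2013
#   follow /pub/special.requests/cew/2013/county
```

Under this model the parent directory 2013 is skipped when only 2013/county is included, while anything beneath an included directory is followed.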

In your example:

wget -r -I /pub/special.requests/cew/2013/county/ ftp://ftp.bls.gov/pub/special.requests/cew/

This won't download anything, because the -I option tells wget to follow only directories under /pub/special.requests/cew/2013/county/. The listing for /pub/special.requests/cew/ contains 2013/ but nothing that matches that path, so wget never descends into 2013/ and the download stops there. This will work though:

wget -r -I /pub/special.requests/cew/2013/county/ ftp://ftp.bls.gov/pub/special.requests/cew/2013/

... because in this case the listing for /pub/special.requests/cew/2013/ contains county/ directly, and that directory matches the -I path.
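If the goal is a handful of separate directories, one option is to start a separate recursive download at each directory, reusing the --no-parent command from the question. A rough sketch, assuming that approach fits your case: fetch_dir and DRY_RUN are hypothetical names, and only the county URL comes from the question.

```shell
#!/bin/sh
# Sketch: one recursive wget per wanted directory.
# DRY_RUN defaults to 1, so the commands are printed instead of executed.
# Run with DRY_RUN=0 to actually download.

DRY_RUN=${DRY_RUN:-1}

# fetch_dir URL: download URL and everything below it, never ascending to the parent.
fetch_dir() {
    if [ "$DRY_RUN" = 1 ]; then
        printf 'wget --timestamping --recursive --no-parent %s\n' "$1"
    else
        wget --timestamping --recursive --no-parent "$1"
    fi
}

# Only this URL is taken from the question; add further directory URLs to the list.
for url in \
    ftp://ftp.bls.gov/pub/special.requests/cew/2013/county/
do
    fetch_dir "$url"
done
```

Starting at each wanted directory sidesteps the -I matching issue, because wget never has to pass through a parent directory that the filter would reject.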

Btw, the online manual has more detail than the man page:

http://www.gnu.org/software/wget/manual/html_node/

answered Oct 02 '22 11:10

janos
