I'm trying to download a large number of files that all share a common string (DEM
) from an FTP sever. These files are nested inside multiple directories. For example, Adair/DEM*
and Adams/DEM*
The FTP sever is located here: ftp://ftp.igsb.uiowa.edu/gis_library/counties/
and requires no username and password.
So, I'd like to go through each county and download the files containing the string DEM
.
I've read many questions here on Stack Overflow and the documentation from Python, but cannot figure out how to use ftplib.FTP()
to get into the site without a username and password (which is not required), and I can't figure out how to grep or use glob.glob
inside of ftplib or urllib.
Thanks in advance for your help
Ok, seems to work. There may be issues if trying to download a directory, or scan a file. Exception handling may come handy to trap wrong filetypes and skip.
glob.glob
cannot work since you're on a remote filesystem, but you can use fnmatch
to match the names
Here's the code: it download all files matching *DEM*
in TEMP directory, sorting by directory.
import ftplib,sys,fnmatch,os
output_root = os.getenv("TEMP")
fc = ftplib.FTP("ftp.igsb.uiowa.edu")
fc.login()
fc.cwd("/gis_library/counties")
root_dirs = fc.nlst()
for l in root_dirs:
sys.stderr.write(l + " ...\n")
#print(fc.size(l))
dir_files = fc.nlst(l)
local_dir = os.path.join(output_root,l)
if not os.path.exists(local_dir):
os.mkdir(local_dir)
for f in dir_files:
if fnmatch.fnmatch(f,"*DEM*"): # cannot use glob.glob
sys.stderr.write("downloading "+l+"/"+f+" ...\n")
local_filename = os.path.join(local_dir,f)
with open(local_filename, 'wb') as fh:
fc.retrbinary('RETR '+ l + "/" + f, fh.write)
fc.close()
The answer by @Jean with the local pattern matching is the correct portable solution adhering to FTP standards.
Though as most FTP servers do support non-standard wildcard use with file listing commands, you can almost always use a simpler and mainly more efficient solution like:
files = ftp.nlst("*DEM*")
for f in files:
with open(f, 'wb') as fh:
ftp.retrbinary('RETR ' + f, fh.write)
You can use fsspec
s FTPFileSystem
for convenient globbing on an FTP server:
import fsspec.implementations.ftp
ftpfs = fsspec.implementations.ftp.FTPFileSystem("ftp.ncdc.noaa.gov")
files = ftpfs.glob("/pub/data/swdi/stormevents/csvfiles/*1985*")
print(files)
contents = ftpfs.cat(files[0])
print(contents[:100])
Result:
['/pub/data/swdi/stormevents/csvfiles/StormEvents_details-ftp_v1.0_d1985_c20160223.csv.gz', '/pub/data/swdi/stormevents/csvfiles/StormEvents_fatalities-ftp_v1.0_d1985_c20160223.csv.gz', '/pub/data/swdi/stormevents/csvfiles/StormEvents_locations-ftp_v1.0_d1985_c20160223.csv.gz']
b'\x1f\x8b\x08\x08\xcb\xd8\xccV\x00\x03StormEvents_details-ftp_v1.0_d1985_c20160223.csv\x00\xd4\xfd[\x93\x1c;r\xe7\x8b\xbe\x9fOA\xe3\xd39f\xb1h\x81[\\\xf8\x16U\x95\xac\xca\xc5\xacL*3\x8b\xd5\xd4\x8bL\xd2\xb4\x9d'
A nested search also works, for example, nested_files = ftpfs.glob("/pub/data/swdi/stormevents/**1985*")
, but it can be quite slow.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With