Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Download files from an FTP server containing given string using Python

Tags:

python

ftp

ftplib

I'm trying to download a large number of files that all share a common string (DEM) from an FTP sever. These files are nested inside multiple directories. For example, Adair/DEM* and Adams/DEM*

The FTP sever is located here: ftp://ftp.igsb.uiowa.edu/gis_library/counties/ and requires no username and password. So, I'd like to go through each county and download the files containing the string DEM.

I've read many questions here on Stack Overflow and the documentation from Python, but cannot figure out how to use ftplib.FTP() to get into the site without a username and password (which is not required), and I can't figure out how to grep or use glob.glob inside of ftplib or urllib.

Thanks in advance for your help

like image 387
geos Avatar asked Aug 14 '16 14:08

geos


3 Answers

Ok, seems to work. There may be issues if trying to download a directory, or scan a file. Exception handling may come handy to trap wrong filetypes and skip.

glob.glob cannot work since you're on a remote filesystem, but you can use fnmatch to match the names

Here's the code: it download all files matching *DEM* in TEMP directory, sorting by directory.

import ftplib,sys,fnmatch,os

output_root = os.getenv("TEMP")

fc = ftplib.FTP("ftp.igsb.uiowa.edu")
fc.login()
fc.cwd("/gis_library/counties")

root_dirs = fc.nlst()
for l in root_dirs:
    sys.stderr.write(l + " ...\n")
    #print(fc.size(l))
    dir_files = fc.nlst(l)
    local_dir = os.path.join(output_root,l)
    if not os.path.exists(local_dir):
        os.mkdir(local_dir)

    for f in dir_files:
        if fnmatch.fnmatch(f,"*DEM*"):   # cannot use glob.glob
            sys.stderr.write("downloading "+l+"/"+f+" ...\n")
            local_filename = os.path.join(local_dir,f)
            with open(local_filename, 'wb') as fh:
                fc.retrbinary('RETR '+ l + "/" + f, fh.write)

fc.close()
like image 125
Jean-François Fabre Avatar answered Oct 09 '22 13:10

Jean-François Fabre


The answer by @Jean with the local pattern matching is the correct portable solution adhering to FTP standards.

Though as most FTP servers do support non-standard wildcard use with file listing commands, you can almost always use a simpler and mainly more efficient solution like:

files = ftp.nlst("*DEM*")
for f in files:
    with open(f, 'wb') as fh:
        ftp.retrbinary('RETR ' + f, fh.write)
like image 33
Martin Prikryl Avatar answered Oct 09 '22 11:10

Martin Prikryl


You can use fsspecs FTPFileSystem for convenient globbing on an FTP server:

import fsspec.implementations.ftp
ftpfs = fsspec.implementations.ftp.FTPFileSystem("ftp.ncdc.noaa.gov")
files = ftpfs.glob("/pub/data/swdi/stormevents/csvfiles/*1985*")
print(files)
contents = ftpfs.cat(files[0])
print(contents[:100])

Result:

['/pub/data/swdi/stormevents/csvfiles/StormEvents_details-ftp_v1.0_d1985_c20160223.csv.gz', '/pub/data/swdi/stormevents/csvfiles/StormEvents_fatalities-ftp_v1.0_d1985_c20160223.csv.gz', '/pub/data/swdi/stormevents/csvfiles/StormEvents_locations-ftp_v1.0_d1985_c20160223.csv.gz']
b'\x1f\x8b\x08\x08\xcb\xd8\xccV\x00\x03StormEvents_details-ftp_v1.0_d1985_c20160223.csv\x00\xd4\xfd[\x93\x1c;r\xe7\x8b\xbe\x9fOA\xe3\xd39f\xb1h\x81[\\\xf8\x16U\x95\xac\xca\xc5\xacL*3\x8b\xd5\xd4\x8bL\xd2\xb4\x9d'

A nested search also works, for example, nested_files = ftpfs.glob("/pub/data/swdi/stormevents/**1985*"), but it can be quite slow.

like image 2
gerrit Avatar answered Oct 09 '22 13:10

gerrit