
Recursive list.files for FTP-Server

Tags: r, recursion, ftp

Is there an FTP version of list.files(path, recursive=TRUE)?

I want to get the URLs of all the ZIP archives in the subdirectories of this FTP server:

url <- "ftp://ftp-cdc.dwd.de/pub/CDC/observations_germany/climate/hourly/"

So I want a list of all files in directories such as
ftp://ftp-cdc.dwd.de/pub/CDC/observations_germany/climate/hourly/wind/recent/ and
ftp://ftp-cdc.dwd.de/pub/CDC/observations_germany/climate/hourly/air_temperature/historical/, and so on.

With RCurl I managed to download the directory listing of this directory, but not a comprehensive list of all ZIP archives in all subdirectories. Any advice other than looping through the directories and fetching the listings one by one?

RCurl code so far:

dwd_dirlist <- function(url, full = TRUE){
  dir <- unlist(
    strsplit(
      getURL(url,
             ftp.use.epsv = FALSE,
             dirlistonly = TRUE),
      "\n")
    )
  if(full) dir <- paste0(url, dir)
  return(dir)
}
Rentrop asked Oct 27 '14 13:10




1 Answer

If you have the lftp utility installed on your system, you can use its find command to recursively list the files underneath a specified directory. The lftp documentation describes find near the top.

Unfortunately, as you can see from the documentation, and unlike the common Unix find utility, lftp's find command doesn't support very many options at all; only --max-depth and --list (for a long listing), so you can't use the -name, -regex, etc. predicates that the find utility normally provides. On the other hand, lftp does support a very unusual but powerful feature in that it allows you to pipe output to local tools, so you could, for example, pipe the find output to your local grep from inside the lftp command-line. Of course, there's nothing stopping you from grepping in a shell pipeline, or filtering back in Rland. Here's an example using an lftp pipeline (as you can see, a disadvantage of this approach is that the multiple levels of escaping get pretty convoluted):

url <- 'ftp://ftp-cdc.dwd.de/pub/CDC/observations_germany/climate/hourly/';
zips <- system(paste0('lftp ',url,' <<<\'find| grep "\\\\.zip$"; exit;\';'),intern=T);
zips;
##    [1] "./air_temperature/historical/stundenwerte_TU_00003_19500401_20110331_hist.zip"
##    [2] "./air_temperature/historical/stundenwerte_TU_00044_20070401_20141231_hist.zip"
##    [3] "./air_temperature/historical/stundenwerte_TU_00052_19760101_19880101_hist.zip"
##    [4] "./air_temperature/historical/stundenwerte_TU_00071_20091201_20141231_hist.zip"
##
## ... snip ...
##
## [6616] "./wind/recent/stundenwerte_FF_15207_akt.zip"
## [6617] "./wind/recent/stundenwerte_FF_15214_akt.zip"
## [6618] "./wind/recent/stundenwerte_FF_15444_akt.zip"
## [6619] "./wind/recent/stundenwerte_FF_15520_akt.zip"
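
One caveat with the pipeline above: the here-string (<<<) is a bash feature, while R's system() runs commands through /bin/sh, which on some systems is a shell that does not support it. A minimal sketch of one way around that, assuming lftp is on the PATH: pass the commands with lftp's -e option and do the filtering back in R. The helper names lftpFind and zipUrls are made up for this illustration, and the host in the example is a placeholder.

```r
## Hypothetical helper: run lftp's find via -e instead of a here-string.
## Needs lftp and network access, so it is only defined here, not called.
lftpFind <- function(url) {
    system2('lftp', c('-e', shQuote('find; exit'), shQuote(url)), stdout=TRUE);
};

## Hypothetical helper: keep only the .zip entries of a find listing and
## turn the relative paths ("./a/b.zip") into full URLs under base.
zipUrls <- function(base, found) {
    rel <- sub('^\\./', '', found[grepl('\\.zip$', found)]);
    paste0(sub('/*$', '/', base), rel); ## ensure exactly one trailing slash on base
};

## hand-written sample of find output, not fetched from the server
found <- c(
    './wind/recent/stundenwerte_FF_15207_akt.zip',
    './wind/recent/readme.txt',
    './air_temperature/'
);
res <- zipUrls('ftp://example.invalid/hourly', found);
```

Against the real server, zipUrls(url, lftpFind(url)) should give the full download URLs, but treat that as untested here.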

Also, just for the heck of it, if you want another approach, I've written a function that can parse the output of an ls -l listing using regular expressions, returning all fields in a data.frame. A simple modification allows it to work over ftp using lftp:

longListing <- function(url='',recursive=F,all=F) {
    ## returns a data.frame of long-listing fields
    ## requires lftp for ftp support

    ## validate arguments
    url <- as.character(url);
    if (length(url) != 1L) stop('url argument must have length 1.');
    recursive <- as.logical(recursive);
    if (length(recursive) != 1L) stop('recursive argument must have length 1.');
    all <- as.logical(all);
    if (length(all) != 1L) stop('all argument must have length 1.');

    ## shell-quote url (handles every embedded single quote), or leave empty for pwd if empty
    urlEsc <- if (url == '') '' else shQuote(url,type='sh');

    ## construct ls command with options; identical between local ls and lftp ls
    ## technically lftp ls doesn't require -l to get a long listing, but it accepts it
    lsCmd <- paste0('ls -l',if (recursive) ' -R',if (all) ' -A');

    ## run system command to get long-listing output lines
    if (substr(url,0L,6L) == 'ftp://') { ## ftp
        output <- system(paste0('lftp ',urlEsc,' <<<\'',lsCmd,'; exit;\';'),intern=T);
    } else { ## local
        output <- system(paste0(lsCmd,' ',urlEsc,';'),intern=T);
    }; ## end if

    ## define regexes for parsing the output
    ## note: accept question marks for items whose metadata cannot be read
    sp0RE <- '\\s*';
    sp1RE <- '\\s+';
    typeRE <- '([?dlcbps-])';
    rRE <- '([?r-])';
    wRE <- '([?w-])';
    xRE <- '([?xsStT-])';
    aclRE <- '([?+@]*)';
    permRE <- paste0(typeRE,rRE,wRE,xRE,rRE,wRE,xRE,rRE,wRE,xRE,aclRE);
    linksRE <- '(\\?|[0-9]+)';
    ocRE <- '[a-zA-Z_0-9.$+-]';
    ocsRE <- '[a-zA-Z_0-9 .$+-]'; ## badly-behaving names can have spaces; non-greedy will prevent excessive gobbling
    ownerRE <- paste0('(\\?|',ocRE,'|',ocRE,ocsRE,'*?',ocRE,')');
    groupRE <- ownerRE; ## same compatibility rules as owner
    sizeRE <- '(?:\\?|(?:([0-9]+),\\s*)?([0-9]+))'; ## major, minor for special files, plain size for rest
    monthRE <- '(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)';
    dayRE <- '([0-9]+)';
    timeRE <- '([0-9]{2}:[0-9]{2}|[0-9]+)'; ## could be year
    dtRE <- paste0('(?:\\?|',monthRE,sp1RE,dayRE,sp1RE,timeRE,')');
    nameRE <- '(.*?)'; ## make non-greedy to allow target to be captured, if present
    targetRE <- '(?:\\s+->\\s+(.*))?'; ## target is optional; shown on some platforms, e.g. Cygwin
    recordRE <- paste0(
        '^'
        ,permRE,sp1RE
        ,linksRE,sp1RE
        ,ownerRE,sp1RE
        ,groupRE,sp1RE
        ,sizeRE,sp1RE
        ,dtRE,sp1RE
        ,nameRE,targetRE ## target is optional; targetRE defines its own whitespace separation
        ,sp0RE,'$' ## ignore trailing whitespace
    );

    ## get indexes of listing records
    recordIndexes <- grep(recordRE,output);

    ## get indexes of blanks and directory headers for maximally robust matching
    blankIndexes <- grep('^\\s*$',output);
    headerIndexes <- grep(':$',output); ## questionable specificity

    ## pare headers down to those with preceding blank
    headerIndexes <- headerIndexes[(headerIndexes-1)%in%c(0L,blankIndexes)]; ## include zero for possible first-line header

    ## match recordIndexes into headerIndexes to look up parent path; direct children will be zero
    recordHeaderIndexes <- findInterval(recordIndexes,headerIndexes);

    ## derive parent paths with trailing slash, or empty string for direct children
    parentPaths <- c('',sub(':','/',output[headerIndexes]))[recordHeaderIndexes+1L];
    parentPaths <- sub('^\\./','',parentPaths); ## for aesthetics

    ## match record lines and extract capture groups
    reg <- regmatches(output[recordIndexes],regexec(recordRE,output[recordIndexes]));

    ## build data.frame with reg fields
    ret <- data.frame(type=sapply(reg,`[`,2L),stringsAsFactors=F); ## start with type to set the row count
    i <- 3L;
    ## note: size is actually minor for character- and block-special files
    for (cn in c('ur','uw','ux','gr','gw','gx','or','ow','ox','acl','links','owner','group','major','size','month','day','time','path','target')) {
        ret[[cn]] <- sapply(reg,`[`,i);
        i <- i+1L;
    }; ## end for

    ## prepend parent paths to listing paths
    ret$path <- paste0(parentPaths,ret$path);

    ret;

}; ## end longListing()
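
The least obvious step in the function is probably the parent-path lookup with findInterval(). Here is that step in isolation, as a sketch on a few hand-written lines of ls -lR style output. The sample lines and the simplified record regex '^[d-]' are assumptions for illustration only; the function itself uses the full recordRE.

```r
## made-up recursive listing: two top-level records, then one subdirectory section
output <- c(
    'drwxr-xr-x 2 u g 4096 Jan  1 00:00 sub',
    '-rw-r--r-- 1 u g   10 Jan  1 00:00 top.txt',
    '',
    './sub:',
    '-rw-r--r-- 1 u g   20 Jan  1 00:00 a.zip'
);

recordIndexes <- grep('^[d-]',output); ## simplified stand-in for recordRE
headerIndexes <- grep(':$',output);    ## directory headers such as "./sub:"

## for each record, the number of headers at or before it; zero means top level
recordHeaderIndexes <- findInterval(recordIndexes,headerIndexes);

## prepend '' so that index zero maps to "no parent"
parentPaths <- c('',sub(':','/',output[headerIndexes]))[recordHeaderIndexes+1L];
parentPaths <- sub('^\\./','',parentPaths);

## last whitespace-separated field is the name in this simplified sample
paths <- paste0(parentPaths,sub('^.* ','',output[recordIndexes]));
```

Each record picks up the most recent header above it, so a.zip becomes sub/a.zip while the two top-level entries keep their bare names.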

Here's a demo of it on a directory of special files I created on my system:

longListing();
##    type ur uw ux gr gw gx or ow ox acl links owner group major size month day  time                      path            target
## 1     d  r  w  x  r  -  -  r  -  -   +     1  user  None          0   Feb  27 08:21                       dir
## 2     d  r  w  x  r  w  x  r  w  x   +     1  user  None          0   Feb  27 08:21        dir-other-writable
## 3     d  r  w  x  r  -  -  r  -  T   +     1  user  None          0   Feb  27 08:21                dir-sticky
## 4     d  r  w  x  r  w  x  r  w  t   +     1  user  None          0   Feb  27 08:21 dir-sticky-other-writable
## 5     -  r  w  -  r  -  -  r  -  -         2  user  None          0   Feb  27 08:21                      file
## 6     -  r  w  -  r  -  -  r  -  -         1  user  None          0   Feb  27 08:21          file-archive.tar
## 7     -  r  w  -  r  -  -  r  -  -         1  user  None          0   Feb  27 08:21            file-audio.mp3
## 8     b  r  w  -  r  w  -  r  w  -         1  user  None     0    1   Feb  27 08:21        file-block-special
## 9     c  r  w  -  r  w  -  r  w  -         1  user  None     0    1   Feb  27 08:21    file-character-special
## 10    -  r  w  x  r  w  x  r  w  x         1  user  None         12   Feb  27 08:21                  file-exe
## 11    p  r  w  -  r  w  -  r  w  -         1  user  None          0   Feb  27 08:21                 file-fifo
## 12    -  r  w  -  r  -  -  r  -  -         1  user  None          0   Feb  27 08:21            file-image.bmp
## 13    -  r  w  -  r  w  S  r  -  -         1  user  None          0   Feb  27 08:21               file-setgid
## 14    -  r  w  x  r  w  s  r  -  x         1  user  None          0   Feb  27 08:21           file-setgid-exe
## 15    -  r  w  S  r  w  -  r  -  -         1  user  None          0   Feb  27 08:21               file-setuid
## 16    -  r  w  s  r  w  x  r  -  x         1  user  None          0   Feb  27 08:21           file-setuid-exe
## 17    s  r  w  -  r  w  -  r  -  -         1  user  None          0   Feb  27 08:21               file-socket
## 18    l  r  w  x  r  w  x  r  w  x         1  user  None          4   Feb  27 08:21               ln-existing              file
## 19    -  r  w  -  r  -  -  r  -  -         2  user  None          0   Feb  27 08:21                   ln-hard
## 20    l  r  w  x  r  w  x  r  w  x         1  user  None         17   Feb  27 08:21           ln-non-existing file-non-existing

Demo on your site:

url <- 'ftp://ftp-cdc.dwd.de/pub/CDC/observations_germany/climate/hourly/';
ll <- longListing(url,T,T);
ll;
##      type ur uw ux gr gw gx or ow ox acl links owner   group major    size month day  time                                                                                                  path target
## 1       d  r  w  x  r  w  x  -  -  x         4 32230 ftp-dwd          4096   Jun   5  2014                                                                                       air_temperature
## 2       d  r  w  x  r  w  x  -  -  x         4 32230 ftp-dwd          4096   Sep  25  2014                                                                                            cloudiness
## 3       d  r  w  x  r  w  x  -  -  x         4 32230 ftp-dwd          4096   Nov  13  2014                                                                                         precipitation
## 4       d  r  w  x  r  w  x  -  -  x         4 32230 ftp-dwd          4096   Nov  13  2014                                                                                              pressure
## 5       d  r  w  x  r  w  x  -  -  x         4 32230 ftp-dwd          4096   Jun   5  2014                                                                                      soil_temperature
## 6       d  r  w  x  r  w  x  -  -  x         2 32230 ftp-dwd         12288   Dec  15 11:52                                                                                                 solar
## 7       d  r  w  x  r  w  x  -  -  x         4 32230 ftp-dwd          4096   Jun   5  2014                                                                                                   sun
## 8       d  r  w  x  r  w  x  -  -  x         4 32230 ftp-dwd          4096   Apr  17  2015                                                                                                  wind
## 9       d  r  w  x  r  w  x  -  -  x         2 32230 ftp-dwd        114688   Oct  15 12:35                                                                            air_temperature/historical
## 10      d  r  w  x  r  w  x  -  -  x         2 32230 ftp-dwd        151552   Dec   4 10:28                                                                                air_temperature/recent
## 11      -  r  w  -  r  w  -  -  -  -         1 32230 ftp-dwd         68727   Jan  26 09:55                air_temperature/historical/BESCHREIBUNG_obsgermany_climate_hourly_tu_historical_de.pdf
## 12      -  r  w  -  r  w  -  -  -  -         1 32230 ftp-dwd         68600   Jan  26 09:55                 air_temperature/historical/DESCRIPTION_obsgermany_climate_hourly_tu_historical_en.pdf
## 13      -  r  w  -  r  w  -  -  -  -         1 32230 ftp-dwd        123634   Mar  27  2015                                 air_temperature/historical/TU_Stundenwerte_Beschreibung_Stationen.txt
## 14      -  r  w  -  r  w  -  -  -  -         1 32230 ftp-dwd       2847045   Mar  27  2015                           air_temperature/historical/stundenwerte_TU_00003_19500401_20110331_hist.zip
## 15      -  r  w  -  r  w  -  -  -  -         1 32230 ftp-dwd        359517   Mar  27  2015                           air_temperature/historical/stundenwerte_TU_00044_20070401_20141231_hist.zip
##
## ... snip ...
##
## 6683    -  r  w  -  r  w  -  -  -  -         1 32230 ftp-dwd         65633   Feb  27 10:26                                                             wind/recent/stundenwerte_FF_15207_akt.zip
## 6684    -  r  w  -  r  w  -  -  -  -         1 32230 ftp-dwd         66910   Feb  27 10:21                                                             wind/recent/stundenwerte_FF_15214_akt.zip
## 6685    -  r  w  -  r  w  -  -  -  -         1 32230 ftp-dwd         64525   Feb  27 10:19                                                             wind/recent/stundenwerte_FF_15444_akt.zip
## 6686    -  r  w  -  r  w  -  -  -  -         1 32230 ftp-dwd         23717   Feb  27 10:21                                                             wind/recent/stundenwerte_FF_15520_akt.zip

You could extract just the zip file names easily:

zips <- ll$path[ll$type=='-' & grepl('\\.zip$',ll$path)];
length(zips);
## [1] 6619
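
Since ll is an ordinary data.frame, other summaries follow the same pattern. A small sketch, assuming the size column holds the byte count as a string (which is what the regmatches-based construction in longListing would produce). The name zipSizeByDir is invented here, and the sample rows are made up rather than taken from the server.

```r
## Hypothetical helper: total size in bytes of the .zip files per directory.
zipSizeByDir <- function(ll) {
    z <- ll[ll$type == '-' & grepl('\\.zip$', ll$path), ];
    ## size arrives as character, so convert before summing
    tapply(as.numeric(z$size), dirname(z$path), sum);
};

## made-up rows with the same column names longListing() returns
demo <- data.frame(
    type=c('d', '-', '-', '-'),
    size=c('4096', '10', '20', '5'),
    path=c('wind', 'wind/recent/a.zip', 'wind/recent/b.zip', 'wind/recent/readme.txt'),
    stringsAsFactors=FALSE
);
sizes <- zipSizeByDir(demo);
```

The result is a named vector with one entry per directory that contains at least one zip file.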
bgoldst answered Sep 29 '22 13:09
