Is there an FTP version of list.files(path, recursive=TRUE)?
I want to get all the URLs of the ZIP archives in the subdirectories of this FTP server:
url <- "ftp://ftp-cdc.dwd.de/pub/CDC/observations_germany/climate/hourly/"
So I want to get a list of all files in this directory:
ftp://ftp-cdc.dwd.de/pub/CDC/observations_germany/climate/hourly/wind/recent/
as well as
ftp://ftp-cdc.dwd.de/pub/CDC/observations_germany/climate/hourly/air_temperature/historical/
and so on.

With RCurl I managed to download the directory listing of this top-level directory, but not a comprehensive list of all ZIP archives in all subdirectories.

Any advice other than looping through the directories and getting the directory listings one by one?

RCurl code so far:
library(RCurl)

## List the entries of a single FTP directory; returns full URLs when full = TRUE.
dwd_dirlist <- function(url, full = TRUE){
    dir <- unlist(
        strsplit(
            getURL(url,
                   ftp.use.epsv = FALSE,
                   dirlistonly = TRUE),
            "\r?\n")  ## tolerate CRLF line endings, which some FTP servers send
    )
    if(full) dir <- paste0(url, dir)
    return(dir)
}
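For reference, the brute-force loop the question hopes to avoid could look like the following minimal recursive sketch built on dwd_dirlist(). It assumes, purely as a heuristic, that any entry whose name contains no dot is a subdirectory; that happens to hold for this server's layout but is not reliable in general:

## Hypothetical recursive wrapper around dwd_dirlist(); the
## "no dot in name => directory" test is a heuristic that fits
## this particular server only.
dwd_dirlist_recursive <- function(url){
    entries <- dwd_dirlist(url)
    isdir <- !grepl("\\.", basename(entries))
    c(entries[!isdir],
      unlist(lapply(paste0(entries[isdir], "/"), dwd_dirlist_recursive)))
}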
If you have the lftp utility installed on your system, then you can use its find command to recursively list files underneath a specified directory; in the lftp documentation, the description of find is near the top of the command list.

Unfortunately, as you can see from the documentation, and unlike the common Unix find utility, lftp's find command doesn't support many options; only --max-depth and --list (for a long listing), so you can't use the -name, -regex, etc. predicates that the find utility normally provides. On the other hand, lftp does support a very unusual but powerful feature in that it allows you to pipe output to local tools, so you could, for example, pipe the find output to your local grep from inside the lftp command line. Of course, there's nothing stopping you from grepping in a shell pipeline, or filtering back in R (see the sketch after the output below). Here's an example using an lftp pipeline (as you can see, a disadvantage of this approach is that the multiple levels of escaping get pretty convoluted):
url <- 'ftp://ftp-cdc.dwd.de/pub/CDC/observations_germany/climate/hourly/';
zips <- system(paste0('lftp ',url,' <<<\'find| grep "\\\\.zip$"; exit;\';'),intern=T);
zips;
## [1] "./air_temperature/historical/stundenwerte_TU_00003_19500401_20110331_hist.zip"
## [2] "./air_temperature/historical/stundenwerte_TU_00044_20070401_20141231_hist.zip"
## [3] "./air_temperature/historical/stundenwerte_TU_00052_19760101_19880101_hist.zip"
## [4] "./air_temperature/historical/stundenwerte_TU_00071_20091201_20141231_hist.zip"
##
## ... snip ...
##
## [6616] "./wind/recent/stundenwerte_FF_15207_akt.zip"
## [6617] "./wind/recent/stundenwerte_FF_15214_akt.zip"
## [6618] "./wind/recent/stundenwerte_FF_15444_akt.zip"
## [6619] "./wind/recent/stundenwerte_FF_15520_akt.zip"
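Alternatively, to sidestep one level of shell quoting, the grep can be dropped from the lftp pipeline and the filtering done back in R. A sketch, again assuming lftp is installed and that the shell invoked by system() accepts the <<< here-string, as in the command above:

## Run lftp's find without the embedded grep, then filter in R.
all_paths <- system(paste0('lftp ',url,' <<<\'find; exit;\';'),intern=T);
zips <- grep('\\.zip$',all_paths,value=T);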
Also, just for the heck of it, if you want another approach, I've written a function that can parse the output of an ls -l listing using regular expressions, returning all fields in a data.frame. A simple modification allows it to work over FTP using lftp:
longListing <- function(url='',recursive=F,all=F) {
    ## returns a data.frame of long-listing fields
    ## requires lftp for ftp support
    ## validate arguments
    url <- as.character(url);
    if (length(url) != 1L) stop('url argument must have length 1.');
    recursive <- as.logical(recursive);
    if (length(recursive) != 1L) stop('recursive argument must have length 1.');
    all <- as.logical(all);
    if (length(all) != 1L) stop('all argument must have length 1.');
    ## escape and single-quote url, or leave empty for pwd if empty
    urlEsc <- if (url == '') '' else paste0('\'',sub("'","'\\''",url),'\'');
    ## construct ls command with options; identical between local ls and lftp ls
    ## technically lftp ls doesn't require -l to get a long listing, but it accepts it
    lsCmd <- paste0('ls -l',if (recursive) ' -R',if (all) ' -A');
    ## run system command to get long-listing output lines
    if (substr(url,0L,6L) == 'ftp://') { ## ftp
        output <- system(paste0('lftp ',urlEsc,' <<<\'',lsCmd,'; exit;\';'),intern=T);
    } else { ## local
        output <- system(paste0(lsCmd,' ',urlEsc,';'),intern=T);
    }; ## end if
    ## define regexes for parsing the output
    ## note: accept question marks for items whose metadata cannot be read
    sp0RE <- '\\s*';
    sp1RE <- '\\s+';
    typeRE <- '([?dlcbps-])';
    rRE <- '([?r-])';
    wRE <- '([?w-])';
    xRE <- '([?xsStT-])';
    aclRE <- '([?+@]*)';
    permRE <- paste0(typeRE,rRE,wRE,xRE,rRE,wRE,xRE,rRE,wRE,xRE,aclRE);
    linksRE <- '(\\?|[0-9]+)';
    ocRE <- '[a-zA-Z_0-9.$+-]';
    ocsRE <- '[a-zA-Z_0-9 .$+-]'; ## badly-behaving names can have spaces; non-greedy will prevent excessive gobbling
    ownerRE <- paste0('(\\?|',ocRE,'|',ocRE,ocsRE,'*?',ocRE,')');
    groupRE <- ownerRE; ## same compatibility rules as owner
    sizeRE <- '(?:\\?|(?:([0-9]+),\\s*)?([0-9]+))'; ## major, minor for special files, plain size for rest
    monthRE <- '(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)';
    dayRE <- '([0-9]+)';
    timeRE <- '([0-9]{2}:[0-9]{2}|[0-9]+)'; ## could be year
    dtRE <- paste0('(?:\\?|',monthRE,sp1RE,dayRE,sp1RE,timeRE,')');
    nameRE <- '(.*?)'; ## make non-greedy to allow target to be captured, if present
    targetRE <- '(?:\\s+->\\s+(.*))?'; ## target is optional; shown on some platforms, e.g. Cygwin
    recordRE <- paste0(
        '^'
        ,permRE,sp1RE
        ,linksRE,sp1RE
        ,ownerRE,sp1RE
        ,groupRE,sp1RE
        ,sizeRE,sp1RE
        ,dtRE,sp1RE
        ,nameRE,targetRE ## target is optional; targetRE defines its own whitespace separation
        ,sp0RE,'$' ## ignore trailing whitespace
    );
    ## get indexes of listing records
    recordIndexes <- grep(recordRE,output);
    ## get indexes of blanks and directory headers for maximally robust matching
    blankIndexes <- grep('^\\s*$',output);
    headerIndexes <- grep(':$',output); ## questionable specificity
    ## pare headers down to those with preceding blank
    headerIndexes <- headerIndexes[(headerIndexes-1)%in%c(0L,blankIndexes)]; ## include zero for possible first-line header
    ## match recordIndexes into headerIndexes to look up parent path; direct children will be zero
    recordHeaderIndexes <- findInterval(recordIndexes,headerIndexes);
    ## derive parent paths with trailing slash, or empty string for direct children
    parentPaths <- c('',sub(':','/',output[headerIndexes]))[recordHeaderIndexes+1L];
    parentPaths <- sub('^\\./','',parentPaths); ## for aesthetics
    ## match record lines and extract capture groups
    reg <- regmatches(output[recordIndexes],regexec(recordRE,output[recordIndexes]));
    ## build data.frame with reg fields
    ret <- data.frame(type=sapply(reg,`[`,2L),stringsAsFactors=F); ## start with type to set the row count
    i <- 3L;
    ## note: size is actually minor for character- and block-special files
    for (cn in c('ur','uw','ux','gr','gw','gx','or','ow','ox','acl','links','owner','group','major','size','month','day','time','path','target')) {
        ret[[cn]] <- sapply(reg,`[`,i);
        i <- i+1L;
    }; ## end for
    ## prepend parent paths to listing paths
    ret$path <- paste0(parentPaths,ret$path);
    ret;
}; ## end longListing()
Here's a demo of it on a directory of special files I created on my system:
longListing();
## type ur uw ux gr gw gx or ow ox acl links owner group major size month day time path target
## 1 d r w x r - - r - - + 1 user None 0 Feb 27 08:21 dir
## 2 d r w x r w x r w x + 1 user None 0 Feb 27 08:21 dir-other-writable
## 3 d r w x r - - r - T + 1 user None 0 Feb 27 08:21 dir-sticky
## 4 d r w x r w x r w t + 1 user None 0 Feb 27 08:21 dir-sticky-other-writable
## 5 - r w - r - - r - - 2 user None 0 Feb 27 08:21 file
## 6 - r w - r - - r - - 1 user None 0 Feb 27 08:21 file-archive.tar
## 7 - r w - r - - r - - 1 user None 0 Feb 27 08:21 file-audio.mp3
## 8 b r w - r w - r w - 1 user None 0 1 Feb 27 08:21 file-block-special
## 9 c r w - r w - r w - 1 user None 0 1 Feb 27 08:21 file-character-special
## 10 - r w x r w x r w x 1 user None 12 Feb 27 08:21 file-exe
## 11 p r w - r w - r w - 1 user None 0 Feb 27 08:21 file-fifo
## 12 - r w - r - - r - - 1 user None 0 Feb 27 08:21 file-image.bmp
## 13 - r w - r w S r - - 1 user None 0 Feb 27 08:21 file-setgid
## 14 - r w x r w s r - x 1 user None 0 Feb 27 08:21 file-setgid-exe
## 15 - r w S r w - r - - 1 user None 0 Feb 27 08:21 file-setuid
## 16 - r w s r w x r - x 1 user None 0 Feb 27 08:21 file-setuid-exe
## 17 s r w - r w - r - - 1 user None 0 Feb 27 08:21 file-socket
## 18 l r w x r w x r w x 1 user None 4 Feb 27 08:21 ln-existing file
## 19 - r w - r - - r - - 2 user None 0 Feb 27 08:21 ln-hard
## 20 l r w x r w x r w x 1 user None 17 Feb 27 08:21 ln-non-existing file-non-existing
Demo on your site:
url <- 'ftp://ftp-cdc.dwd.de/pub/CDC/observations_germany/climate/hourly/';
ll <- longListing(url,T,T);
ll;
## type ur uw ux gr gw gx or ow ox acl links owner group major size month day time path target
## 1 d r w x r w x - - x 4 32230 ftp-dwd 4096 Jun 5 2014 air_temperature
## 2 d r w x r w x - - x 4 32230 ftp-dwd 4096 Sep 25 2014 cloudiness
## 3 d r w x r w x - - x 4 32230 ftp-dwd 4096 Nov 13 2014 precipitation
## 4 d r w x r w x - - x 4 32230 ftp-dwd 4096 Nov 13 2014 pressure
## 5 d r w x r w x - - x 4 32230 ftp-dwd 4096 Jun 5 2014 soil_temperature
## 6 d r w x r w x - - x 2 32230 ftp-dwd 12288 Dec 15 11:52 solar
## 7 d r w x r w x - - x 4 32230 ftp-dwd 4096 Jun 5 2014 sun
## 8 d r w x r w x - - x 4 32230 ftp-dwd 4096 Apr 17 2015 wind
## 9 d r w x r w x - - x 2 32230 ftp-dwd 114688 Oct 15 12:35 air_temperature/historical
## 10 d r w x r w x - - x 2 32230 ftp-dwd 151552 Dec 4 10:28 air_temperature/recent
## 11 - r w - r w - - - - 1 32230 ftp-dwd 68727 Jan 26 09:55 air_temperature/historical/BESCHREIBUNG_obsgermany_climate_hourly_tu_historical_de.pdf
## 12 - r w - r w - - - - 1 32230 ftp-dwd 68600 Jan 26 09:55 air_temperature/historical/DESCRIPTION_obsgermany_climate_hourly_tu_historical_en.pdf
## 13 - r w - r w - - - - 1 32230 ftp-dwd 123634 Mar 27 2015 air_temperature/historical/TU_Stundenwerte_Beschreibung_Stationen.txt
## 14 - r w - r w - - - - 1 32230 ftp-dwd 2847045 Mar 27 2015 air_temperature/historical/stundenwerte_TU_00003_19500401_20110331_hist.zip
## 15 - r w - r w - - - - 1 32230 ftp-dwd 359517 Mar 27 2015 air_temperature/historical/stundenwerte_TU_00044_20070401_20141231_hist.zip
##
## ... snip ...
##
## 6683 - r w - r w - - - - 1 32230 ftp-dwd 65633 Feb 27 10:26 wind/recent/stundenwerte_FF_15207_akt.zip
## 6684 - r w - r w - - - - 1 32230 ftp-dwd 66910 Feb 27 10:21 wind/recent/stundenwerte_FF_15214_akt.zip
## 6685 - r w - r w - - - - 1 32230 ftp-dwd 64525 Feb 27 10:19 wind/recent/stundenwerte_FF_15444_akt.zip
## 6686 - r w - r w - - - - 1 32230 ftp-dwd 23717 Feb 27 10:21 wind/recent/stundenwerte_FF_15520_akt.zip
You could extract just the zip file names easily:
zips <- ll$path[ll$type=='-' & grepl('\\.zip$',ll$path)];
length(zips);
## [1] 6619
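Finally, since the question asks for URLs, the server-relative paths can be turned into full download links by prepending the base url. A sketch: the leading "./" that lftp's find emits must be stripped first (the longListing() paths carry none, so the sub() is a no-op for them), and base R's download.file() can then fetch an archive:

## Build full URLs from the relative paths.
zip_urls <- paste0(url,sub('^\\./','',zips));
## e.g. fetch the first archive (binary mode for ZIP content):
## download.file(zip_urls[1],basename(zip_urls[1]),mode='wb');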