 

Get website directory listing in an R vector using RCurl

Tags: r, rcurl

I'm trying to get the list of files in a directory on a website. Is there a way to do this similar to the dir() or list.files() commands for local directory listing? I can connect to the website using RCurl (I need it because I need an SSL connection over HTTPS):

library(RCurl)
text <- getURL(*some https website*,
               ssl.verifypeer = FALSE,
               dirlistonly = TRUE)

But this returns the HTML of the directory page (images, hyperlinks, etc.) rather than a plain list of files; I just need an R vector of filenames, as you would get from dir(). Is this possible? Or would I have to parse the HTML to extract the filenames? That sounds like a complicated approach to a simple problem.

Thanks,

EDIT: if you can get it to work with http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeGencodeV7/ then you'll see what I mean.
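For illustration, the HTML-parsing route I was hoping to avoid would look roughly like this (a sketch only, assuming the XML package is available; the grepl() pattern is just a guess at separating the data files from the navigation links on that page):

library(RCurl)
library(XML)

# Fetch the directory index page
html <- getURL("http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeGencodeV7/",
               ssl.verifypeer = FALSE)
# Pull out all <a href> targets from the page
links <- getHTMLLinks(html)
# Guessed filter: keep entries that look like data files, drop parent/sort links
files <- links[grepl("^wgEncode", links)]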

asked May 22 '13 by FBC


3 Answers

This is the last example in the help file for getURL (with an updated URL):

url <- "ftp://speedtest.tele2.net/"
filenames <- getURL(url, ftp.use.epsv = FALSE, dirlistonly = TRUE)

# Deal with newlines as \n or \r\n. (BDR)
# Or alternatively, instruct libcurl to change \n's to \r\n's for us with crlf = TRUE:
# filenames <- getURL(url, ftp.use.epsv = FALSE, dirlistonly = TRUE, crlf = TRUE)
filenames <- paste(url, strsplit(filenames, "\r*\n")[[1]], sep = "")

Does that solve your problem?
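Wrapped up as a small helper (a sketch only; ftp_dir is my own name, not part of RCurl), this gives a dir()-style vector for any FTP directory:

library(RCurl)

# Hypothetical helper (not part of RCurl): list an FTP directory like dir()
ftp_dir <- function(url, full.names = FALSE) {
  listing <- getURL(url, ftp.use.epsv = FALSE, dirlistonly = TRUE)
  filenames <- strsplit(listing, "\r*\n")[[1]]
  if (full.names) paste0(url, filenames) else filenames
}

ftp_dir("ftp://speedtest.tele2.net/")

Note that dirlistonly is an FTP-level feature of libcurl; for the HTTPS index page in your question you would still need the HTML-parsing route.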

answered Nov 08 '22 by Dag Hjermann


Try this:

library(RCurl)

dir_list <-
  read.table(
    textConnection(
      getURLContent("ftp://[...]/")
    ),
    sep = "",
    strip.white = TRUE)

The resulting table splits the date across three text fields, but it is a good start, and you can get the filenames from it.
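To pull just the filenames out of that table (a guess which assumes the server returns a Unix-style ls -l listing, where the name is the last column):

# Assumes a Unix-style listing: the filename is the last column
filenames <- as.character(dir_list[[ncol(dir_list)]])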

answered Nov 07 '22 by Sfroehlich


I was reading an RCurl document and came across a new piece of code:

stockReader <- function()
{
  values <- numeric()  # the data is appended here as it is received
  # Function that appends the values to the centrally stored vector
  read <- function(chunk) {
    con <- textConnection(chunk)
    on.exit(close(con))
    tmp <- scan(con)
    values <<- c(values, tmp)
  }
  list(read = read,
       values = function() values)  # accessor to get the result on completion
}

followed by

reader <- stockReader()
getURL("http://www.omegahat.org/RCurl/stockExample.dat",
       write = reader$read)
reader$values()

The sample collects numeric values, but surely it can be adapted to collect filenames instead; a sketch of such an adaptation follows the quoted passage below. Read the document itself and I'm sure you will find what you're looking for.

It also says

The basic use of getURL(), getForm() and postForm() returns the contents of the requested document as a single block of text. It is accumulated by the libcurl facilities and combined into a single string. We then typically traverse the contents of the document to extract the information into regular data, e.g. vectors and data frames. For example, suppose the document we requested is a simple stream of numbers such as prices of a particular stock at different time points. We would download the contents of the file, and then read it into a vector in R so that we could analyze the values.

Unfortunately, this results in essentially two copies of the data residing in memory simultaneously. This can be prohibitive or at least undesirable for large datasets. An alternative approach is to process the data in chunks as it is received by libcurl. If we can be notified each time libcurl receives data from the reply and do something meaningful with the data, then we need not accumulate the chunks. The largest extra piece of information we will need to have is the largest chunk.

In our example, we could take each chunk and pass it to the scan() function to turn the values into a vector. Then we can concatenate this with the vector from the previously processed chunks.
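Coming back to the adaptation: a minimal sketch that collects character filenames instead of numbers (fileReader is my own name, not from the document; it assumes the listing arrives as whitespace-separated names, and, like the original sample, a name split across two chunks would be broken in two):

library(RCurl)

# Hypothetical adaptation of stockReader for filenames
fileReader <- function() {
  values <- character()  # filenames accumulate here as chunks arrive
  read <- function(chunk) {
    con <- textConnection(chunk)
    on.exit(close(con))
    values <<- c(values, scan(con, what = character(), quiet = TRUE))
  }
  list(read = read,
       values = function() values)  # accessor to get the result on completion
}

reader <- fileReader()
getURL("ftp://speedtest.tele2.net/", ftp.use.epsv = FALSE,
       dirlistonly = TRUE, write = reader$read)
reader$values()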

answered Nov 07 '22 by Rachel Gallen