 

Get website directory listing in an R vector using RCurl

Tags: r, rcurl

I'm trying to get the list of files in a directory on a website. Is there a way to do this similar to the dir() or list.files() commands for local directory listing? I can connect to the website using RCurl (I need it because I need an SSL connection over HTTPS):

library(RCurl)
text <- getURL(*some https website*,
               ssl.verifypeer = FALSE,
               dirlistonly = TRUE)

But this returns the HTML of the directory page (images, hyperlinks, etc.) rather than a plain list of files; I just need an R vector of filenames, as you would get from dir(). Is this possible? Or would I have to parse the HTML to extract the filenames? That sounds like a complicated approach to a simple problem.

Thanks,

EDIT: if you can get it to work with http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeGencodeV7/ then you'll see what I mean.
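For illustration, the HTML-parsing route I was hoping to avoid would look roughly like this (a sketch only, assuming the XML package is available; the grepl() pattern is just a guess at separating the data files from the navigation links on that page):

library(RCurl)
library(XML)

# Fetch the directory index page
html <- getURL("http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeGencodeV7/",
               ssl.verifypeer = FALSE)
# Pull out all <a href> targets from the page
links <- getHTMLLinks(html)
# Guessed filter: keep entries that look like data files, drop parent/sort links
files <- links[grepl("^wgEncode", links)]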

asked May 22 '13 by FBC


3 Answers

This is the last example in the help file for getURL (with an updated URL):

url <- "ftp://speedtest.tele2.net/"
filenames <- getURL(url, ftp.use.epsv = FALSE, dirlistonly = TRUE)

# Deal with newlines as \n or \r\n. (BDR)
# Or alternatively, instruct libcurl to change \n's to \r\n's for us with crlf = TRUE:
# filenames <- getURL(url, ftp.use.epsv = FALSE, dirlistonly = TRUE, crlf = TRUE)
filenames <- paste(url, strsplit(filenames, "\r*\n")[[1]], sep = "")

Does that solve your problem?
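Wrapped up as a small helper (a sketch only; ftp_dir is my own name, not part of RCurl), this gives a dir()-style vector for any FTP directory:

library(RCurl)

# Hypothetical helper (not part of RCurl): list an FTP directory like dir()
ftp_dir <- function(url, full.names = FALSE) {
  listing <- getURL(url, ftp.use.epsv = FALSE, dirlistonly = TRUE)
  filenames <- strsplit(listing, "\r*\n")[[1]]
  if (full.names) paste0(url, filenames) else filenames
}

ftp_dir("ftp://speedtest.tele2.net/")

Note that dirlistonly is an FTP-level feature of libcurl; for the HTTPS index page in your question you would still need the HTML-parsing route.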

answered Nov 08 '22 by Dag Hjermann


Try this:

library(RCurl)

dir_list <-
  read.table(
    textConnection(
      getURLContent("ftp://[...]/")
    ),
    sep = "",
    strip.white = TRUE)

The resulting table splits the date across three text fields, but it is a good start, and you can get the filenames from it.
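To pull just the filenames out of that table (a guess which assumes the server returns a Unix-style ls -l listing, where the name is the last column):

# Assumes a Unix-style listing: the filename is the last column
filenames <- as.character(dir_list[[ncol(dir_list)]])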

answered Nov 07 '22 by Sfroehlich


I was reading an RCurl document and came across a new piece of code:

stockReader <- function()
{
  values <- numeric()  # the data is appended here as it is received
  # Function that appends the values to the centrally stored vector
  read <- function(chunk) {
    con <- textConnection(chunk)
    on.exit(close(con))
    tmp <- scan(con)
    values <<- c(values, tmp)
  }
  list(read = read,
       values = function() values)  # accessor to get the result on completion
}

followed by

reader <- stockReader()
getURL("http://www.omegahat.org/RCurl/stockExample.dat",
       write = reader$read)
reader$values()

The sample collects numeric values, but surely it can be adapted to collect filenames instead; a sketch of such an adaptation follows the quoted passage below. Read the document itself and I'm sure you will find what you're looking for.

It also says

The basic use of getURL(), getForm() and postForm() returns the contents of the requested document as a single block of text. It is accumulated by the libcurl facilities and combined into a single string. We then typically traverse the contents of the document to extract the information into regular data, e.g. vectors and data frames. For example, suppose the document we requested is a simple stream of numbers such as prices of a particular stock at different time points. We would download the contents of the file, and then read it into a vector in R so that we could analyze the values.

Unfortunately, this results in essentially two copies of the data residing in memory simultaneously. This can be prohibitive or at least undesirable for large datasets. An alternative approach is to process the data in chunks as it is received by libcurl. If we can be notified each time libcurl receives data from the reply and do something meaningful with the data, then we need not accumulate the chunks. The largest extra piece of information we will need to have is the largest chunk.

In our example, we could take each chunk and pass it to the scan() function to turn the values into a vector. Then we can concatenate this with the vector from the previously processed chunks.
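Coming back to the adaptation: a minimal sketch that collects character filenames instead of numbers (fileReader is my own name, not from the document; it assumes the listing arrives as whitespace-separated names, and, like the original sample, a name split across two chunks would be broken in two):

library(RCurl)

# Hypothetical adaptation of stockReader for filenames
fileReader <- function() {
  values <- character()  # filenames accumulate here as chunks arrive
  read <- function(chunk) {
    con <- textConnection(chunk)
    on.exit(close(con))
    values <<- c(values, scan(con, what = character(), quiet = TRUE))
  }
  list(read = read,
       values = function() values)  # accessor to get the result on completion
}

reader <- fileReader()
getURL("ftp://speedtest.tele2.net/", ftp.use.epsv = FALSE,
       dirlistonly = TRUE, write = reader$read)
reader$values()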

answered Nov 07 '22 by Rachel Gallen