 

Read list of file names from web into R

Tags: dataframe, r

I am trying to read a lot of csv files into R from a website. There are multiple years of daily (business days only) files. All of the files have the same data structure. I can successfully read one file using the following logic:

# enter user credentials
user     <- "JohnDoe"
password <- "SecretPassword"
credentials <- paste(user,":",password,"@",sep="")
web.site <- "downloads.theice.com/Settlement_Reports_CSV/Power/"

# construct path to data
path <- paste("https://", credentials, web.site, sep="")

# read data for 4/10/2013
file  <- "icecleared_power_2013_04_10"
fname <- paste(path,file,".dat",sep="")
df <- read.csv(fname,header=TRUE, sep="|",as.is=TRUE)

However, I'm looking for tips on how to read all the files in the directory at once. I suppose I could generate a sequence of dates, construct the file name above in a loop, and use rbind to append each file, but that seems cumbersome. Plus, there will be issues when attempting to read weekends and holidays, where there are no files.
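For illustration, the cumbersome loop I have in mind would look something like this (the tryCatch to skip missing weekend/holiday files is just a guess on my part):

# sketch of the loop approach: try every calendar day and skip missing files
dates <- seq(as.Date("2013-01-01"), as.Date("2013-04-10"), by = "day")
df <- do.call(rbind, lapply(dates, function(d) {
  fname <- paste0(path, "icecleared_power_", format(d, "%Y_%m_%d"), ".dat")
  tryCatch(read.csv(fname, header = TRUE, sep = "|", as.is = TRUE),
           error = function(e) NULL)  # weekends/holidays: no file, return NULL
}))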

The images below show what the list of files looks like in the web browser:

[Screenshots: file list in browser, parts 1 and 2]

Is there a way to first scan the path (from above) to get a list of all the file names in the directory that meet a certain criterion (i.e., start with "icecleared_power_", since there are also some files in that location with a different starting name that I do not want to read in), then loop read.csv through that list and use rbind to append?

Any guidance would be greatly appreciated.

Asked Apr 11 '13 by MikeTP

2 Answers

I would first try to just scrape the links to the relevant data files and use the resulting information to construct the full download path that includes user logins and so on. As others have suggested, lapply would be convenient for batch downloading.

Here's an easy way to extract the URLs. Obviously, modify the example to suit your actual scenario.

Here, we're going to use the XML package to identify all the links available at the CRAN archives for the Amelia package (http://cran.r-project.org/src/contrib/Archive/Amelia/).

> library(XML)
> url <- "http://cran.r-project.org/src/contrib/Archive/Amelia/"
> doc <- htmlParse(url)
> links <- xpathSApply(doc, "//a/@href")
> free(doc)
> links
                   href                    href                    href 
             "?C=N;O=D"              "?C=M;O=A"              "?C=S;O=A" 
                   href                    href                    href 
             "?C=D;O=A" "/src/contrib/Archive/"  "Amelia_1.1-23.tar.gz" 
                   href                    href                    href 
 "Amelia_1.1-29.tar.gz"  "Amelia_1.1-30.tar.gz"  "Amelia_1.1-32.tar.gz" 
                   href                    href                    href 
 "Amelia_1.1-33.tar.gz"   "Amelia_1.2-0.tar.gz"   "Amelia_1.2-1.tar.gz" 
                   href                    href                    href 
  "Amelia_1.2-2.tar.gz"   "Amelia_1.2-9.tar.gz"  "Amelia_1.2-12.tar.gz" 
                   href                    href                    href 
 "Amelia_1.2-13.tar.gz"  "Amelia_1.2-14.tar.gz"  "Amelia_1.2-15.tar.gz" 
                   href                    href                    href 
 "Amelia_1.2-16.tar.gz"  "Amelia_1.2-17.tar.gz"  "Amelia_1.2-18.tar.gz" 
                   href                    href                    href 
  "Amelia_1.5-4.tar.gz"   "Amelia_1.5-5.tar.gz"   "Amelia_1.6.1.tar.gz" 
                   href                    href                    href 
  "Amelia_1.6.3.tar.gz"   "Amelia_1.6.4.tar.gz"     "Amelia_1.7.tar.gz" 

For the sake of demonstration, imagine that, ultimately, we only want the links for the 1.2 versions of the package.

> wanted <- links[grepl("Amelia_1\\.2.*", links)]
> wanted
                  href                   href                   href 
 "Amelia_1.2-0.tar.gz"  "Amelia_1.2-1.tar.gz"  "Amelia_1.2-2.tar.gz" 
                  href                   href                   href 
 "Amelia_1.2-9.tar.gz" "Amelia_1.2-12.tar.gz" "Amelia_1.2-13.tar.gz" 
                  href                   href                   href 
"Amelia_1.2-14.tar.gz" "Amelia_1.2-15.tar.gz" "Amelia_1.2-16.tar.gz" 
                  href                   href 
"Amelia_1.2-17.tar.gz" "Amelia_1.2-18.tar.gz" 

You can now use that vector as follows:

# keep only the 1.2-series links
wanted <- links[grepl("Amelia_1\\.2.*", links)]
# build the full URLs and download each file to the working directory
GetMe <- paste(url, wanted, sep = "")
lapply(seq_along(GetMe), 
       function(x) download.file(GetMe[x], wanted[x], mode = "wb"))

Update (to clarify your question in comments)

The last step in the example above downloads the specified files to your current working directory (use getwd() to verify where that is). If, instead, you know for sure that read.csv works on the data, you can also try to modify your anonymous function to read the files directly:

lapply(seq_along(GetMe), 
       function(x) read.csv(GetMe[x], header = TRUE, sep = "|", as.is = TRUE))

However, I think a safer approach might be to download all the files into a single directory first, and then use read.delim or read.csv or whatever works to read in the data, as @Andreas suggested. I say safer because it gives you more flexibility in case files aren't fully downloaded: instead of having to re-download everything, you would only need to re-download the files that failed.
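A minimal sketch of that download-first pattern, adapted to pipe-delimited files like yours (the "downloads" directory name and re-using wanted as the local file names are my own choices):

# download everything into one directory first, then read the local copies
dir.create("downloads", showWarnings = FALSE)
local <- file.path("downloads", wanted)
Map(function(u, f) download.file(u, f, mode = "wb"), GetMe, local)
dfs <- lapply(local, read.csv, header = TRUE, sep = "|", as.is = TRUE)
combined <- do.call(rbind, dfs)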

Answered by A5C1D2H2I1M1N2O1R2T1


@MikeTP, if all the reports start with "icecleared_power_" followed by a business date, the "timeDate" package offers an easy way to create a vector of business days, like so:

require(timeDate)
tSeq <- timeSequence("2012-01-01","2012-12-31") # vector of days
tBiz <- tSeq[isBizday(tSeq)] # vector of business days

and

paste0("icecleared_power_", format(tBiz, "%Y_%m_%d")) # underscores match the question's file names

gives you the concatenated file names.
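Combining that with the path you built in the question, you could then read and stack everything (a sketch; it assumes read.csv works on each file):

files <- paste0(path, "icecleared_power_", format(tBiz, "%Y_%m_%d"), ".dat")
dfs   <- lapply(files, read.csv, header = TRUE, sep = "|", as.is = TRUE)
df    <- do.call(rbind, dfs)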

If the web site follows a different logic regarding the naming of files, we need more information, as Ananda Mahto observed.

Keep in mind that when you create a date vector with timeDate, you can get much more sophisticated than my simple example. You can take into account holiday schedules, stock exchange calendars, and so on.
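For example, to drop exchange holidays as well as weekends (a sketch using the NYSE calendar; substitute whichever calendar matches ICE's schedule):

tSeq <- timeSequence("2012-01-01", "2012-12-31")
tBiz <- tSeq[isBizday(tSeq, holidays = holidayNYSE(2012))]  # weekends and NYSE holidays removed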

Answered by hvollmeier