I am trying to read a lot of CSV files into R from a website. There are multiple years of daily (business days only) files. All of the files have the same data structure. I can successfully read one file using the following logic:
# enter user credentials
user <- "JohnDoe"
password <- "SecretPassword"
credentials <- paste(user,":",password,"@",sep="")
web.site <- "downloads.theice.com/Settlement_Reports_CSV/Power/"
# construct path to data
path <- paste("https://", credentials, web.site, sep="")
# read data for 4/10/2013
file <- "icecleared_power_2013_04_10"
fname <- paste(path,file,".dat",sep="")
df <- read.csv(fname,header=TRUE, sep="|",as.is=TRUE)
However, I'm looking for tips on how to read all the files in the directory at once. I suppose I could generate a sequence of dates, construct the file name above in a loop, and use rbind to append each file, but that seems cumbersome. Plus, there will be issues when attempting to read weekends and holidays, where there are no files.
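Something like this rough sketch is what I imagine (the date range is just an example, and the tryCatch is my guess at how to skip weekend/holiday dates where no file exists):

dates  <- seq(as.Date("2013-04-01"), as.Date("2013-04-30"), by = "day")
fnames <- paste0(path, "icecleared_power_", format(dates, "%Y_%m_%d"), ".dat")
all.data <- NULL
for (f in fnames) {
  day.df <- tryCatch(read.csv(f, header = TRUE, sep = "|", as.is = TRUE),
                     error = function(e) NULL)   # skip dates with no file
  if (!is.null(day.df)) all.data <- rbind(all.data, day.df)
}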
The images below show what the list of files looks like in the web browser:
[screenshots of the directory listing omitted]
Is there a way to scan the path (from above) to get a list of all the file names in the directory that meet a certain criterion (i.e. start with "icecleared_power_", as there are also some files in that location with a different starting name that I do not want to read in), and then loop read.csv through that list and use rbind to append?
Any guidance would be greatly appreciated.
I would first try to just scrape the links to the relevant data files and use the resulting information to construct the full download path that includes user logins and so on. As others have suggested, lapply would be convenient for batch downloading.
Here's an easy way to extract the URLs. Obviously, modify the example to suit your actual scenario.
Here, we're going to use the XML package to identify all the links available at the CRAN archive for the Amelia package (http://cran.r-project.org/src/contrib/Archive/Amelia/).
> library(XML)
> url <- "http://cran.r-project.org/src/contrib/Archive/Amelia/"
> doc <- htmlParse(url)
> links <- xpathSApply(doc, "//a/@href")
> free(doc)
> links
href href href
"?C=N;O=D" "?C=M;O=A" "?C=S;O=A"
href href href
"?C=D;O=A" "/src/contrib/Archive/" "Amelia_1.1-23.tar.gz"
href href href
"Amelia_1.1-29.tar.gz" "Amelia_1.1-30.tar.gz" "Amelia_1.1-32.tar.gz"
href href href
"Amelia_1.1-33.tar.gz" "Amelia_1.2-0.tar.gz" "Amelia_1.2-1.tar.gz"
href href href
"Amelia_1.2-2.tar.gz" "Amelia_1.2-9.tar.gz" "Amelia_1.2-12.tar.gz"
href href href
"Amelia_1.2-13.tar.gz" "Amelia_1.2-14.tar.gz" "Amelia_1.2-15.tar.gz"
href href href
"Amelia_1.2-16.tar.gz" "Amelia_1.2-17.tar.gz" "Amelia_1.2-18.tar.gz"
href href href
"Amelia_1.5-4.tar.gz" "Amelia_1.5-5.tar.gz" "Amelia_1.6.1.tar.gz"
href href href
"Amelia_1.6.3.tar.gz" "Amelia_1.6.4.tar.gz" "Amelia_1.7.tar.gz"
For the sake of demonstration, imagine that, ultimately, we only want the links for the 1.2 versions of the package.
> wanted <- links[grepl("Amelia_1\\.2.*", links)]
> wanted
href href href
"Amelia_1.2-0.tar.gz" "Amelia_1.2-1.tar.gz" "Amelia_1.2-2.tar.gz"
href href href
"Amelia_1.2-9.tar.gz" "Amelia_1.2-12.tar.gz" "Amelia_1.2-13.tar.gz"
href href href
"Amelia_1.2-14.tar.gz" "Amelia_1.2-15.tar.gz" "Amelia_1.2-16.tar.gz"
href href
"Amelia_1.2-17.tar.gz" "Amelia_1.2-18.tar.gz"
You can now use that vector as follows:
wanted <- links[grepl("Amelia_1\\.2.*", links)]
GetMe <- paste(url, wanted, sep = "")
lapply(seq_along(GetMe),
function(x) download.file(GetMe[x], wanted[x], mode = "wb"))
The last step in the example above downloads the specified files to your current working directory (use getwd() to verify where that is). If, instead, you know for sure that read.csv works on the data, you can also try to modify your anonymous function to read the files directly:
lapply(seq_along(GetMe),
function(x) read.csv(GetMe[x], header = TRUE, sep = "|", as.is = TRUE))
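Note that lapply returns a list of data frames, one per file; assuming they all share the same columns (as you say they do), something like do.call should stack them into the single data frame you're after:

all.data <- do.call(rbind,
                    lapply(seq_along(GetMe),
                           function(x) read.csv(GetMe[x], header = TRUE,
                                                sep = "|", as.is = TRUE)))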
However, I think a safer approach might be to download all the files into a single directory first, and then use read.delim or read.csv or whatever works to read in the data, similar to what @Andreas suggested. I say safer because it gives you more flexibility in case files aren't fully downloaded and so on. In that case, instead of having to re-download everything, you would only need to download the files that were not fully downloaded.
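As a rough sketch of that two-step workflow, adapted to your scenario (the "ice_data" directory name is just an illustration, and I'm assuming GetMe holds the full URLs and wanted the bare file names of the files you care about):

# step 1: download everything into a local directory, skipping files that
# are already there (so a failed run can simply be re-run)
dir.create("ice_data", showWarnings = FALSE)
for (i in seq_along(GetMe)) {
  destfile <- file.path("ice_data", wanted[i])
  if (!file.exists(destfile)) download.file(GetMe[i], destfile, mode = "wb")
}

# step 2: read the local copies and stack them into one data frame
local.files <- list.files("ice_data", pattern = "^icecleared_power_",
                          full.names = TRUE)
all.data <- do.call(rbind, lapply(local.files, read.csv,
                                  header = TRUE, sep = "|", as.is = TRUE))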
@MikeTP, if all the reports start with "icecleared_power_" followed by a business date, the package "timeDate" offers an easy way to create a vector of business dates, like so:
require(timeDate)
tSeq <- timeSequence("2012-01-01","2012-12-31") # vector of days
tBiz <- tSeq[isBizday(tSeq)] # vector of business days
and paste0("icecleared_power_", gsub("-", "_", as.character.Date(tBiz))) gives you the concatenated file names (the gsub swaps the hyphens in the ISO dates for the underscores used in your file names).
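Putting that together with the read step, a minimal sketch (the 2013 date range, the ".dat" extension and the do.call(rbind, ...) stacking are assumptions taken from your question):

require(timeDate)
tSeq  <- timeSequence("2013-01-01", "2013-12-31")   # all days in 2013
tBiz  <- tSeq[isBizday(tSeq)]                       # business days only
files <- paste0("icecleared_power_", gsub("-", "_", as.character.Date(tBiz)))
urls  <- paste0(path, files, ".dat")   # 'path' as constructed in your question

# read each business-day file and stack them; tryCatch skips holidays
# for which no file was published
df <- do.call(rbind, lapply(urls, function(u)
  tryCatch(read.csv(u, header = TRUE, sep = "|", as.is = TRUE),
           error = function(e) NULL)))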
If the web site follows a different logic regarding the naming of files, we need more information, as Ananda Mahto observed.
Keep in mind that when you create a date vector with timeDate you can get much more sophisticated than my simple example. You can take into account holiday schedules, stock exchange calendars, etc.
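For example, a quick sketch that also drops NYSE holidays (swap in whichever holiday calendar applies to your market):

require(timeDate)
tSeq <- timeSequence("2012-01-01", "2012-12-31")
# business days excluding weekends and NYSE holidays for 2012
tBiz <- tSeq[isBizday(tSeq, holidays = holidayNYSE(2012))]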