I just wrote a simple package to solve exactly this problem: downloading a Google sheet using just the URL.
install.packages('gsheet')
library(gsheet)
gsheet2tbl('docs.google.com/spreadsheets/d/1I9mJsS5QnXF2TNNntTy-HrcdHmIF9wJ8ONYvEJTXSNo')
More detail is here: https://github.com/maxconway/gsheet
Use the googlesheets package, a Google Sheets R API by Jenny Bryan. It is the best way to analyze and edit Google Sheets data in R. Not only can it pull data from Google Sheets, but you can edit the data in Google Sheets, create new sheets, etc.
The package can be installed with install.packages("googlesheets")
.
There's a vignette for getting started; see her GitHub repository for more. And you also can install the latest development version of the package from that GitHub page, if desired.
I am working on a solution for this. Here is a function that works on your data as well as a few of my own Google Spreadsheets.
First, we need a function to read from Google sheets. readGoogleSheet()
will return a list of data frames, one for each table found on the Google sheet:
readGoogleSheet <- function(url, na.string="", header=TRUE){
stopifnot(require(XML))
# Suppress warnings because Google docs seems to have incomplete final line
suppressWarnings({
doc <- paste(readLines(url), collapse=" ")
})
if(nchar(doc) == 0) stop("No content found")
htmlTable <- gsub("^.*?(<table.*</table).*$", "\\1>", doc)
ret <- readHTMLTable(htmlTable, header=header, stringsAsFactors=FALSE, as.data.frame=TRUE)
lapply(ret, function(x){ x[ x == na.string] <- NA; x})
}
Next, we need a function to clean the individual tables. cleanGoogleTable()
removes empty lines inserted by Google, removes the row names (if they exist) and allows you to skip empty lines before the table starts:
cleanGoogleTable <- function(dat, table=1, skip=0, ncols=NA, nrows=-1, header=TRUE, dropFirstCol=NA){
if(!is.data.frame(dat)){
dat <- dat[[table]]
}
if(is.na(dropFirstCol)) {
firstCol <- na.omit(dat[[1]])
if(all(firstCol == ".") || all(firstCol== as.character(seq_along(firstCol)))) {
dat <- dat[, -1]
}
} else if(dropFirstCol) {
dat <- dat[, -1]
}
if(skip > 0){
dat <- dat[-seq_len(skip), ]
}
if(nrow(dat) == 1) return(dat)
if(nrow(dat) >= 2){
if(all(is.na(dat[2, ]))) dat <- dat[-2, ]
}
if(header && nrow(dat) > 1){
header <- as.character(dat[1, ])
names(dat) <- header
dat <- dat[-1, ]
}
# Keep only desired columns
if(!is.na(ncols)){
ncols <- min(ncols, ncol(dat))
dat <- dat[, seq_len(ncols)]
}
# Keep only desired rows
if(nrows > 0){
nrows <- min(nrows, nrow(dat))
dat <- dat[seq_len(nrows), ]
}
# Rename rows
rownames(dat) <- seq_len(nrow(dat))
dat
}
Now we are ready to read you Google sheet:
> u <- "https://docs.google.com/spreadsheets/d/0AmFzIcfgCzGFdHQ0eEU0MWZWV200RjgtTXVMY1NoQVE/pubhtml"
> g <- readGoogleSheet(u)
> cleanGoogleTable(g, table=1)
2012-Jan Mobile internet Tanzania
1 Airtel Zantel Vodacom Tigo TTCL Combined
> cleanGoogleTable(g, table=2, skip=1)
BUNDLE FEE VALIDITY MB Cost Sh/MB
1 Daily Bundle (20MB) 500/= 1 day 20 500 25.0
2 1 Day bundle (300MB) 3,000/= 1 day 300 3,000 10.0
3 Weekly bundle (3GB) 15,000/= 7 days 3,000 15,000 5.0
4 Monthly bundle (8GB) 70,000/= 30 days 8,000 70,000 8.8
5 Quarterly Bundle (24GB) 200,000/= 90 days 24,000 200,000 8.3
6 Yearly Bundle (96GB) 750,000/= 365 days 96,000 750,000 7.8
7 Handset Browsing Bundle(400 MB) 2,500/= 30 days 400 2,500 6.3
8 STANDARD <NA> <NA> 1 <NA> <NA>
Not sure if other use cases have a higher complexity or if something changed in the meantime. After publishing the spreadsheet in CSV format this simple 1-liner worked for me:
myCSV<-read.csv("http://docs.google.com/spreadsheets/d/1XKeAajiH47jAP0bPkCtS4OdOGTSsjleOXImDrFzxxZQ/pub?output=csv")
R version 3.3.2 (2016-10-31)
There is an easiest way to fetch the google sheets even if you're behind the proxy
require(RCurl)
fileUrl <- "https://docs.google.com/spreadsheets/d/[ID]/export?format=csv"
fileCSV <- getURL(fileUrl,.opts=list(ssl.verifypeer=FALSE))
fileCSVDF <- read.csv(textConnection(fileCSV))
A simpler way.
Be sure to match your URL carefully to the format of the example one here. You can get all but the /export?format=csv
piece from the Google Spreadsheets edit page. Then, just manually add this piece to the URL and then use as shown here.
library(RCurl)
library(mosaic)
mydat2 <- fetchGoogle(paste0("https://docs.google.com/spreadsheets/d/",
"1mAxpSTrjdFv1UrpxwDTpieVJP16R9vkSQrpHV8lVTA8/export?format=csv"))
mydat2
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With