
Importing data into R from a Google spreadsheet

I just wrote a simple package to solve exactly this problem: downloading a Google sheet using just the URL.

install.packages('gsheet')
library(gsheet)
gsheet2tbl('docs.google.com/spreadsheets/d/1I9mJsS5QnXF2TNNntTy-HrcdHmIF9wJ8ONYvEJTXSNo')

More detail is here: https://github.com/maxconway/gsheet
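
If you need more control over the parsing, the package also provides gsheet2text(), which downloads the sheet as raw text so you can pass your own arguments to read.csv(). A minimal sketch, assuming the same example sheet:

library(gsheet)

# Download the sheet as plain CSV text, then parse it manually,
# e.g. with custom stringsAsFactors or na.strings settings
raw <- gsheet2text('docs.google.com/spreadsheets/d/1I9mJsS5QnXF2TNNntTy-HrcdHmIF9wJ8ONYvEJTXSNo', format='csv')
dat <- read.csv(text=raw, stringsAsFactors=FALSE)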


Use the googlesheets package, a Google Sheets R API by Jenny Bryan. It is the best way to analyze and edit Google Sheets data in R. Not only can it pull data from Google Sheets, but you can edit the data in Google Sheets, create new sheets, etc.

The package can be installed with install.packages("googlesheets").

There's a vignette for getting started; see her GitHub repository for more. And you also can install the latest development version of the package from that GitHub page, if desired.
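
A minimal sketch of the basic workflow (the sheet title below is a placeholder; see the vignette for the full API):

library(googlesheets)

# Authorize access to your Google account; this opens a browser the
# first time and caches the token afterwards
gs_auth()

# Register a sheet by its title, then read it into a data frame
ss <- gs_title("my_sheet")  # placeholder title
dat <- gs_read(ss)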


I am working on a solution for this. Here is a function that works on your data as well as a few of my own Google Spreadsheets.

First, we need a function to read from Google sheets. readGoogleSheet() will return a list of data frames, one for each table found on the Google sheet:

readGoogleSheet <- function(url, na.string="", header=TRUE){
  stopifnot(require(XML))
  # Suppress warnings because Google docs seems to have an incomplete final line
  suppressWarnings({
    doc <- paste(readLines(url), collapse=" ")
  })
  if(nchar(doc) == 0) stop("No content found")
  # Extract everything from the first <table> through the last </table>
  htmlTable <- gsub("^.*?(<table.*</table).*$", "\\1>", doc)
  ret <- readHTMLTable(htmlTable, header=header, stringsAsFactors=FALSE, as.data.frame=TRUE)
  # Replace the na.string marker with real NA values in every table
  lapply(ret, function(x){ x[x == na.string] <- NA; x })
}

Next, we need a function to clean the individual tables. cleanGoogleTable() removes empty lines inserted by Google, removes the row names (if they exist) and allows you to skip empty lines before the table starts:

cleanGoogleTable <- function(dat, table=1, skip=0, ncols=NA, nrows=-1, header=TRUE, dropFirstCol=NA){
  if(!is.data.frame(dat)){
    dat <- dat[[table]]
  }

  # Drop the first column if it looks like row names (all "." or a running index)
  if(is.na(dropFirstCol)) {
    firstCol <- na.omit(dat[[1]])
    if(all(firstCol == ".") || all(firstCol == as.character(seq_along(firstCol)))) {
      dat <- dat[, -1]
    }
  } else if(dropFirstCol) {
    dat <- dat[, -1]
  }

  # Skip leading rows before the table starts
  if(skip > 0){
    dat <- dat[-seq_len(skip), ]
  }

  if(nrow(dat) == 1) return(dat)

  # Remove the empty row that Google sometimes inserts after the first row
  if(nrow(dat) >= 2){
    if(all(is.na(dat[2, ]))) dat <- dat[-2, ]
  }

  # Promote the first row to column names
  if(header && nrow(dat) > 1){
    header <- as.character(dat[1, ])
    names(dat) <- header
    dat <- dat[-1, ]
  }

  # Keep only desired columns
  if(!is.na(ncols)){
    ncols <- min(ncols, ncol(dat))
    dat <- dat[, seq_len(ncols)]
  }

  # Keep only desired rows
  if(nrows > 0){
    nrows <- min(nrows, nrow(dat))
    dat <- dat[seq_len(nrows), ]
  }

  # Rename rows
  rownames(dat) <- seq_len(nrow(dat))
  dat
}

Now we are ready to read your Google sheet:

> u <- "https://docs.google.com/spreadsheets/d/0AmFzIcfgCzGFdHQ0eEU0MWZWV200RjgtTXVMY1NoQVE/pubhtml"
> g <- readGoogleSheet(u)
> cleanGoogleTable(g, table=1)


         2012-Jan Mobile internet Tanzania
1 Airtel Zantel Vodacom Tigo TTCL Combined


> cleanGoogleTable(g, table=2, skip=1)

                           BUNDLE       FEE VALIDITY     MB    Cost Sh/MB
1             Daily Bundle (20MB)     500/=    1 day     20     500  25.0
2            1 Day bundle (300MB)   3,000/=    1 day    300   3,000  10.0
3             Weekly bundle (3GB)  15,000/=   7 days  3,000  15,000   5.0
4            Monthly bundle (8GB)  70,000/=  30 days  8,000  70,000   8.8
5         Quarterly Bundle (24GB) 200,000/=  90 days 24,000 200,000   8.3
6            Yearly Bundle (96GB) 750,000/= 365 days 96,000 750,000   7.8
7 Handset Browsing Bundle(400 MB)   2,500/=  30 days    400   2,500   6.3
8                        STANDARD      <NA>     <NA>      1    <NA>  <NA>

Not sure whether other use cases are more complex or whether something has changed in the meantime, but after publishing the spreadsheet in CSV format (File → Publish to the web in Google Sheets), this simple one-liner worked for me:

myCSV <- read.csv("http://docs.google.com/spreadsheets/d/1XKeAajiH47jAP0bPkCtS4OdOGTSsjleOXImDrFzxxZQ/pub?output=csv")

Tested with R version 3.3.2 (2016-10-31).


Here is an easy way to fetch a Google sheet, even if you're behind a proxy:

require(RCurl)
fileUrl <- "https://docs.google.com/spreadsheets/d/[ID]/export?format=csv"
fileCSV <- getURL(fileUrl,.opts=list(ssl.verifypeer=FALSE))
fileCSVDF <- read.csv(textConnection(fileCSV))
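
If the proxy has to be named explicitly, RCurl accepts the standard libcurl options; a hedged sketch, with a placeholder proxy address:

library(RCurl)

fileUrl <- "https://docs.google.com/spreadsheets/d/[ID]/export?format=csv"
# "http://proxy.example.com:8080" is a placeholder; add the proxyuserpwd
# option as well if your proxy requires credentials
fileCSV <- getURL(fileUrl,
                  .opts=list(proxy="http://proxy.example.com:8080",
                             ssl.verifypeer=FALSE))
fileCSVDF <- read.csv(textConnection(fileCSV))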

A simpler way.

Be sure to match your URL carefully to the format of the example here. You can get everything except the /export?format=csv piece from the Google Spreadsheets edit page; append that piece manually, then use the URL as shown.

library(RCurl)
library(mosaic)
mydat2 <- fetchGoogle(paste0("https://docs.google.com/spreadsheets/d/",
  "1mAxpSTrjdFv1UrpxwDTpieVJP16R9vkSQrpHV8lVTA8/export?format=csv"))
mydat2