Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Accessing google docs revision history through the API using R?

I wish to download and analyse the revision history of one of my google docs using R, finding out statistics like how many edits did whom.

I see that there are already some ways for accessing google docs using R.

Does anyone know upfront (before I go ahead and try to hack my way), if or how it might be (reasonably easily) done?

Thanks.

like image 434
Tal Galili Avatar asked Mar 01 '12 19:03

Tal Galili


People also ask

Can you view revision history in Google Docs?

On your computer, open a document, spreadsheet, or presentation. See version history. Choose a previous version. You can find who updated the file and their changes.


2 Answers

The googledrive package includes some low-level API functions that can be used for this. For example, here's how we can get a list of revisions for one Google doc:

library(googledrive)
library(tidyverse)

# replace this with the ID of your google doc
# this doc is private, it wont work for you
fileId <- "1s0CPFXnMQjZNts6gYAnkcGXGSAgugTupzMf8YeoCbps"

# Get the name of the file and some other metadata
file <- build_request(
  path = "drive/v3/files/{fileId}",
  method = "GET",
  params = list(
    fileId = fileId,
    fields = "*"
  ),
  token = drive_token()
)
file_ret <-  process_response(make_request(file))

# Now for this doc, query the Drive API to get get URLs and other meta-data for all the revisions available to us

req2 <- build_request(
  path = "drive/v2/files/{fileId}/revisions",
  method = "GET",
  params = list(
    fileId = fileId
  ),
  token = drive_token()
)
revs2 <-  process_response(make_request(req2))

# See 
# https://developers.google.com/drive/api/v2/reference/revisions#resource
# for an explanation of each variable that we have here

# tidy revisions into a dataframe
revs2_df <-
  map_df(
    revs2$items,
    `[`,
    c(
      "kind",
      "etag" ,
      "id",
      "selfLink"   ,
      "mimeType"     ,
      "modifiedDate",
      "published"   ,
      "lastModifyingUserName"
    )
  )
# get exportLinks URLs out of its nest
revs2_export_url <- map_df(revs2$items, "exportLinks")
# bind together
revs2_df_bind <- bind_cols(revs2_df, revs2_export_url)

The result includes, for each revision, the date, time, name of the user that did the revision, and URLs to export that revision into a download file:

# A tibble: 140 x 16
   kind   etag  id    selfLink mimeType modifiedDate published lastModifyingUs… `application/rt…
   <chr>  <chr> <chr> <chr>    <chr>    <chr>        <lgl>     <chr>            <chr>           
 1 drive… "\"H… 28367 https:/… applica… 2017-09-12T… FALSE     Gayoung Park     https://docs.go…
 2 drive… "\"H… 28487 https:/… applica… 2017-09-12T… FALSE     Gayoung Park     https://docs.go…
 3 drive… "\"H… 28862 https:/… applica… 2017-09-13T… FALSE     Gayoung Park     https://docs.go…
 4 drive… "\"H… 29221 https:/… applica… 2017-09-13T… FALSE     Gayoung Park     https://docs.go…
 5 drive… "\"H… 29258 https:/… applica… 2017-09-13T… FALSE     Gayoung Park     https://docs.go…
 6 drive… "\"H… 29434 https:/… applica… 2017-09-13T… FALSE     Gayoung Park     https://docs.go…
 7 drive… "\"H… 29454 https:/… applica… 2017-09-18T… FALSE     Gayoung Park     https://docs.go…
 8 drive… "\"H… 29603 https:/… applica… 2017-09-18T… FALSE     Gayoung Park     https://docs.go…
 9 drive… "\"H… 30108 https:/… applica… 2017-09-18T… FALSE     Gayoung Park     https://docs.go…
10 drive… "\"H… 30115 https:/… applica… 2017-09-21T… FALSE     Gayoung Park     https://docs.go…
# ... with 130 more rows, and 7 more variables: `application/vnd.oasis.opendocument.text` <chr>,
#   `text/html` <chr>, `application/pdf` <chr>, `application/epub+zip` <chr>,
#   `application/zip` <chr>,
#   `application/vnd.openxmlformats-officedocument.wordprocessingml.document` <chr>,
#   `text/plain` <chr>

We can then loop over the export URLs to download all the revisions, and compare the size or word count or whatever, and eventually get some plots like this:

Full code for those plots is here: https://gist.github.com/benmarwick/1feaa2b2f0d7bc5f7e97903b8ff92aed

Be aware that there are some severe limitations to the Google Drive revision history that is available via the API. For example,

  • When many users edit at the same time, we only get the name of the first editor active in that session. The others are not captured.
  • When many edits happen in a short period of time, these are combined by Google into one revision, and we cannot see them individually. We do not have good temporal resolution.
  • Older edits are deleted by Google to save space. We do not know what their rules are for this.
like image 115
Ben Avatar answered Sep 20 '22 22:09

Ben


The Google Documents List API gives you access to a list of revisions, the contents of each revision, a way to make sure that a revision is stored for a given update, etc.

The list of revisions may give you what you're looking for, but you'll need to be more specific on what statistics you want. Note that it seems that history data on who made what change can be a bit incomplete - see API problems with revision history - Google Documents List API | Google Groups

like image 39
nealmcb Avatar answered Sep 19 '22 22:09

nealmcb