 

Load a table from wikipedia into R

I'm trying to load the table of Supreme Court Justices into R from the following URL. https://en.wikipedia.org/wiki/List_of_Justices_of_the_Supreme_Court_of_the_United_States

I'm using the following code:

scotusURL <- "https://en.wikipedia.org/wiki/List_of_Justices_of_the_Supreme_Court_of_the_United_States"
scotusData <- getURL(scotusURL, ssl.verifypeer = FALSE)
scotusDoc <- htmlParse(scotusData)
scotusData <- scotusDoc['//table[@class="wikitable"]']
scotusTable <- readHTMLTable(scotusData[[1]], stringsAsFactors = FALSE)

R returns scotusTable as NULL. The goal is to get a data.frame in R that I can use to make a ggplot of SCOTUS justice tenure on the Court. I previously had the script working and producing a great plot, but after the recent decisions something changed on the page and now the script doesn't function. I went through the HTML on Wikipedia to look for changes, but I'm not a web dev, so whatever broke my script isn't immediately apparent.

Additionally, is there a method in R that would allow me to cache the data from this page so I'm not constantly referencing the URL? That would seem to be the ideal way to avoid this issue in the future. Appreciate the help.

As an aside, SCOTUS is an ongoing hobby/side project of mine, so if there's some other data source out there that's better than Wikipedia, I'm all ears.

Edit: Sorry I should have listed my dependencies. I'm using the XML, plyr, RCurl, data.table, and ggplot2 libraries.

asked Dec 05 '22 by Benjamin Scott

1 Answer

If you don't mind using a different package, you can try the "rvest" package.

library(rvest)    
scotusURL <- "https://en.wikipedia.org/wiki/List_of_Justices_of_the_Supreme_Court_of_the_United_States"
  • Option 1: Grab the tables from the page and use the html_table function to extract the tables you're interested in.

    temp <- scotusURL %>% 
      read_html %>%    ## html() is deprecated in current rvest
      html_nodes("table")
    
    html_table(temp[[1]]) ## Just the "legend" table
    html_table(temp[[2]]) ## The table you're interested in
    
  • Option 2: Inspect the table element and copy the XPath to read that table directly (right-click, inspect element, scroll to the relevant "table" tag, right click on that, and select "Copy XPath").

    scotusURL %>% 
      read_html %>% 
      html_nodes(xpath = '//*[@id="mw-content-text"]/table[2]') %>% 
      html_table
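
For the tenure plot mentioned in the question, here's a minimal ggplot2 sketch. The column names (`justice`, `start`, `end`) are assumptions, not the actual Wikipedia headers, so parse and rename the scraped columns first:

```r
library(ggplot2)

# Hypothetical tidy form of the scraped table: one row per justice,
# with appointment start/end already parsed as Date columns.
tenure_plot <- function(df) {
  ggplot(df, aes(x = reorder(justice, as.numeric(start)),
                 ymin = start, ymax = end)) +
    geom_linerange() +   # one bar per justice (horizontal after the flip)
    coord_flip() +
    labs(x = NULL, y = "Tenure on the Court")
}

# Example with made-up data:
toy <- data.frame(
  justice = c("Jay", "Rutledge"),
  start   = as.Date(c("1789-10-19", "1790-02-15")),
  end     = as.Date(c("1795-06-29", "1791-03-05"))
)
p <- tenure_plot(toy)
```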
    

Another option I like is loading the data in a Google spreadsheet and reading it using the "googlesheets" package.

In Google Drive, create a new spreadsheet named, for instance, "Supreme Court". In the first worksheet, enter:

=importhtml("https://en.wikipedia.org/wiki/List_of_Justices_of_the_Supreme_Court_of_the_United_States", "table", 2)

This will automatically scrape this table into your Google spreadsheet.

From there, in R you can do:

library(googlesheets)
SC <- gs_title("Supreme Court")
gs_read(SC)
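
To address the caching question: one option is to save the parsed table locally with `saveRDS()` and only hit Wikipedia when no cached copy exists. This is a sketch assuming the rvest approach and the table position from above; the helper name and cache file name are arbitrary:

```r
# Minimal cache wrapper: reuse a local .rds copy of the table when present,
# and re-download only when the cache is missing or refresh = TRUE.
fetch_scotus <- function(url, cache_file = "scotus_table.rds", refresh = FALSE) {
  if (!refresh && file.exists(cache_file)) {
    return(readRDS(cache_file))             # offline: use the cached data.frame
  }
  tables <- rvest::html_nodes(rvest::read_html(url), "table")
  result <- rvest::html_table(tables[[2]])  # the justices table, as above
  saveRDS(result, cache_file)               # cache for next time
  result
}

# scotusTable <- fetch_scotus(scotusURL)
```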
answered Dec 18 '22 by A5C1D2H2I1M1N2O1R2T1