 

Load a table from wikipedia into R

I'm trying to load the table of Supreme Court Justices into R from the following URL. https://en.wikipedia.org/wiki/List_of_Justices_of_the_Supreme_Court_of_the_United_States

I'm using the following code:

scotusURL <- "https://en.wikipedia.org/wiki/List_of_Justices_of_the_Supreme_Court_of_the_United_States"
scotusData <- getURL(scotusURL, ssl.verifypeer = FALSE)
scotusDoc <- htmlParse(scotusData)
scotusData <- scotusDoc['//table[@class="wikitable"]']
scotusTable <- readHTMLTable(scotusData[[1]], stringsAsFactors = FALSE)

R returns scotusTable as NULL. The goal is to get a data.frame in R that I can use to make a ggplot of SCOTUS justice tenure on the Court. I previously had the script working and producing a great plot, but after the recent decisions something changed on the page and now the script doesn't function. I went through the HTML on Wikipedia to look for changes, but I'm not a web dev, so whatever broke my script isn't immediately apparent.

Additionally, is there a method in R that would allow me to cache the data from this page so I'm not constantly referencing the URL? That would seem to be the ideal way to avoid this issue in the future. Appreciate the help.

As an aside, SCOTUS is an ongoing hobby/side project of mine, so if there's some other data source out there that's better than Wikipedia, I'm all ears.

Edit: Sorry I should have listed my dependencies. I'm using the XML, plyr, RCurl, data.table, and ggplot2 libraries.

asked Dec 05 '22 by Benjamin Scott

1 Answer

If you don't mind using a different package, you can try the "rvest" package.

library(rvest)    
scotusURL <- "https://en.wikipedia.org/wiki/List_of_Justices_of_the_Supreme_Court_of_the_United_States"
  • Option 1: Grab the tables from the page and use the html_table function to extract the tables you're interested in.

    temp <- scotusURL %>% 
      read_html %>%    ## html() is deprecated in current rvest
      html_nodes("table")
    
    html_table(temp[[1]]) ## Just the "legend" table
    html_table(temp[[2]]) ## The table you're interested in
    
  • Option 2: Inspect the table element and copy the XPath to read that table directly (right-click, inspect element, scroll to the relevant "table" tag, right click on that, and select "Copy XPath").

    scotusURL %>% 
      read_html %>% 
      html_nodes(xpath = '//*[@id="mw-content-text"]/table[2]') %>% 
      html_table
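
For the tenure plot mentioned in the question, here's a minimal ggplot2 sketch. The column names (`justice`, `start`, `end`) are assumptions, not the actual Wikipedia headers, so parse and rename the scraped columns first:

```r
library(ggplot2)

# Hypothetical tidy form of the scraped table: one row per justice,
# with appointment start/end already parsed as Date columns.
tenure_plot <- function(df) {
  ggplot(df, aes(x = reorder(justice, as.numeric(start)),
                 ymin = start, ymax = end)) +
    geom_linerange() +   # one bar per justice (horizontal after the flip)
    coord_flip() +
    labs(x = NULL, y = "Tenure on the Court")
}

# Example with made-up data:
toy <- data.frame(
  justice = c("Jay", "Rutledge"),
  start   = as.Date(c("1789-10-19", "1790-02-15")),
  end     = as.Date(c("1795-06-29", "1791-03-05"))
)
p <- tenure_plot(toy)
```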
    

Another option I like is loading the data in a Google spreadsheet and reading it using the "googlesheets" package.

In Google Drive, create a new spreadsheet named, for instance, "Supreme Court". In the first worksheet, enter:

=importhtml("https://en.wikipedia.org/wiki/List_of_Justices_of_the_Supreme_Court_of_the_United_States", "table", 2)

This will automatically scrape this table into your Google spreadsheet.

From there, in R you can do:

library(googlesheets)
SC <- gs_title("Supreme Court")
gs_read(SC)
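
To address the caching question: one option is to save the parsed table locally with `saveRDS()` and only hit Wikipedia when no cached copy exists. This is a sketch assuming the rvest approach and the table position from above; the helper name and cache file name are arbitrary:

```r
# Minimal cache wrapper: reuse a local .rds copy of the table when present,
# and re-download only when the cache is missing or refresh = TRUE.
fetch_scotus <- function(url, cache_file = "scotus_table.rds", refresh = FALSE) {
  if (!refresh && file.exists(cache_file)) {
    return(readRDS(cache_file))             # offline: use the cached data.frame
  }
  tables <- rvest::html_nodes(rvest::read_html(url), "table")
  result <- rvest::html_table(tables[[2]])  # the justices table, as above
  saveRDS(result, cache_file)               # cache for next time
  result
}

# scotusTable <- fetch_scotus(scotusURL)
```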
answered Dec 18 '22 by A5C1D2H2I1M1N2O1R2T1