
Rvest read table with cells that span multiple rows

I'm trying to scrape an irregular table from Wikipedia using rvest. The table has cells that span multiple rows. The documentation for html_table clearly states that this is a limitation. I'm just wondering if there's a workaround.

The table looks like this (screenshot not included).

My code:

library(rvest)
url <- "https://en.wikipedia.org/wiki/Arizona_League"
parks <- url %>%
  read_html() %>%
  html_nodes(xpath='/html/body/div[3]/div[3]/div[4]/div/table[2]') %>%
  html_table(fill=TRUE) %>%  # fill=FALSE yields the same results
  .[[1]]

This returns a data frame (screenshot not included) with several errors. For example, row 4 under "City" should be "Mesa", not "Chicago Cubs". I'd be happy with blank cells, since I could "fill down" as needed, but wrong data is a problem. Help is much appreciated.

cory asked Jul 30 '19 19:07


1 Answer

Here is one way to code it. It is not perfect and a bit long, but it does the trick:

library(rvest)
url <- "https://en.wikipedia.org/wiki/Arizona_League"

# get the lines of the table
lines <- url %>%
  read_html() %>%
  html_nodes(xpath="//table[starts-with(@class, 'wikitable')]") %>%
  html_nodes(xpath = 'tbody/tr')

# define the empty table: one column per header cell, one row per tr node
ncol <-  lines %>%
  .[[1]] %>%
  html_children()%>%
  length()
nrow <- length(lines)
table <- as.data.frame(matrix(nrow = nrow,ncol = ncol))
   
# fill the table row by row
for(i in seq_len(nrow)){
  # text of the cells present in this row
  linecontent <- lines[[i]] %>%
    html_children() %>%
    html_text() %>%
    gsub("\n", "", .)

  # put the content into the columns that are still free (NA) on this row
  colselect <- is.na(table[i,])
  table[i,colselect] <- linecontent

  # rowspan of each cell in this row
  repetition <- lines[[i]] %>%
    html_children() %>%
    html_attr("rowspan") %>%
    ifelse(is.na(.), 1, .) %>% # no rowspan attribute means the cell covers a single row
    as.numeric()

  # copy cells with a rowspan down into the rows they cover
  freecols <- which(colselect) # column index of each cell of this row
  for(j in seq_along(repetition)){
    span <- repetition[j]
    if(span > 1){
      table[(i+1):(i+span-1), freecols[j]] <- linecontent[j]
    }
  }
}

The idea is to collect the HTML rows of the table in the lines variable by selecting the tr nodes. I then create an empty table: the number of columns is the number of children of the first row (because it contains the titles), and the number of rows is the length of lines. I fill it by hand in a for loop (I didn't manage to find a nicer way here).

The difficulty is that the number of cells a row provides changes when a cell from an earlier row already spans the current row. For example:

  lines[[3]]%>%
    html_children()%>%
    html_text()%>%
    gsub("\n","",.)

gives only 5 values:

[1] "Arizona League Athletics Gold" "Oakland Athletics"             "Mesa"                          "Fitch Park"                   
[5] "10,000"  

instead of 6, because the first column is East for 8 rows. The East value appears only in the first row it spans.
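To see the effect in isolation, here is a minimal sketch on a made-up two-column table (the HTML string is an assumption for illustration, not the Wikipedia markup). It counts the cells each tr node actually contains:

```r
library(rvest)

# Hypothetical table: the "East" cell spans two rows.
html <- '<table>
  <tr><th>Division</th><th>Team</th></tr>
  <tr><td rowspan="2">East</td><td>Angels</td></tr>
  <tr><td>Athletics</td></tr>
</table>'

rows <- read_html(html) %>% html_nodes(xpath = "//tr")

# Number of cells physically present in each row of the HTML
cells_per_row <- vapply(
  seq_along(rows),
  function(i) length(html_children(rows[[i]])),
  integer(1)
)
cells_per_row
```

The last row holds a single cell, because its first column is already covered by the rowspan of the row above.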

The trick is to copy cells down in the table when they have a rowspan attribute (meaning they span several rows). On the next row, we can then select only the columns that are still NA, so that the number of cells given by the HTML row matches the number of free columns in the table we are filling.

This is done with the colselect variable, a logical vector marking the free columns of the current row before the cells of that row are repeated down.
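The same idea can be wrapped in a function. The following is only a sketch under stated assumptions: table_with_rowspans is a hypothetical helper (not part of rvest), it assumes the first row has one cell per column, and it ignores colspan and nested tables. The demo HTML is made up.

```r
library(rvest)

# Hypothetical helper: expand rowspan cells of a single table node.
table_with_rowspans <- function(table_node) {
  rows  <- html_nodes(table_node, xpath = ".//tr")
  n_col <- length(html_children(rows[[1]]))  # assumes no merged cells in the header
  grid  <- matrix(NA_character_, nrow = length(rows), ncol = n_col)

  for (i in seq_along(rows)) {
    cells <- html_children(rows[[i]])
    text  <- trimws(html_text(cells))
    span  <- as.integer(html_attr(cells, "rowspan"))
    span[is.na(span)] <- 1L                  # no attribute: the cell covers one row

    free <- which(is.na(grid[i, ]))          # columns not claimed by a cell from above
    for (j in seq_len(min(length(text), length(free)))) {
      last <- min(i + span[j] - 1L, nrow(grid))
      grid[i:last, free[j]] <- text[j]       # write the cell and copy it down
    }
  }
  as.data.frame(grid, stringsAsFactors = FALSE)
}

# Made-up example with two overlapping rowspans
html <- '<table>
  <tr><th>Division</th><th>Team</th><th>City</th></tr>
  <tr><td rowspan="3">East</td><td>Angels</td><td>Tempe</td></tr>
  <tr><td>Athletics</td><td rowspan="2">Mesa</td></tr>
  <tr><td>Cubs</td></tr>
</table>'

demo <- read_html(html) %>% html_node("table") %>% table_with_rowspans()
demo
```

Using a character matrix instead of a data frame keeps the "is this slot still NA" test cheap and avoids the single-column drop that data frame indexing can cause.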

The result:

         V1                             V2                   V3         V4                                 V5       V6
1  Division                           Team      MLB Affiliation       City                            Stadium Capacity
2      East          Arizona League Angels   Los Angeles Angels      Tempe               Tempe Diablo Stadium    9,785
3      East  Arizona League Athletics Gold    Oakland Athletics       Mesa                         Fitch Park   10,000
4      East Arizona League Athletics Green    Oakland Athletics       Mesa                         Fitch Park   10,000
5      East          Arizona League Cubs 1         Chicago Cubs       Mesa                         Sloan Park   15,000
6      East          Arizona League Cubs 2         Chicago Cubs       Mesa                         Sloan Park   15,000
7      East    Arizona League Diamondbacks Arizona Diamondbacks Scottsdale Salt River Fields at Talking Stick   11,000
8      East    Arizona League Giants Black San Francisco Giants Scottsdale                 Scottsdale Stadium   12,000
9      East   Arizona League Giants Orange San Francisco Giants Scottsdale                 Scottsdale Stadium   12,000
10  Central    Arizona League Brewers Gold    Milwaukee Brewers    Phoenix  American Family Fields of Phoenix    8,000
11  Central Arizona League Dodgers Lasorda  Los Angeles Dodgers    Phoenix                    Camelback Ranch   12,000
12  Central    Arizona League Indians Blue    Cleveland Indians   Goodyear                  Goodyear Ballpark   10,000
13  Central        Arizona League Padres 2     San Diego Padres     Peoria              Peoria Sports Complex   12,882
14  Central            Arizona League Reds      Cincinnati Reds   Goodyear                  Goodyear Ballpark   10,000
15  Central       Arizona League White Sox    Chicago White Sox    Phoenix                    Camelback Ranch   12,000
16     West    Arizona League Brewers Blue    Milwaukee Brewers    Phoenix  American Family Fields of Phoenix    8,000
17     West    Arizona League Dodgers Mota  Los Angeles Dodgers    Phoenix                    Camelback Ranch   12,000
18     West     Arizona League Indians Red    Cleveland Indians   Goodyear                  Goodyear Ballpark   10,000
19     West        Arizona League Mariners     Seattle Mariners     Peoria              Peoria Sports Complex   12,882
20     West        Arizona League Padres 1     San Diego Padres     Peoria              Peoria Sports Complex   12,882
21     West         Arizona League Rangers        Texas Rangers   Surprise                   Surprise Stadium   10,500
22     West          Arizona League Royals   Kansas City Royals   Surprise                   Surprise Stadium   10,500
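In this output the column titles sit in row 1 and the capacities are still text. A possible follow-up, sketched here on a small stand-in data frame (scraped is a hypothetical name, with values copied from the first rows above), is to promote the first row to column names and convert the capacity to a number:

```r
# Toy stand-in for the scraped result: row 1 holds the column titles.
scraped <- data.frame(
  V1 = c("Division", "East", "East"),
  V2 = c("Team", "Arizona League Angels", "Arizona League Athletics Gold"),
  V3 = c("Capacity", "9,785", "10,000"),
  stringsAsFactors = FALSE
)

names(scraped) <- unlist(scraped[1, ])  # use the first row as column names
scraped <- scraped[-1, ]                # drop the header row from the data
rownames(scraped) <- NULL               # renumber the rows from 1

# strip the thousands separator before converting to numeric
scraped$Capacity <- as.numeric(gsub(",", "", scraped$Capacity))
scraped
```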

Edit

I made a shorter version of the function, with more explanation here

denis answered Oct 20 '22 14:10