Extract a specific table from Wikipedia in R

Tags:

r

I want to extract the 20th table from a Wikipedia page https://en.wikipedia.org/wiki/...

I currently use this code, but it only extracts the first table on the page.

library(rvest)

the_url <- "https://en.wikipedia.org/wiki/..."
# html_node() returns only the first match, so this always yields the first table
tb <- the_url %>%
  read_html() %>%
  html_node("table") %>%
  html_table(fill = TRUE)

What should I do to get the specific one? Thank you!!

Drick asked Sep 02 '25 05:09


2 Answers

Rather than indexing by position, which breaks if tables are added or moved, you can anchor on the table's relationship to the element with id Prize_money. Returning a single node with html_node() is also more efficient than collecting every match. Avoid long XPaths, as they can be fragile.

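library(rvest)

# Take the first table that follows the h4 heading holding the Prize_money anchor.
# The result is named prize_table to avoid masking base R's table() function.
prize_table <- read_html('https://en.wikipedia.org/wiki/2018_FIFA_World_Cup#Prize_money') %>%
  html_node(xpath = "//*[@id='Prize_money']/parent::h4/following-sibling::table[1]") %>%
  html_table(fill = TRUE)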
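
To see how the anchor-based XPath behaves without hitting the network, here is a minimal sketch that runs against an inline HTML string. The markup is a hypothetical stand-in, assumed to mirror the heading-then-table layout the XPath expects. It is not copied from the real article, whose structure may differ.

```r
library(rvest)

# Hypothetical miniature page (an assumption, not the real Wikipedia markup):
# an h4 heading wrapping the anchor, a paragraph, then two sibling tables.
html <- "<html><body>
  <h4><span id='Prize_money'>Prize money</span></h4>
  <p>Intro text between the heading and the table.</p>
  <table><tr><th>Position</th><th>Amount</th></tr>
         <tr><td>Champions</td><td>38</td></tr></table>
  <table><tr><th>Other</th></tr><tr><td>x</td></tr></table>
</body></html>"

page <- read_html(html)

# Climb from the anchor to its h4 parent, then take the first table sibling after it
prize_demo <- page %>%
  html_node(xpath = "//*[@id='Prize_money']/parent::h4/following-sibling::table[1]") %>%
  html_table()

print(prize_demo)
```

If the selector matches nothing, html_node() returns a missing node, so it is worth checking the result before passing it on to html_table().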

QHarr answered Sep 04 '25 20:09


Since you want to scrape one specific table, you can identify it in the html_nodes() call using the XPath of the page element:

library(dplyr)
library(rvest)

the_url <- "https://en.wikipedia.org/wiki/2018_FIFA_World_Cup"

the_url %>%
  read_html() %>% 
  html_nodes(xpath='/html/body/div[3]/div[3]/div[5]/div[1]/table[20]') %>% 
  html_table(fill=TRUE)
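
A shorter way to select "the n-th table" is to count tables in document order instead of spelling out the full div path. The sketch below shows two equivalent forms on a hypothetical inline page. The markup, the variable names and n = 2 are illustrative assumptions; for the page in the question n would be 20.

```r
library(rvest)

# Hypothetical page with three tables, one of them nested inside a div
html <- "<html><body>
  <table><tr><th>a</th></tr><tr><td>1</td></tr></table>
  <div><table><tr><th>b</th></tr><tr><td>2</td></tr></table></div>
  <table><tr><th>c</th></tr><tr><td>3</td></tr></table>
</body></html>"

page <- read_html(html)
n <- 2L  # position of the wanted table, counted in document order

# Option 1: collect every table node, then index the node set
tables <- html_nodes(page, "table")
by_index <- html_table(tables[[n]])

# Option 2: let XPath do the counting; the parentheses make [n] apply
# to all tables in the document rather than to siblings under one parent
by_xpath <- page %>%
  html_node(xpath = sprintf("(//table)[%d]", n)) %>%
  html_table()

print(by_index)
print(by_xpath)
```

Both forms still depend on the table's position, so they break if tables are added or removed above it, but they do not depend on the surrounding div nesting.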

DPH answered Sep 04 '25 21:09