Scraping html tables into R data frames using the XML package

People also ask

Which library package is used to import HTML table in R?

HTML tables are a standard way to display tabular information online. Getting HTML table data into R is fairly straightforward with the readHTMLTable() function of the XML package.

Which function in Rvest converts HTML tables into data frames?

In rvest: Easily Harvest (Scrape) Web Pages.

…or a shorter try:

library(XML)
library(RCurl)
library(rlist)
theurl <- getURL("https://en.wikipedia.org/wiki/Brazil_national_football_team",.opts = list(ssl.verifypeer = FALSE) )
tables <- readHTMLTable(theurl)
tables <- list.clean(tables, fun = is.null, recursive = FALSE)
n.rows <- unlist(lapply(tables, function(t) dim(t)[1]))

the picked table is the longest one on the page

tables[[which.max(n.rows)]]

library(RCurl)
library(XML)

# Download page using RCurl
# You may need to set proxy details, etc.,  in the call to getURL
theurl <- "http://en.wikipedia.org/wiki/Brazil_national_football_team"
webpage <- getURL(theurl)
# Process escape characters
webpage <- readLines(tc <- textConnection(webpage)); close(tc)

# Parse the html tree, ignoring errors on the page
pagetree <- htmlTreeParse(webpage, error=function(...){})

# Navigate your way through the tree. It may be possible to do this more efficiently using getNodeSet
body <- pagetree$children$html$children$body 
divbodyContent <- body$children$div$children[[1]]$children$div$children[[4]]
tables <- divbodyContent$children[names(divbodyContent)=="table"]

#In this case, the required table is the only one with class "wikitable sortable"  
tableclasses <- sapply(tables, function(x) x$attributes["class"])
thetable  <- tables[which(tableclasses=="wikitable sortable")]$table

#Get columns headers
headers <- thetable$children[[1]]$children
columnnames <- unname(sapply(headers, function(x) x$children$text$value))

# Get rows from table
content <- c()
for(i in 2:length(thetable$children))
{
   tablerow <- thetable$children[[i]]$children
   opponent <- tablerow[[1]]$children[[2]]$children$text$value
   others <- unname(sapply(tablerow[-1], function(x) x$children$text$value)) 
   content <- rbind(content, c(opponent, others))
}

# Convert to data frame
colnames(content) <- columnnames
as.data.frame(content)

Edited to add:

Sample output

                     Opponent Played Won Drawn Lost Goals for Goals against  % Won
    1               Argentina     94  36    24   34       148           150  38.3%
    2                Paraguay     72  44    17   11       160            61  61.1%
    3                 Uruguay     72  33    19   20       127            93  45.8%
    ...

Another option using Xpath.

library(RCurl)
library(XML)

theurl <- "http://en.wikipedia.org/wiki/Brazil_national_football_team"
webpage <- getURL(theurl)
webpage <- readLines(tc <- textConnection(webpage)); close(tc)

pagetree <- htmlTreeParse(webpage, error=function(...){}, useInternalNodes = TRUE)

# Extract table header and contents
tablehead <- xpathSApply(pagetree, "//*/table[@class='wikitable sortable']/tr/th", xmlValue)
results <- xpathSApply(pagetree, "//*/table[@class='wikitable sortable']/tr/td", xmlValue)

# Convert character vector to dataframe
content <- as.data.frame(matrix(results, ncol = 8, byrow = TRUE))

# Clean up the results
content[,1] <- gsub("Â ", "", content[,1])
tablehead <- gsub("Â ", "", tablehead)
names(content) <- tablehead

Produces this result

> head(content)
   Opponent Played Won Drawn Lost Goals for Goals against % Won
1 Argentina     94  36    24   34       148           150 38.3%
2  Paraguay     72  44    17   11       160            61 61.1%
3   Uruguay     72  33    19   20       127            93 45.8%
4     Chile     64  45    12    7       147            53 70.3%
5      Peru     39  27     9    3        83            27 69.2%
6    Mexico     36  21     6    9        69            34 58.3%

The rvest along with xml2 is another popular package for parsing html web pages.

library(rvest)
theurl <- "http://en.wikipedia.org/wiki/Brazil_national_football_team"
file<-read_html(theurl)
tables<-html_nodes(file, "table")
table1 <- html_table(tables[4], fill = TRUE)

The syntax is easier to use than the xml package and for most web pages the package provides all of the options ones needs.

Related questions
                            
                                How to style readonly attribute with CSS?
                            
                                How do I make text bold in HTML?
                            
                                What's the easiest way to escape HTML in Python?
                            
                                Prevent tabstop on A element (anchor link) in HTML
                            
                                Wrap text in <td> tag
                            
                                How do I fit an image (img) inside a div and keep the aspect ratio?
                            
                                Are single quotes allowed in HTML?
                            
                                Disable scrolling on `<input type=number>`
                            
                                Css height in percent not working [duplicate]
                            
                                FontAwesome icons are not showing, Why? [closed]
                            
                                How can I select item with class within a DIV?
                            
                                Position absolute and overflow hidden
                            
                                Prevent contenteditable adding <div> on ENTER - Chrome
                            
                                How to make CSS3 rounded corners hide overflow in Chrome/Opera
                            
                                Can multiple different HTML elements have the same ID if they're different elements?
                            
                                Force page scroll position to top at page refresh in HTML
                            
                                Remove scrollbar from iframe
                            
                                How do you determine what technology a website is built on? [closed]
                            
                                css 'pointer-events' property alternative for IE
                            
                                Input with display:block is not a block, why not?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Scraping html tables into R data frames using the XML package

Tags:

html

parsing

r

xml

web-scraping

People also ask

Recent Activity

Donate For Us