I have a set of html pages. I want to extract all table nodes where the attribute "border" = 1. Here is an example:
<table border="1" cellspacing="0" cellpadding="5">
<tbody><tr><td>
<table border="0" cellpadding="2" cellspacing="0">
<tbody><tr>
<td bgcolor="#ff9999"><strong><font size="+1">CASEID</font></strong></td>
</tr></tbody>
</table>
<tr><td>[tbody]
</table>
In the example, I want to select the table node where border=1 but not the tables where border=0. I am using html_nodes() from rvest but can't figure out how to add attributes:
html_nodes(x, "table")
Check out the CSS3 selectors documentation that's linked from the documentation of html_nodes. It provides a thorough explanation of the CSS selector syntax.
For your case, you want
html_nodes(x, "tag[attribute]")
to select all tag elements that have attribute set, or
html_nodes(x, "tag[attribute=value]")
to select all tag elements whose attribute is set to value.
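As a self-contained sketch of the two selector forms, the snippet below builds a small document and compares them. The HTML string is a simplified, hypothetical stand-in for the pages in the question (an outer table with border="1" wrapping an inner table with border="0"), and the variable names has_border and border_one are only illustrative.

```r
library(rvest)

# Hypothetical stand-in for one of the pages: an outer table with border="1"
# wrapping an inner table with border="0".
x <- read_html('
<table border="1" cellspacing="0" cellpadding="5">
  <tr><td>
    <table border="0" cellpadding="2" cellspacing="0">
      <tr><td>CASEID</td></tr>
    </table>
  </td></tr>
</table>')

# "tag[attribute]": any table that has a border attribute, whatever its value
has_border <- html_nodes(x, "table[border]")
length(has_border)
#> [1] 2

# "tag[attribute=value]": only tables whose border attribute is exactly "1"
border_one <- html_nodes(x, "table[border='1']")
length(border_one)
#> [1] 1
html_attr(border_one, "border")
#> [1] "1"
```

Quoting the value, as in [border='1'], is the safer habit: CSS only allows an unquoted attribute value when it is a valid identifier, which a bare digit is not.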
There are two main ways to find nodes in HTML and similar documents: CSS selectors and XPath. CSS is often easier to write but can't handle some more complex cases, whereas XPath has functions that can do things like search the text within a node. Which one to use is always up for debate, but I think it's worthwhile to try them both.
library(rvest)
with_css <- html_nodes(x, css = "table[border='1']")
with_css
#> {xml_nodeset (1)}
#> [1] <table border="1" cellspacing="0" cellpadding="5"><tbody>\n<tr><td>\n ...
Verifying that the table looks right:
html_table(with_css, fill = TRUE)
#> [[1]]
#> X1 X2
#> 1 CASEID CASEID
#> 2 CASEID <NA>
#> 3 [tbody] <NA>
The equivalent XPath gets the same table.
with_xpath <- html_nodes(x, xpath = "//table[@border=1]")
with_xpath
#> {xml_nodeset (1)}
#> [1] <table border="1" cellspacing="0" cellpadding="5"><tbody>\n<tr><td>\n ...
html_table(with_xpath, fill = TRUE)
#> [[1]]
#> X1 X2
#> 1 CASEID CASEID
#> 2 CASEID <NA>
#> 3 [tbody] <NA>
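To illustrate the kind of text search that XPath allows and CSS selectors do not, here is a minimal sketch using the XPath contains() function. The two-table document is made up for the example, and with_text is just an illustrative name.

```r
library(rvest)

# Hypothetical document: two border="1" tables, only one mentions CASEID
x <- read_html('
<table border="1"><tr><td>CASEID</td></tr></table>
<table border="1"><tr><td>OTHER</td></tr></table>')

# Keep tables with border="1" that contain a cell whose text includes "CASEID"
with_text <- html_nodes(
  x,
  xpath = "//table[@border='1'][.//td[contains(., 'CASEID')]]"
)
length(with_text)
#> [1] 1
```

The second predicate filters on a descendant td, so the attribute test and the text test are combined in a single expression.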