Similar to How to deal with single quote in xpath, I want to escape single quotes. The difference is that I can't exclude the possibility that a double quote might also appear in the target string.
Goal:
Escape double and single quotes simultaneously in XPath (in R). The target string should be passed as a variable and not be hard-coded as in one of the existing answers. It has to be a variable because I do not know its content beforehand: it could contain single quotes, double quotes, or both.
Works:
library(rvest)
library(magrittr)
html <- "<div>1</div><div>Father's son</div>"
target <- "Father's son"
html %>% xml2::read_html() %>% html_nodes(xpath = paste0("//*[contains(text(), \"", target,"\")]"))
{xml_nodeset (1)}
[1] <div>Father's son</div>
Does not work:
html <- "<div>1</div><div>Fat\"her's son</div>"
target <- "Fat\"her's son"
html %>% xml2::read_html() %>% html_nodes(xpath = paste0("//*[contains(text(), \"", target,"\")]"))
{xml_nodeset (0)}
Warning message:
In xpath_search(x$node, x$doc, xpath = xpath, nsMap = ns, num_results = Inf) :
Invalid expression [1207]
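Printing the expression that paste0 builds shows what goes wrong. The double quote inside target closes the XPath string literal early, and XPath 1.0 has no escape character inside string literals. A minimal check in base R, reusing the target from above:

```r
# Build the same expression as in the failing call and print it as XPath sees it
target <- "Fat\"her's son"
xpath <- paste0("//*[contains(text(), \"", target, "\")]")
writeLines(xpath)
#> //*[contains(text(), "Fat"her's son")]
```

The literal ends right after Fat, so the rest of the expression is not valid XPath.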
Non-R answers that I could try to "translate to R" are very welcome.
The key here is realising that with xml2 you can write back into the parsed html with html-escaped characters. This function will do the trick. It's longer than it needs to be because I've included comments and some type checking / converting logic.
contains_text <- function(node_set, find_this)
{
  # Ensure we have a nodeset
  if(all(class(node_set) == c("xml_document", "xml_node")))
    node_set %<>% xml_children()
  if(!inherits(node_set, "xml_nodeset"))
    stop("contains_text requires an xml_nodeset or xml_document.")
  # Get all leaf nodes
  node_set %<>% html_nodes(xpath = "//*[not(*)]")
  # HTML escape the target string
  find_this %<>% {gsub("\"", "&quot;", .)}
  # Extract, HTML escape and replace the nodes
  lapply(node_set, function(node) xml_text(node) %<>% {gsub("\"", "&quot;", .)})
  # Now we can define the xpath and extract our target nodes
  xpath <- paste0("//*[contains(text(), \"", find_this, "\")]")
  new_nodes <- html_nodes(node_set, xpath = xpath)
  # Since the underlying xml_document is passed by pointer internally,
  # we should unescape any text to leave it unaltered
  xml_text(node_set) %<>% {gsub("&quot;", "\"", .)}
  return(new_nodes)
}
Now:
library(rvest)
library(xml2)
library(magrittr)  # provides the %<>% operator used inside contains_text
html %>% xml2::read_html() %>% contains_text(target)
#> {xml_nodeset (1)}
#> [1] <div>Fat"her's son</div>
html %>% xml2::read_html() %>% contains_text(target) %>% xml_text()
#> [1] "Fat\"her's son"
ADDENDUM
This is an alternative method, which is an implementation of the method suggested by @Alejandro but allows arbitrary targets. It has the merit of leaving the xml document untouched, and is a little faster than the above method, but involves the kind of string parsing that an xml library is supposed to prevent. It works by taking the target, splitting it after each " and ', then enclosing each fragment in the opposite type of quote to the one it contains, before pasting them all back together with commas and inserting them into an XPath concat() function.
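library(stringr)
library(magrittr)  # for %<>% and %>%
safe_xpath <- function(target)
{
  # Mark every quote with a placeholder entity, then split after each one
  target %<>%
    str_replace_all("\"", "&quot;&break;") %>%
    str_replace_all("'", "&apo;&break;") %>%
    str_split("&break;") %>%
    unlist()
  # Classify the pieces by the kind of quote they end with (if any)
  safe_pieces <- grep("(&quot;)|(&apo;)", target, invert = TRUE)
  contain_quotes <- grep("&quot;", target)
  contain_apo <- grep("&apo;", target)
  if(length(safe_pieces) > 0)
    target[safe_pieces] <- paste0("\"", target[safe_pieces], "\"")
  if(length(contain_quotes) > 0)
  {
    # Pieces holding a double quote go inside single quotes
    target[contain_quotes] <- paste0("'", target[contain_quotes], "'")
    target[contain_quotes] <- gsub("&quot;", "\"", target[contain_quotes])
  }
  if(length(contain_apo) > 0)
  {
    # Pieces holding an apostrophe go inside double quotes
    target[contain_apo] <- paste0("\"", target[contain_apo], "\"")
    target[contain_apo] <- gsub("&apo;", "'", target[contain_apo])
  }
  # XPath concat() needs at least two arguments, so pad quote-free targets
  if(length(target) == 1)
    target <- c(target, "\"\"")
  fragment <- paste0(target, collapse = ",")
  return(paste0("//*[contains(text(),concat(", fragment, "))]"))
}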
Now we can generate a valid xpath like this:
safe_xpath(target)
#> [1] "//*[contains(text(),concat('Fat\"',\"her'\",\"s son\"))]"
so that
html %>% xml2::read_html() %>% html_nodes(xpath = safe_xpath(target))
#> {xml_nodeset (1)}
#> [1] <div>Fat"her's son</div>
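For comparison, the same concat idea can be sketched with base R only. This is a minimal sketch and not the method from either approach above: xpath_literal is a hypothetical helper name, and it relies on the fact that an XPath 1.0 string literal cannot contain its own delimiter, so a string holding both quote types has to be assembled with concat.

```r
# Hypothetical helper (not part of xml2 or rvest): turn an arbitrary string
# into an XPath 1.0 string literal using base R only.
xpath_literal <- function(x) {
  stopifnot(is.character(x), length(x) == 1L)
  # No double quote inside: a double-quoted literal is safe
  if (!grepl("\"", x, fixed = TRUE)) return(paste0("\"", x, "\""))
  # No apostrophe inside: a single-quoted literal is safe
  if (!grepl("'", x, fixed = TRUE)) return(paste0("'", x, "'"))
  # Both quote types: each double quote closes the current double-quoted
  # piece, is emitted as the single-quoted literal '"', and opens a new piece
  sep <- paste0("\",", "'\"'", ",\"")
  paste0("concat(\"", gsub("\"", sep, x, fixed = TRUE), "\")")
}

target <- "Fat\"her's son"
xpath <- paste0("//*[contains(text(), ", xpath_literal(target), ")]")
writeLines(xpath)
#> //*[contains(text(), concat("Fat",'"',"her's son"))]
```

The resulting string can be passed to html_nodes(xpath = ...) in the same way as the output of safe_xpath. Because quote-free targets never reach concat here, the two-argument minimum of concat is not an issue.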