Using XML in R: removing \n and \t from xpathSApply results
I can scrape the URL I need, but when I use xpathSApply on it, R returns unwanted \n and \t characters (newlines and tabs). Here is an example:
library(XML)
doc <- htmlTreeParse("http://www.milesstockbridge.com/offices/", useInternal = TRUE) # scrape and parse an HTML site
xpathSApply(doc, "//div[@class='info']//h3", xmlValue)
[1] "\n\t\t\t\t\t\tBaltimore\t\t\t\t\t" "\n\t\t\t\t\t\tCambridge\t\t\t\t\t" "\n\t\t\t\t\t\tEaston\t\t\t\t\t" "\n\t\t\t\t\t\tFrederick\t\t\t\t\t"
[5] "\n\t\t\t\t\t\tRockville\t\t\t\t\t" "\n\t\t\t\t\t\tTowson\t\t\t\t\t" "\n\t\t\t\t\t\tTysons Corner\t\t\t\t\t" "\n\t\t\t\t\t\tWashington\t\t\t\t\t"
As explained in this question (how to delete the \n\t\t\t in the result from website data collection?), regex functions can easily remove the unwanted formatting, but I would rather have XPath do the work first, if possible (I have hundreds of these to parse).
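For reference, a minimal regex sketch of that post-processing approach (assuming the raw result above is stored in x):

x <- xpathSApply(doc, "//div[@class='info']//h3", xmlValue)
gsub("^\\s+|\\s+$", "", x) # strip leading/trailing whitespace, including the \n and \t runs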
Also, there are apparently functions such as translate, as in this question (Using the Translate function to remove newline characters in xml, but how do I ignore certain tags?), as well as strip(), which I saw in a Python question. I do not know which of these are available when using R and XPath. It may be that a text() function would help, but I do not know how to include it in my xpathSApply expression. Likewise with normalize-space().
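For example, something like the following sketch might be what I'm after (this assumes the XML package can evaluate XPath string expressions such as normalize-space() relative to each matched node):

# A sketch: evaluate normalize-space() against each matched node
nodes <- getNodeSet(doc, "//div[@class='info']//h3")
sapply(nodes, function(n) xpathSApply(n, "normalize-space(.)"))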
You just want the trim = TRUE argument in your xmlValue() call.
> xpathSApply(doc, "//div[@class='info']//h3", xmlValue, trim = TRUE)
#[1] "Baltimore" "Cambridge" "Easton"
#[4] "Frederick" "Rockville" "Towson"
#[7] "Tysons Corner" "Washington"
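With trim = TRUE, xmlValue() strips the leading and trailing whitespace (including the \n and \t runs) from each node's text, so no regex post-processing or normalize-space() expression is needed.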