
With R and XPath, how do you remove format elements such as \n and \t from the results?

Tags: html, r, xml, xpath

Using the XML package I can scrape the page I need, but when I call xpathSApply on it, R returns unwanted \n and \t characters (newlines and tabs). Here is an example:

library(XML)
doc <- htmlTreeParse("http://www.milesstockbridge.com/offices/", useInternalNodes = TRUE) # download and parse the HTML page
xpathSApply(doc, "//div[@class='info']//h3", xmlValue) 
[1] "\n\t\t\t\t\t\tBaltimore\t\t\t\t\t"     "\n\t\t\t\t\t\tCambridge\t\t\t\t\t"     "\n\t\t\t\t\t\tEaston\t\t\t\t\t"        "\n\t\t\t\t\t\tFrederick\t\t\t\t\t"    
[5] "\n\t\t\t\t\t\tRockville\t\t\t\t\t"     "\n\t\t\t\t\t\tTowson\t\t\t\t\t"        "\n\t\t\t\t\t\tTysons Corner\t\t\t\t\t" "\n\t\t\t\t\t\tWashington\t\t\t\t\t" 

As explained in the question "How to delete the \n\t\t\t in the result from website data collection?", regex functions can easily remove the unwanted formatting characters, but I would rather have XPath do the work first, if possible (I have hundreds of these to parse).

There are also functions such as translate(), apparently, as in the question "Using the Translate function to remove newline characters in xml, but how do I ignore certain tags?", as well as strip(), which I saw in a Python question. I do not know which of these are available when using R and XPath.

A text() function may help, but I do not know how to include it in my xpathSApply expression. The same goes for normalize-space().

lawyeR asked Oct 01 '22 02:10



1 Answer

You just want the trim = TRUE argument in your xmlValue() call.

> xpathSApply(doc, "//div[@class='info']//h3", xmlValue, trim = TRUE) 
#[1] "Baltimore"     "Cambridge"     "Easton"       
#[4] "Frederick"     "Rockville"     "Towson"       
#[7] "Tysons Corner" "Washington"  
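
If the site is unreachable, the same behaviour can be checked on a small inline document. This is only a sketch: the HTML string below is a made-up stand-in for the real page (same class name and heading level, similar newline/tab padding), and the trimws() line is a base R alternative rather than part of the XML package.

library(XML)

# Hypothetical stand-in for the page in the question.
html <- paste0(
  "<html><body><div class='info'>",
  "<h3>\n\t\t\tBaltimore\t\t</h3>",
  "<h3>\n\t\t\tTysons Corner\t\t</h3>",
  "</div></body></html>"
)
doc <- htmlParse(html, asText = TRUE)

# Default behaviour: the padding comes back as part of each value.
raw <- xpathSApply(doc, "//div[@class='info']//h3", xmlValue)

# trim = TRUE is passed through xpathSApply's ... to xmlValue().
trimmed <- xpathSApply(doc, "//div[@class='info']//h3", xmlValue, trim = TRUE)

# Base R alternative, applied after extraction.
trimmed2 <- trimws(raw)

Both approaches strip only leading and trailing whitespace, so the space inside "Tysons Corner" is kept.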
Rich Scriven answered Oct 03 '22 01:10
