Using XML in R: removing \n and \t from xpathSApply results
I can scrape the URL I need, but when I use xpathSApply on it, R returns unwanted \n and \t characters (newlines and tabs). Here is an example:
library(XML)
doc <- htmlTreeParse("http://www.milesstockbridge.com/offices/", useInternal = TRUE) # scrape and parse an HTML site
xpathSApply(doc, "//div[@class='info']//h3", xmlValue)
[1] "\n\t\t\t\t\t\tBaltimore\t\t\t\t\t" "\n\t\t\t\t\t\tCambridge\t\t\t\t\t" "\n\t\t\t\t\t\tEaston\t\t\t\t\t" "\n\t\t\t\t\t\tFrederick\t\t\t\t\t"
[5] "\n\t\t\t\t\t\tRockville\t\t\t\t\t" "\n\t\t\t\t\t\tTowson\t\t\t\t\t" "\n\t\t\t\t\t\tTysons Corner\t\t\t\t\t" "\n\t\t\t\t\t\tWashington\t\t\t\t\t"
As explained in this question (how to delete the \n\t\t\t in the result from website data collection?), regex functions can easily remove the unwanted formatting, but I would rather have XPath do the work first, if possible (I have hundreds of these to parse).
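For reference, a minimal regex sketch of that post-processing approach (assuming the raw result above is stored in x):

x <- xpathSApply(doc, "//div[@class='info']//h3", xmlValue)
gsub("^\\s+|\\s+$", "", x) # strip leading/trailing whitespace, including the \n and \t runs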
Also, there are apparently functions such as translate, as in this question (Using the Translate function to remove newline characters in xml, but how do I ignore certain tags?), as well as strip(), which I saw in a Python question. I do not know which of these are available when using R and XPath. It may be that a text() function would help, but I do not know how to include it in my xpathSApply expression. Likewise with normalize-space().
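For example, something like the following sketch might be what I'm after (this assumes the XML package can evaluate XPath string expressions such as normalize-space() relative to each matched node):

# A sketch: evaluate normalize-space() against each matched node
nodes <- getNodeSet(doc, "//div[@class='info']//h3")
sapply(nodes, function(n) xpathSApply(n, "normalize-space(.)"))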
You just want the trim = TRUE argument in your xmlValue() call.
> xpathSApply(doc, "//div[@class='info']//h3", xmlValue, trim = TRUE)
#[1] "Baltimore" "Cambridge" "Easton"
#[4] "Frederick" "Rockville" "Towson"
#[7] "Tysons Corner" "Washington"
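With trim = TRUE, xmlValue() strips the leading and trailing whitespace (including the \n and \t runs) from each node's text, so no regex post-processing or normalize-space() expression is needed.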