I'm trying to read web page source into R and process it as strings. Specifically, I want to extract the paragraphs and remove the HTML tags from the paragraph text.
I tried implementing a function to remove the HTML tags:
library(stringr)  # provides str_locate_all and str_replace_all
cleanFun=function(fullStr)
{
  #find location of tags and citations
  tagLoc=cbind(str_locate_all(fullStr,"<")[[1]][,2],str_locate_all(fullStr,">")[[1]][,1])
  #create storage for tag strings
  tagStrings=list()
  #extract and store tag strings
  for(i in 1:dim(tagLoc)[1])
  {
    tagStrings[i]=substr(fullStr,tagLoc[i,1],tagLoc[i,2])
  }
  #remove tag strings from paragraph
  newStr=fullStr
  for(i in 1:length(tagStrings))
  {
    newStr=str_replace_all(newStr,tagStrings[[i]][1],"")
  }
  return(newStr)
}
This works for some tags but not all. One example where it fails is the following string:
test="junk junk<a href=\"/wiki/abstraction_(mathematics)\" title=\"abstraction (mathematics)\"> junk junk"
The goal would be to obtain:
cleanFun(test)="junk junk junk junk"
However, this doesn't seem to work. I thought it might be something to do with string length or escape characters, but I couldn't find a solution involving those.
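A likely explanation, offered as a sketch rather than a confirmed diagnosis: str_replace_all treats its pattern as a regular expression by default, so the parentheses in the extracted tag act as grouping operators instead of literal characters, and the pattern never matches the original text. The snippet below reproduces the same effect with base R's gsub, which is only an analogue of the stringr call; the variable names tag, asRegex and asFixed are made up for illustration.

```r
# The failing input from the question
test <- "junk junk<a href=\"/wiki/abstraction_(mathematics)\" title=\"abstraction (mathematics)\"> junk junk"

# The tag exactly as the loop in cleanFun would extract it
tag <- "<a href=\"/wiki/abstraction_(mathematics)\" title=\"abstraction (mathematics)\">"

# Interpreted as a regex, "(mathematics)" is a group that matches the text
# "mathematics" without parentheses, so nothing in `test` matches
asRegex <- gsub(tag, "", test)

# Interpreted as a literal string, the tag matches and is removed
asFixed <- gsub(tag, "", test, fixed = TRUE)

print(identical(asRegex, test))  # TRUE: the string is unchanged
print(asFixed)                   # "junk junk junk junk"
```

In stringr, the corresponding change would be to wrap the pattern in fixed(), i.e. str_replace_all(newStr, fixed(tagStrings[[i]][1]), "").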
This can be achieved simply through regular expressions and the grep family:
cleanFun <- function(htmlString) { return(gsub("<.*?>", "", htmlString)) }
This will also work with multiple HTML tags in the same string!
It finds every instance of the pattern <.*?> in htmlString and replaces it with the empty string "". The ? in .*? makes the match non-greedy, so if you have multiple tags (e.g., <a> junk </a>) it will match <a> and </a> separately instead of the whole string.
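To make the greedy versus non-greedy difference concrete, here is a minimal comparison on the <a> junk </a> example. The variable names html, greedy and lazy are illustrative only.

```r
html <- "<a> junk </a>"

# Greedy: .* runs from the first "<" to the last ">", swallowing everything
greedy <- gsub("<.*>", "", html)

# Non-greedy: .*? stops at the first ">", so each tag is matched on its own
lazy <- gsub("<.*?>", "", html)

print(greedy)  # ""
print(lazy)    # " junk "
```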
You can also do this with two functions in the rvest package:
library(rvest)
strip_html <- function(s) {
html_text(read_html(s))
}
Example output:
> strip_html("junk junk<a href=\"/wiki/abstraction_(mathematics)\" title=\"abstraction (mathematics)\"> junk junk")
[1] "junk junk junk junk"
Note that you should not use regexes to parse HTML.
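One illustration of why, as a small sketch: a regex has no notion of attribute quoting, so a ">" inside an attribute value ends the match too early. The input string tricky below is a made-up example, and cleanFun is the gsub-based one-liner from the regex answer.

```r
# The regex-based tag stripper
cleanFun <- function(htmlString) gsub("<.*?>", "", htmlString)

# Hypothetical input: the title attribute itself contains a ">"
tricky <- "<a title=\"a > b\">text</a>"

# The first match ends at the ">" inside the attribute, leaving debris behind
result <- cleanFun(tricky)
print(result)  # " b\">text"
```

An HTML parser such as the one behind read_html is designed to handle quoted attribute values, which is why the parser-based approach is the safer choice for arbitrary pages.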
Another approach uses tm.plugin.webmining, which uses XML internally.
> library(tm.plugin.webmining)
> extractHTMLStrip("junk junk<a href=\"/wiki/abstraction_(mathematics)\" title=\"abstraction (mathematics)\"> junk junk")
[1] "junk junk junk junk"