
Removing html tags from a string in R

Tags: string, r

I'm trying to read web page source into R and process it as strings: I want to extract the paragraphs and then remove the HTML tags from the paragraph text. I'm running into the following problem.

I tried implementing a function to remove the html tags:

library(stringr)  # provides str_locate_all and str_replace_all

cleanFun=function(fullStr)
{
 #find location of tags and citations
 tagLoc=cbind(str_locate_all(fullStr,"<")[[1]][,2],str_locate_all(fullStr,">")[[1]][,1]);

 #create storage for tag strings
 tagStrings=list()

 #extract and store tag strings
 for(i in 1:dim(tagLoc)[1])
 {
   tagStrings[i]=substr(fullStr,tagLoc[i,1],tagLoc[i,2]);
 }

 #remove tag strings from paragraph
 newStr=fullStr
 for(i in 1:length(tagStrings))
 {
   newStr=str_replace_all(newStr,tagStrings[[i]][1],"")
 }
 return(newStr)
};

This works for some tags but not all. An example where it fails is the following string:

test="junk junk<a href=\"/wiki/abstraction_(mathematics)\" title=\"abstraction (mathematics)\"> junk junk"

The goal would be to obtain:

cleanFun(test)="junk junk junk junk"

However, this doesn't seem to work. I thought it might be something to do with string length or escape characters, but I couldn't find a solution involving those.

Ryan Warnick asked Jun 21 '13 03:06


3 Answers

This can be achieved simply through regular expressions and the grep family:

cleanFun <- function(htmlString) {
  return(gsub("<.*?>", "", htmlString))
}

This will also work with multiple HTML tags in the same string!

This replaces each match of the pattern <.*?> in htmlString with the empty string "". The ? in .*? makes the match non-greedy, so if the string contains multiple tags (e.g., <a> junk </a>), the pattern matches <a> and </a> separately instead of everything from the first < to the last >.
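To see both points concretely, here is a minimal base-R sketch. The first part compares the greedy and non-greedy patterns on a made-up input (the `html` string is illustrative, not from the question). The second part shows a likely reason the function in the question leaves its test string unchanged: the extracted tag is passed on as a regular expression, so the parentheses in "(mathematics)" act as a group rather than as literal characters. The variable names are assumptions chosen for this example.

```r
# Part 1: greedy vs non-greedy matching on an illustrative string.
html <- "<a>junk</a> and <b>more</b>"

# Greedy: .* runs to the last ">", so the whole string counts as one "tag".
greedy <- gsub("<.*>", "", html)

# Non-greedy: .*? stops at the first ">", so each tag is removed on its own.
lazy <- gsub("<.*?>", "", html)

print(greedy)  # ""
print(lazy)    # "junk and more"

# Part 2: the test string from the question.
test <- "junk junk<a href=\"/wiki/abstraction_(mathematics)\" title=\"abstraction (mathematics)\"> junk junk"

# Extract the tag text, as the original function does with substr().
tag <- regmatches(test, regexpr("<[^>]*>", test))

# Used as a regex, "(mathematics)" is a group that matches "mathematics"
# without the parentheses, so nothing in the string matches.
as_regex <- gsub(tag, "", test)

# Used as a literal string, the tag matches and is removed.
as_literal <- gsub(tag, "", test, fixed = TRUE)

print(identical(as_regex, test))  # TRUE: string unchanged
print(as_literal)                 # "junk junk junk junk"
```

If you prefer to keep the stringr-based function, wrapping the pattern in stringr's fixed() should have the same effect as fixed = TRUE here.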

Scott Ritchie answered Sep 29 '22 11:09


You can also do this with two functions in the rvest package:

library(rvest)

strip_html <- function(s) {
    html_text(read_html(s))
}

Example output:

> strip_html("junk junk<a href=\"/wiki/abstraction_(mathematics)\" title=\"abstraction (mathematics)\"> junk junk")
[1] "junk junk junk junk"

Note that you should not use regexes to parse HTML.
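To make that caveat concrete, here is a small base-R sketch of inputs where a tag-stripping regex gives a wrong result. The regex function is redefined so the block runs on its own, and the two input strings are made up for illustration. A parser-based approach is expected to handle both cases, though that is not run here.

```r
# Regex-based tag stripper, repeated here so the example is self-contained.
cleanFun <- function(htmlString) gsub("<.*?>", "", htmlString)

# Case 1: a ">" inside an attribute value ends the match too early,
# so part of the attribute leaks into the output.
tricky <- "<a title=\"a > b\">link</a>"
leaked <- cleanFun(tricky)
print(leaked)  # " b\">link"

# Case 2: character entities are left encoded, because the regex only
# deletes tags and knows nothing about HTML escaping.
entity <- cleanFun("<p>AT&amp;T</p>")
print(entity)  # "AT&amp;T"
```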

David Robinson answered Sep 29 '22 12:09


Another approach is to use tm.plugin.webmining, which uses XML internally:

> library(tm.plugin.webmining)
> extractHTMLStrip("junk junk<a href=\"/wiki/abstraction_(mathematics)\" title=\"abstraction (mathematics)\"> junk junk")
[1] "junk junk junk junk"

Peyton answered Sep 29 '22 12:09