Remove anything within a pair of parentheses using gsub in R

Tags: regex, r

Suppose I have a string like the one below:

<a>b<c>

I want to remove both <a> and <c>, but I can't use gsub("<.*>","","<a>b<c>") because the greedy .* matches from the first < to the last >, so the b is removed as well.

I asked a similar question before, but on second thought, I think I should learn how to deal with this kind of problem in general. Thanks.

lokheart asked Aug 14 '11 14:08



2 Answers

Don't allow a closing bracket > in the stuff between the brackets:

z <- "<a>b<c>"
gsub("<[^>]+>","",z)
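
As a quick check of why this works, the sketch below compares the greedy pattern from the question with the negated character class. The variable names greedy and negated are only illustrative.

```r
z <- "<a>b<c>"

# Greedy: .* runs on to the LAST ">" in the string, so the whole string matches
greedy <- gsub("<.*>", "", z)

# Negated class: [^>]+ cannot cross a ">", so each tag is matched separately
negated <- gsub("<[^>]+>", "", z)

print(greedy)   # ""
print(negated)  # "b"
```

Note that [^>]+ requires at least one character between the brackets, so an empty pair <> is left alone; use [^>]* instead if those should be removed too.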

Ben Bolker answered Oct 31 '22 19:10


You can use a non-greedy regex, e.g. /<.*?>/.
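
In R the pattern is passed as a plain string, without the surrounding slashes. A minimal sketch, reusing the string from the question (the name lazy is only illustrative):

```r
z <- "<a>b<c>"

# Non-greedy: .*? stops at the FIRST ">" after each "<"
lazy <- gsub("<.*?>", "", z)

print(lazy)  # "b"
```

The same call should also work with perl = TRUE if you prefer the PCRE engine.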

This will only work for simple HTML and can be easily subverted. Consider the following HTML, which cannot easily be removed using regular expressions.

<span title="Help > Index">
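
To make the failure concrete, here is a small sketch that applies both patterns from this page to that tag. The variable names are only illustrative. Each pattern stops at the > inside the attribute value and leaves part of the tag behind:

```r
tag <- '<span title="Help > Index">'

# Both patterns end their match at the ">" inside the title attribute
lazy_result    <- gsub("<.*?>", "", tag)
negated_result <- gsub("<[^>]+>", "", tag)

print(lazy_result)     # " Index\">"
print(negated_result)  # " Index\">"
```

For input like this, an HTML parser is usually a safer route than a regular expression.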

a'r answered Oct 31 '22 19:10