
Regular expression to find and remove duplicate words

Tags: string, c#, regex

Using regular expressions in C#, is there any way to find and remove duplicate words or symbols in a string containing a variety of words and symbols?

Example:

Initial string of words:

"I like the environment. The environment is good."

Desired string:

"I like the environment. is good"

Duplicates removed: "the", "environment", "."

triniMahn asked Jun 29 '09 14:06



2 Answers

As said by others, you need more than a regex to keep track of words:

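using System.Collections.Generic;
using System.Text.RegularExpressions;

// Upper-cased words seen so far; Add returns false for a repeat.
var words = new HashSet<string>();
string text = "I like the environment. The environment is good.";
// Keep the first occurrence of each word and blank out later ones.
text = Regex.Replace(text, "\\w+", m =>
                     words.Add(m.Value.ToUpperInvariant())
                         ? m.Value
                         : String.Empty);
// text is now "I like the environment.   is good."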
Per Erik Stendahl answered Sep 24 '22 00:09
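
The snippet in this answer only looks at \w+ tokens, so the duplicate "." from the question stays in place and each removed word leaves a gap of extra spaces. The following is a minimal sketch of one way to extend the same HashSet idea, not part of the original answer. The class name DuplicateRemover, the method RemoveDuplicates, and the token pattern are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class DuplicateRemover
{
    // Hypothetical helper (not from the original answer).
    public static string RemoveDuplicates(string text)
    {
        // Case-insensitive set, so "The" counts as a repeat of "the".
        var seen = new HashSet<string>(StringComparer.OrdinalIgnoreCase);

        // Assumed token definition: a run of word characters,
        // or a single symbol that is neither a word character nor whitespace.
        string result = Regex.Replace(text, @"\w+|[^\w\s]",
            m => seen.Add(m.Value) ? m.Value : string.Empty);

        // Collapse the gaps left behind by removed tokens.
        return Regex.Replace(result, @"\s+", " ").Trim();
    }

    public static void Main()
    {
        Console.WriteLine(
            RemoveDuplicates("I like the environment. The environment is good."));
        // prints: I like the environment. is good
    }
}
```

Passing StringComparer.OrdinalIgnoreCase to the set replaces the ToUpperInvariant call, and the final whitespace pass is what brings the output in line with the desired string from the question.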


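This seems to work for me:

(\b\S+\b)(?=.*\1)

It matches every occurrence of a word that appears again later on the same line, so replacing the matches with an empty string keeps only the last occurrence of each word. Take these lines:

apple apple orange
orange red blue green orange green blue
pirates ninjas cowboys ninjas pirates

The pattern matches the first "apple"; the first "orange", "blue", and "green"; and the first "pirates" and "ninjas".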
Jeff Atwood answered Sep 22 '22 00:09
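
As a rough sketch of how the lookahead pattern from this answer could be applied in C#, the example below deletes every match and then tidies the whitespace. The class LookaheadDedup, the helper Strip, and the WholeWord pattern are assumptions for illustration and do not come from the answer. The stricter variant is included because the backreference in the original pattern is not anchored to word boundaries, so a short word such as "a" also matches when it merely appears inside a later word.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookaheadDedup
{
    // The pattern exactly as given in the answer.
    public const string Original = @"(\b\S+\b)(?=.*\1)";

    // Assumed stricter variant: the later repeat must be a whole word.
    public const string WholeWord = @"\b(\w+)\b(?=.*\b\1\b)";

    // Hypothetical helper: delete every match, then collapse whitespace.
    public static string Strip(string text, string pattern)
    {
        string result = Regex.Replace(text, pattern, string.Empty);
        return Regex.Replace(result, @"\s+", " ").Trim();
    }

    public static void Main()
    {
        Console.WriteLine(Strip("pirates ninjas cowboys ninjas pirates", Original));
        // prints: cowboys ninjas pirates

        Console.WriteLine(Strip("a cat", Original));   // prints: cat
        Console.WriteLine(Strip("a cat", WholeWord));  // prints: a cat
    }
}
```

Both patterns keep the last occurrence of each word rather than the first, which differs from the desired output in the question. Matching is case-sensitive as written; passing RegexOptions.IgnoreCase to Regex.Replace should make "The" and "the" compare as equal.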