Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Regular Expression For Duplicate Words

People also ask

How do you match duplicate words in regex?

Following example shows how to search duplicate words in a regular expression by using p. matcher() method and m. group() method of regex. Matcher class.

What is a word boundary regex?

A word boundary, in most regex dialects, is a position between \w and \W (non-word char), or at the beginning or end of a string if it begins or ends (respectively) with a word character ( [0-9A-Za-z_] ).

How do I remove duplicates from a sentence?

1) Split input sentence separated by space into words. 2) So to get all those strings together first we will join each string in given list of strings. 3) Now create a dictionary using Counter method having strings as keys and their frequencies as values. 4) Join each words are unique to form single string.


Try this regular expression:

\b(\w+)\s+\1\b

Here \b is a word boundary and \1 references the captured match of the first group.


I believe this regex handles more situations:

/(\b\S+\b)\s+\b\1\b/

A good selection of test strings can be found here: http://callumacrae.github.com/regex-tuesday/challenge1.html


The below expression should work correctly to find any number of consecutive words. The matching can be case insensitive.

String regex = "\\b(\\w+)(\\s+\\1\\b)*";
Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);

Matcher m = p.matcher(input);

// Check for subsequences of input that match the compiled pattern
while (m.find()) {
     input = input.replaceAll(m.group(0), m.group(1));
}

Sample Input : Goodbye goodbye GooDbYe

Sample Output : Goodbye

Explanation:

The regex expression:

\b : Start of a word boundary

\w+ : Any number of word characters

(\s+\1\b)* : Any number of space followed by word which matches the previous word and ends the word boundary. Whole thing wrapped in * helps to find more than one repetitions.

Grouping :

m.group(0) : Shall contain the matched group in above case Goodbye goodbye GooDbYe

m.group(1) : Shall contain the first word of the matched pattern in above case Goodbye

Replace method shall replace all consecutive matched words with the first instance of the word.


Try this with below RE

  • \b start of word word boundary
  • \W+ any word character
  • \1 same word matched already
  • \b end of word
  • ()* Repeating again

    public static void main(String[] args) {
    
        String regex = "\\b(\\w+)(\\b\\W+\\b\\1\\b)*";//  "/* Write a RegEx matching repeated words here. */";
        Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE/* Insert the correct Pattern flag here.*/);
    
        Scanner in = new Scanner(System.in);
    
        int numSentences = Integer.parseInt(in.nextLine());
    
        while (numSentences-- > 0) {
            String input = in.nextLine();
    
            Matcher m = p.matcher(input);
    
            // Check for subsequences of input that match the compiled pattern
            while (m.find()) {
                input = input.replaceAll(m.group(0),m.group(1));
            }
    
            // Prints the modified sentence.
            System.out.println(input);
        }
    
        in.close();
    }
    

Regex to Strip 2+ duplicate words (consecutive/non-consecutive words)

Try this regex that can catch 2 or more duplicates words and only leave behind one single word. And the duplicate words need not even be consecutive.

/\b(\w+)\b(?=.*?\b\1\b)/ig

Here, \b is used for Word Boundary, ?= is used for positive lookahead, and \1 is used for back-referencing.

Example Source


The widely-used PCRE library can handle such situations (you won't achieve the the same with POSIX-compliant regex engines, though):

(\b\w+\b)\W+\1

No. That is an irregular grammar. There may be engine-/language-specific regular expressions that you can use, but there is no universal regular expression that can do that.