Regular Expression For Duplicate Words

People also ask

How do you match duplicate words in regex?

In Java, you can search for duplicate words with a regular expression by using the Pattern.matcher() method and the Matcher.group() method from the java.util.regex package.

What is a word boundary regex?

A word boundary, in most regex dialects, is a position between a word character (\w) and a non-word character (\W), or at the beginning or end of a string if the string begins or ends (respectively) with a word character ([0-9A-Za-z_]).
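To make the definition concrete, here is a small Java sketch that counts matches with and without \b. The class and method names are made up for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordBoundaryDemo {
    // Counts non-overlapping matches of the given regex in the text.
    static int countMatches(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "cat concatenate cat_1 cat";
        // Plain "cat" also matches inside "concatenate" and "cat_1".
        System.out.println(countMatches("cat", text));
        // With \b on both sides only the standalone word matches;
        // "cat_1" is skipped because '_' counts as a word character.
        System.out.println(countMatches("\\bcat\\b", text));
    }
}
```

This should print 4 and then 2.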

How do I remove duplicates from a sentence?

1) Split the input sentence on spaces into words.
2) Collect those words into a single list of strings.
3) Create a dictionary with Python's collections.Counter, using the words as keys and their frequencies as values.
4) Join the keys, each of which is unique, with spaces to form a single string.
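The same idea can be sketched in Java. This is not the Counter-based version described above: a LinkedHashSet stands in for the dictionary because only the unique keys are needed, and it keeps the words in their original order. The class and method names are hypothetical.

```java
import java.util.LinkedHashSet;
import java.util.Set;

class UniqueWords {
    // Keeps the first occurrence of every word and preserves the original order.
    // The comparison is case-sensitive, so "The" and "the" count as different words.
    static String removeDuplicateWords(String sentence) {
        Set<String> unique = new LinkedHashSet<>();
        for (String word : sentence.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                unique.add(word);
            }
        }
        return String.join(" ", unique);
    }

    public static void main(String[] args) {
        System.out.println(removeDuplicateWords("the quick the lazy quick dog"));
    }
}
```

For the sample sentence this should print: the quick lazy dog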

Try this regular expression:

\b(\w+)\s+\1\b

Here \b is a word boundary and \1 references the captured match of the first group.
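As a rough illustration, the pattern could be used from Java like this. The class name, method name, and sample sentence are invented for the example; only the regex comes from the answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AdjacentDuplicateFinder {
    // Same pattern as above, with backslashes doubled for a Java string literal.
    private static final Pattern DUPLICATE = Pattern.compile("\\b(\\w+)\\s+\\1\\b");

    // Returns each word that is immediately followed by an identical copy.
    static List<String> findDuplicates(String text) {
        List<String> words = new ArrayList<>();
        Matcher m = DUPLICATE.matcher(text);
        while (m.find()) {
            words.add(m.group(1)); // group(1) is the captured word
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(findDuplicates("This is is a test of the the regex"));
    }
}
```

This should print [is, the]. The leading \b keeps the "is" inside "This" from being treated as the first half of a duplicate.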

I believe this regex handles more situations:

/(\b\S+\b)\s+\b\1\b/

A good selection of test strings can be found here: http://callumacrae.github.com/regex-tuesday/challenge1.html
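One case where the two patterns differ is a word containing an apostrophe. The sketch below compares them in Java; the class name and test sentence are assumptions made for the example.

```java
import java.util.regex.Pattern;

class WordVersusNonSpace {
    // First answer: words are runs of \w characters.
    static final Pattern WORD_CHARS = Pattern.compile("\\b(\\w+)\\s+\\1\\b");
    // Second answer: words are runs of non-whitespace characters.
    static final Pattern NON_SPACE = Pattern.compile("(\\b\\S+\\b)\\s+\\b\\1\\b");

    static boolean hasDuplicate(Pattern p, String text) {
        return p.matcher(text).find();
    }

    public static void main(String[] args) {
        String text = "it's it's raining";
        // The apostrophe is not a word character, so \w+ never captures "it's".
        System.out.println(hasDuplicate(WORD_CHARS, text));
        // \S+ captures "it's" as a whole, so the repeat is found.
        System.out.println(hasDuplicate(NON_SPACE, text));
    }
}
```

This should print false and then true.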

The expression below should find any number of consecutive repeated words. The matching can be made case-insensitive.

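import java.util.regex.Matcher;
import java.util.regex.Pattern;

String regex = "\\b(\\w+)(\\s+\\1\\b)*";
Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);

Matcher m = p.matcher(input);

// Replace each matched run, m.group(0), with its first word, m.group(1).
// Matcher.replaceAll("$1") does this in one pass; passing m.group(0) to
// String.replaceAll would re-interpret the matched text as a regex.
input = m.replaceAll("$1");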

Sample Input : Goodbye goodbye GooDbYe

Sample Output : Goodbye

Explanation:

The regular expression:

\b : A word boundary; here, the start of a word

\w+ : One or more word characters

(\s+\1\b)* : One or more whitespace characters, followed by the same word again (\1), ending at a word boundary. The trailing * repeats this group zero or more times, so runs of more than two repetitions are matched.

Grouping:

m.group(0) : The entire match; for the sample input, Goodbye goodbye GooDbYe

m.group(1) : The first word of the match; for the sample input, Goodbye

The replacement substitutes each run of consecutive repeated words with the first instance of the word.
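To see the two groups in action, the explanation above can be sketched as a small Java program. The class name, method name, and the longer sample sentence are invented for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RepeatCollapser {
    private static final Pattern REPEATS =
            Pattern.compile("\\b(\\w+)(\\s+\\1\\b)*", Pattern.CASE_INSENSITIVE);

    // Replaces every run of repeated words with the first word of the run.
    static String collapse(String input) {
        return REPEATS.matcher(input).replaceAll("$1");
    }

    public static void main(String[] args) {
        String input = "Goodbye goodbye GooDbYe and hello Hello world";
        Matcher m = REPEATS.matcher(input);
        while (m.find()) {
            // group(2) is null when the word was not repeated at all.
            if (m.group(2) != null) {
                System.out.println("[" + m.group(0) + "] -> [" + m.group(1) + "]");
            }
        }
        System.out.println(collapse(input));
    }
}
```

The pattern matches every word, repeated or not, so the loop filters on group(2) to report only the runs that actually contain a repeat.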

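Try the regular expression below.

\b : word boundary at the start of a word
\w+ : one or more word characters (the captured word)
\W+ : one or more non-word characters separating the repeats
\1 : the same word that was already captured
\b : word boundary at the end of the word

()* : repeats the group zero or more times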

import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DuplicateWords {

    public static void main(String[] args) {

        // A word, followed by any number of repeats separated by non-word characters.
        String regex = "\\b(\\w+)(\\b\\W+\\b\\1\\b)*";
        Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);

        Scanner in = new Scanner(System.in);

        // The first line holds the number of sentences that follow.
        int numSentences = Integer.parseInt(in.nextLine());

        while (numSentences-- > 0) {
            String input = in.nextLine();

            Matcher m = p.matcher(input);

            // Replace each matched run with its first word (group 1).
            // Matcher.replaceAll avoids treating the matched text, which may
            // contain regex metacharacters such as '?', as a pattern.
            input = m.replaceAll("$1");

            // Prints the modified sentence.
            System.out.println(input);
        }

        in.close();
    }
}
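The practical difference between the \s+ separator used earlier and the \W+ separator used here shows up when punctuation sits between the repeats. A minimal comparison, with a class name and sample sentence chosen only for illustration:

```java
import java.util.regex.Pattern;

class SeparatorComparison {
    // Repeats must be separated by whitespace only.
    static final Pattern SPACE_SEPARATED =
            Pattern.compile("\\b(\\w+)(\\s+\\1\\b)*", Pattern.CASE_INSENSITIVE);
    // Repeats may be separated by any non-word characters, such as ", ".
    static final Pattern NON_WORD_SEPARATED =
            Pattern.compile("\\b(\\w+)(\\b\\W+\\b\\1\\b)*", Pattern.CASE_INSENSITIVE);

    static String collapse(Pattern p, String input) {
        return p.matcher(input).replaceAll("$1");
    }

    public static void main(String[] args) {
        String input = "Hello, hello world world";
        System.out.println(collapse(SPACE_SEPARATED, input));
        System.out.println(collapse(NON_WORD_SEPARATED, input));
    }
}
```

The first line printed should be "Hello, hello world" and the second "Hello world". Note that the \W+ version also drops the punctuation between the repeats, which may or may not be what you want.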

Regex to Strip 2+ duplicate words (consecutive/non-consecutive words)

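Try this regex, which catches words that occur two or more times and leaves a single copy of each. The duplicate words do not even need to be consecutive. Each match is an occurrence that appears again later in the text, so replacing the matches with an empty string keeps only the last occurrence of each word.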

/\b(\w+)\b(?=.*?\b\1\b)/ig

Here, \b is a word boundary, (?=...) is a positive lookahead, and \1 is a back-reference to the first captured group. The i and g flags make the match case-insensitive and global.
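A possible Java translation of this pattern is sketched below. The i flag becomes Pattern.CASE_INSENSITIVE and the g flag is covered by replaceAll. The whitespace clean-up step is an addition of this example, not part of the original regex, and the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class NonConsecutiveDuplicates {
    // A word that occurs again somewhere later on the same line.
    private static final Pattern REPEATED_LATER =
            Pattern.compile("\\b(\\w+)\\b(?=.*?\\b\\1\\b)", Pattern.CASE_INSENSITIVE);

    // Removes every occurrence of a word except the last one.
    static String stripEarlierDuplicates(String text) {
        String stripped = REPEATED_LATER.matcher(text).replaceAll("");
        // Assumption of this sketch: collapse the gaps left by removed words.
        return stripped.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        System.out.println(stripEarlierDuplicates("one two one three two"));
    }
}
```

For the sample input this should print "one three two". Because . does not match line terminators by default, duplicates are only detected within a single line.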


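The widely used PCRE library can handle such situations with a back-reference (you won't achieve the same with strictly POSIX-compliant regex engines, though: POSIX extended regular expressions do not define back-references, and \b and \w are not POSIX syntax):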

(\b\w+\b)\W+\1

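No. The set of strings containing a repeated word is not a regular language, so a regular expression in the formal sense cannot match it. Engine- or language-specific extensions such as back-references can do it, but there is no universal regular expression that works everywhere.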
