Regex in java for finding duplicate consecutive words

Tags: java, regex

I saw this as an answer for finding repeated words in a string. But when I use it, it thinks "This" and "is" are the same and deletes the "is".

Regex

"\\b(\\w+)\\b\\s+\\1"

Any idea why this is happening?

Here is the code that I am using for duplicate removal

public static String RemoveDuplicateWords(String input)
{
    String originalText = input;
    String output = "";
    Pattern p = Pattern.compile("\\b(\\w+)\\b\\s+\\b\\1\\b", Pattern.MULTILINE+Pattern.CASE_INSENSITIVE);
    //Pattern p = Pattern.compile("\\b(\\w+)\\b\\s+\\1", Pattern.MULTILINE+Pattern.CASE_INSENSITIVE);
    Matcher m = p.matcher(input);
    if (!m.find())
        output = "No duplicates found, no changes made to data";
    else
    {
        while (m.find())
        {
            if (output == "")
                output = input.replaceFirst(m.group(), m.group(1));
            else
                output = output.replaceAll(m.group(), m.group(1));
        }
        input = output;
        m = p.matcher(input);
        while (m.find())
        {
            output = "";
            if (output == "")
                output = input.replaceAll(m.group(), m.group(1));
            else
                output = output.replaceAll(m.group(), m.group(1));
        }
    }
    return output;
}
user1190265 asked Feb 05 '12 06:02



2 Answers

Try this one:

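import java.util.regex.Matcher;
import java.util.regex.Pattern;

// (?i) already makes the pattern case-insensitive, so no extra flag is needed.
String pattern = "(?i)\\b([a-z]+)\\b(?:\\s+\\1\\b)+";
Pattern r = Pattern.compile(pattern);

String input = "your string";
Matcher m = r.matcher(input);
// Replace each run of repeated words with its first word. Matcher.replaceAll
// replaces only the matched positions, so the "is" inside "This" is left alone.
input = m.replaceAll("$1");
System.out.println(input);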

Java regular expressions are explained very well in the API documentation of the Pattern class. Here is the expression again, with spaces added to separate its parts:

"(?i) \\b ([a-z]+) \\b (?: \\s+ \\1 \\b )+"

(?i)     turn on case-insensitive matching
\b       match a word boundary
[a-z]+   match a word with one or more characters;
         the parentheses capture the word as a group    
\b       match a word boundary
(?:      indicates a non-capturing group (which starts here)
\s+      match one or more white space characters
\1       is a back reference to the first (captured) group;
         so the word is repeated here
\b       match a word boundary
)+       indicates the end of the non-capturing group and
         allows it to occur one or more times
Mina Wissa answered Sep 23 '22 01:09


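You should have used \b(\w+)\b\s+\b\1\b. The extra \b after \1 keeps the back reference from matching just the beginning of a longer word (for example the "the" in "the theory").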
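A small sketch of the difference the trailing \b makes, comparing the two expressions from the question. The class name TrailingBoundary and the helper hasDuplicate are made up for this example.

```java
import java.util.regex.Pattern;

public class TrailingBoundary {
    // Expression without a boundary after the back reference.
    static final Pattern WITHOUT =
            Pattern.compile("\\b(\\w+)\\b\\s+\\1", Pattern.CASE_INSENSITIVE);
    // Expression with \b on both sides of the back reference.
    static final Pattern WITH =
            Pattern.compile("\\b(\\w+)\\b\\s+\\b\\1\\b", Pattern.CASE_INSENSITIVE);

    // True if the pattern finds a "duplicate" anywhere in the text.
    static boolean hasDuplicate(Pattern p, String text) {
        return p.matcher(text).find();
    }

    public static void main(String[] args) {
        // "the" followed by a word that merely starts with "the"
        System.out.println(hasDuplicate(WITHOUT, "the theory")); // true (false positive)
        System.out.println(hasDuplicate(WITH, "the theory"));    // false
        // A real duplicate is found by both
        System.out.println(hasDuplicate(WITHOUT, "the the"));    // true
        System.out.println(hasDuplicate(WITH, "the the"));       // true
    }
}
```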

Hope this is what you want...

Update 1

Well, the output you are getting is the final string after the duplicates have been removed:

import java.util.regex.*;

public class MyDup {
    public static void main(String[] args) {
        String input = "This This is text text another another";
        String originalText = input;
        String output = "";
        Pattern p = Pattern.compile("\\b(\\w+)\\b\\s+\\b\\1\\b", Pattern.MULTILINE+Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(input);
        System.out.println(m);
        if (!m.find())
            output = "No duplicates found, no changes made to data";
        else
        {
            while (m.find())
            {
                if (output == "") {
                    output = input.replaceFirst(m.group(), m.group(1));
                } else {
                    output = output.replaceAll(m.group(), m.group(1));
                }
            }
            input = output;
            m = p.matcher(input);
            while (m.find())
            {
                output = "";
                if (output == "") {
                    output = input.replaceAll(m.group(), m.group(1));
                } else {
                    output = output.replaceAll(m.group(), m.group(1));
                }
            }
        }
        System.out.println("After removing duplicate the final string is " + output);
    }
}

Run this code and look at the output; it should answer your question.
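One detail worth knowing while reading the code above: every call to m.find() moves on to the next match, so the find() inside the if has already used up the first duplicate before the while loop starts. Here is a rough sketch of that effect; the class and method names are invented for the illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FindConsumesMatch {
    static final Pattern DUP =
            Pattern.compile("\\b(\\w+)\\b\\s+\\b\\1\\b", Pattern.CASE_INSENSITIVE);

    // Same shape as the code above: one find() as a check, then a find() loop.
    static List<String> matchesAfterInitialFind(String input) {
        Matcher m = DUP.matcher(input);
        List<String> seen = new ArrayList<>();
        if (m.find()) {              // this call already consumes the first match
            while (m.find()) {
                seen.add(m.group());
            }
        }
        return seen;
    }

    // A plain find() loop sees every match.
    static List<String> allMatches(String input) {
        Matcher m = DUP.matcher(input);
        List<String> seen = new ArrayList<>();
        while (m.find()) {
            seen.add(m.group());
        }
        return seen;
    }

    public static void main(String[] args) {
        String input = "This This is text text another another";
        System.out.println(matchesAfterInitialFind(input)); // [text text, another another]
        System.out.println(allMatches(input));              // [This This, text text, another another]
    }
}
```

In the code above, the skipped "This This" is only picked up because a second matcher is created and looped over afterwards.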

Note

In output you are replacing each duplicate with a single word, aren't you?

When I put System.out.println(m.group() + " : " + m.group(1)); in the first if branch, I get text text : text as output, i.e. each duplicate is being replaced by a single word.

else
    {
        while (m.find())
        {
            if (output == "") {
                System.out.println(m.group() + " : " + m.group(1));
                output = input.replaceFirst(m.group(), m.group(1));
            } else {

Hope you got now what is going on... :)
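One caveat with output.replaceAll(m.group(), m.group(1)): it searches the whole string again for the matched text, ignoring word boundaries, so it can also hit the same letters inside other words. The sketch below compares that with Matcher.replaceAll("$1"), which only touches the positions the pattern actually matched. The class and method names are assumptions made for this example, not part of the original code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ReplacePitfall {
    static final Pattern DUP =
            Pattern.compile("\\b(\\w+)\\b\\s+\\b\\1\\b", Pattern.CASE_INSENSITIVE);

    // Re-searches the string for the matched text, like the replaceAll calls above.
    static String viaStringReplaceAll(String input) {
        String output = input;
        Matcher m = DUP.matcher(input);
        while (m.find()) {
            output = output.replaceAll(m.group(), m.group(1));
        }
        return output;
    }

    // Replaces only the spans the duplicate-word pattern matched.
    static String viaMatcherReplaceAll(String input) {
        return DUP.matcher(input).replaceAll("$1");
    }

    public static void main(String[] args) {
        String input = "This island is is small";
        // "is is" also occurs across "This island", so that part gets merged too
        System.out.println(viaStringReplaceAll(input));  // Thisland is small
        System.out.println(viaMatcherReplaceAll(input)); // This island is small
    }
}
```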

Good Luck!!! Cheers!!!

Fahim Parkar answered Sep 22 '22 01:09