I saw this as an answer for finding repeated words in a string. But when I use it, it thinks This
and is
are the same and deletes the is
.
Regex
"\\b(\\w+)\\b\\s+\\1"
Any idea why this is happening?
Here is the code that I am using for duplicate removal
public static String RemoveDuplicateWords(String input)
{
String originalText = input;
String output = "";
Pattern p = Pattern.compile("\b(\w+)\b\s+\b\1\b", Pattern.MULTILINE+Pattern.CASE_INSENSITIVE);
//Pattern p = Pattern.compile("\\b(\\w+)\\b\\s+\\1", Pattern.MULTILINE+Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher(input);
if (!m.find())
output = "No duplicates found, no changes made to data";
else
{
while (m.find())
{
if (output == "")
output = input.replaceFirst(m.group(), m.group(1));
else
output = output.replaceAll(m.group(), m.group(1));
}
input = output;
m = p.matcher(input);
while (m.find())
{
output = "";
if (output == "")
output = input.replaceAll(m.group(), m.group(1));
else
output = output.replaceAll(m.group(), m.group(1));
}
}
return output;
}
Following example shows how to search duplicate words in a regular expression by using p. matcher() method and m. group() method of regex. Matcher class.
Regex is faster for large string than an if (perhaps in a for loops) to check if anything matches your requirement.
Try this one:
String pattern = "(?i)\\b([a-z]+)\\b(?:\\s+\\1\\b)+";
Pattern r = Pattern.compile(pattern, Pattern.CASE_INSENSITIVE);
String input = "your string";
Matcher m = r.matcher(input);
while (m.find()) {
input = input.replaceAll(m.group(), m.group(1));
}
System.out.println(input);
The Java regular expressions are explained very well in the API documentation of the Pattern class. After adding some spaces to indicate the different parts of the regular expression:
"(?i) \\b ([a-z]+) \\b (?: \\s+ \\1 \\b )+"
\b match a word boundary
[a-z]+ match a word with one or more characters;
the parentheses capture the word as a group
\b match a word boundary
(?: indicates a non-capturing group (which starts here)
\s+ match one or more white space characters
\1 is a back reference to the first (captured) group;
so the word is repeated here
\b match a word boundary
)+ indicates the end of the non-capturing group and
allows it to occur one or more times
you should have used \b(\w+)\b\s+\b\1\b
, click here to see the result...
Hope this is what you want...
Well well well, the output that you have is
import java.util.regex.*;
public class MyDup {
public static void main (String args[]) {
String input="This This is text text another another";
String originalText = input;
String output = "";
Pattern p = Pattern.compile("\\b(\\w+)\\b\\s+\\b\\1\\b", Pattern.MULTILINE+Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher(input);
System.out.println(m);
if (!m.find())
output = "No duplicates found, no changes made to data";
else
{
while (m.find())
{
if (output == "") {
output = input.replaceFirst(m.group(), m.group(1));
} else {
output = output.replaceAll(m.group(), m.group(1));
}
}
input = output;
m = p.matcher(input);
while (m.find())
{
output = "";
if (output == "") {
output = input.replaceAll(m.group(), m.group(1));
} else {
output = output.replaceAll(m.group(), m.group(1));
}
}
}
System.out.println("After removing duplicate the final string is " + output);
}
Run this code and see what you get as output... Your queries will be solved...
In output
you are replacing duplicate by single word... Isn't it??
When I put System.out.println(m.group() + " : " + m.group(1));
in first if condition I get output as text text : text
i.e. duplicates are replacing by single word.
else
{
while (m.find())
{
if (output == "") {
System.out.println(m.group() + " : " + m.group(1));
output = input.replaceFirst(m.group(), m.group(1));
} else {
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With