I have a regular expression that selects all the words containing all (not just any) of the specified letters. It works fine in Notepad++.
Regular expression pattern:
^(?=.*B)(?=.*T)(?=.*L).+$
Input text file:
AL
BAL
BAK
LABAT
TAL
LAT
BALAT
LA
AB
LATAB
TAB
And the output of the regular expression in Notepad++:
LABAT
BALAT
LATAB
Since it works in Notepad++, I tried the same regular expression in Java, but it simply failed.
Here is my test code:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import com.lev.kelimelik.resource.*;
public class Test {
public static void main(String[] args) {
String patternString = "^(?=.*B)(?=.*T)(?=.*L).+$";
String dictionary =
"AL" + "\n"
+"BAL" + "\n"
+"BAK" + "\n"
+"LABAT" + "\n"
+"TAL" + "\n"
+"LAT" + "\n"
+"BALAT" + "\n"
+"LA" + "\n"
+"AB" + "\n"
+"LATAB" + "\n"
+"TAB" + "\n";
Pattern p = Pattern.compile(patternString, Pattern.DOTALL);
Matcher m = p.matcher(dictionary);
while(m.find())
{
System.out.println("Match: " + m.group());
}
}
}
The output is erroneous, as shown below:
Match: AL
BAL
BAK
LABAT
TAL
LAT
BALAT
LA
AB
LATAB
TAB
My question is simply: what is the Java-compatible version of this regular expression?
In real life, we rarely need to validate several lines at once, and I see that you are in fact just using the input as an array of test data. The most common scenario is to read the input line by line and check each line. In Notepad++ the solution would be a bit different, but in Java each line should be checked separately.
That said, you should not blindly copy the same approach across platforms: what works well in Notepad++ is not necessarily good in Java.
I suggest this almost regex-free approach (String#split() still uses a regex):
String dictionary_str =
"AL" + "\n"
+"BAL" + "\n"
+"BAK" + "\n"
+"LABAT" + "\n"
+"TAL" + "\n"
+"LAT" + "\n"
+"BALAT" + "\n"
+"LA" + "\n"
+"AB" + "\n"
+"LATAB" + "\n"
+"TAB" + "\n";
String[] dictionary = dictionary_str.split("\n"); // Split into lines
for (int i=0; i<dictionary.length; i++) // Iterate through lines
{
if(dictionary[i].indexOf("B") > -1 && // There must be B
dictionary[i].indexOf("T") > -1 && // There must be T
dictionary[i].indexOf("L") > -1) // There must be L
{
System.out.println("Match: " + dictionary[i]); // No need matching, print the whole line
}
}
See IDEONE demo
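The same check generalizes to any set of required letters. A minimal sketch, assuming a hypothetical helper named containsAll and a wrapper class ContainsAllDemo (neither is part of the original answer), run on the same sample words:

```java
import java.util.ArrayList;
import java.util.List;

public class ContainsAllDemo {
    // Hypothetical helper: true only if word contains every character of required.
    static boolean containsAll(String word, String required) {
        for (char c : required.toCharArray()) {
            if (word.indexOf(c) < 0) {
                return false; // One missing letter is enough to reject the word
            }
        }
        return true;
    }

    // Collects the words that contain all required letters, keeping input order.
    static List<String> filter(String[] words, String required) {
        List<String> matches = new ArrayList<>();
        for (String word : words) {
            if (containsAll(word, required)) {
                matches.add(word);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        String[] dictionary = {"AL", "BAL", "BAK", "LABAT", "TAL", "LAT",
                               "BALAT", "LA", "AB", "LATAB", "TAB"};
        for (String word : filter(dictionary, "BTL")) {
            System.out.println("Match: " + word);
        }
    }
}
```

The required letters are passed as a plain string, so changing the set of letters does not require touching the loop.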
It is also best not to rely on .* where it can be avoided, since this construct can cause a lot of needless backtracking. In this case, you can easily optimize it with negated character classes and possessive quantifiers:
^(?=[^B]*+B)(?=[^T]*+T)(?=[^L]*+L)
The regex breakdown:
^ - start of string
(?=[^B]*+B) - right at the start of the string, check for at least one B that may be preceded by 0 or more characters other than B
(?=[^T]*+T) - still right at the start of the string, check for at least one T that may be preceded by 0 or more characters other than T
(?=[^L]*+L) - still right at the start of the string, check for at least one L that may be preceded by 0 or more characters other than L
See Java demo:
String patternString = "^(?=[^B]*+B)(?=[^T]*+T)(?=[^L]*+L)";
String[] dictionary = {"AL", "BAL", "BAK", "LABAT", "TAL", "LAT", "BALAT", "LA", "AB", "LATAB", "TAB"};
Pattern p = Pattern.compile(patternString); // Compile once, reuse for every word
for (int i=0; i<dictionary.length; i++)
{
Matcher m = p.matcher(dictionary[i]);
if(m.find())
{
System.out.println("Match: " + dictionary[i]);
}
}
Output:
Match: LABAT
Match: BALAT
Match: LATAB
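One caveat if this pattern is ever run against a whole multi-line string rather than single words: a negated class such as [^B] also matches line breaks, so the lookaheads could reach into later lines. A hedged sketch of a multi-line variant, assuming the input uses only \n line breaks; the class name MultilineLookaheadDemo and the method findWords are illustrative, not from the answer:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MultilineLookaheadDemo {
    // [^B\n] keeps each lookahead inside the current line;
    // MULTILINE makes ^ and $ anchor at every line instead of the whole input.
    static final Pattern WORD = Pattern.compile(
            "^(?=[^B\\n]*+B)(?=[^T\\n]*+T)(?=[^L\\n]*+L).+$", Pattern.MULTILINE);

    // Returns every line of text that contains B, T and L.
    static List<String> findWords(String text) {
        List<String> matches = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String text = "AL\nBAL\nBAK\nLABAT\nTAL\nLAT\nBALAT\nLA\nAB\nLATAB\nTAB\n";
        for (String word : findWords(text)) {
            System.out.println("Match: " + word);
        }
    }
}
```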
Change your pattern to the following, and compile it without Pattern.DOTALL so that . does not match line breaks:
String patternString = ".*(?=.*B)(?=.*L)(?=.*T).*";
Output:
Match: LABAT
Match: BALAT
Match: LATAB
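For reference, the pattern from the question can also be kept unchanged if only the flag is swapped. With Pattern.DOTALL the dot crosses line breaks and ^ and $ anchor only at the ends of the whole input, so the entire text becomes a single match; with Pattern.MULTILINE, ^ and $ anchor at every line and the dot stays within it. A minimal sketch under the assumption that the input uses \n line breaks; the class name FlagDemo and the method findAll are illustrative only:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FlagDemo {
    // The regular expression exactly as written in the question.
    static final String REGEX = "^(?=.*B)(?=.*T)(?=.*L).+$";

    // Runs REGEX over text with the given Pattern flags and collects all matches.
    static List<String> findAll(String text, int flags) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(REGEX, flags).matcher(text);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String text = "AL\nBAL\nBAK\nLABAT\nTAL\nLAT\nBALAT\nLA\nAB\nLATAB\nTAB\n";
        // DOTALL: the whole text is returned as one match.
        System.out.println("DOTALL matches: " + findAll(text, Pattern.DOTALL).size());
        // MULTILINE: each qualifying line is its own match.
        for (String word : findAll(text, Pattern.MULTILINE)) {
            System.out.println("Match: " + word);
        }
    }
}
```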