We are doing Data Loss Prevention (DLP) for emails. The problem is that when people reply to an email several times, the same credit card number or account number can appear multiple times in the thread.
How can we get a Java regex to match each distinct string only once?
For example, we use the following regex to catch account numbers made of 2 letters followed by 5 or 6 digits. It also skips numbers that start with CR or cr.
\b(?!CR)(?!cr)[A-Za-z]{2}[0-9]{5,6}\b
How can we have it find:
CX12345
CX14584
JB145888
JD748452
CX12345 (ignore, as it was already found above)
LM45855
A unique occurrence of a string can be matched with
<STRING_PATTERN>(?!.*<STRING_PATTERN>) // Find the last occurrence
(?<!<STRING_PATTERN>.*)<STRING_PATTERN> // Find the first occurrence, only works in regex
// that supports infinite-width lookbehind patterns
where <STRING_PATTERN> is the pattern whose unique occurrence you are searching for. Both work with the .NET regex library, but the second one is not supported by most other libraries (only the PyPI regex library for Python and JavaScript engines that implement ECMAScript 2018 support it). Note that . does not match line break chars by default, so you need to enable a DOTALL-style modifier: in most libraries you can add the (?s) inline modifier to the pattern (in Ruby, (?m) does the same), or pass the corresponding flag to the regex compile method. See more about this in How do I match any character across multiple lines in a regular expression?
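As a rough Java illustration of the line-break point, the sketch below runs a "last occurrence" pattern with and without Pattern.DOTALL. The class name DotAllDemo, the helper matches, and the two-letters-plus-five-digits pattern are assumptions made for this example only. It uses a \1 backreference rather than repeating the pattern, so that the lookahead checks for the same matched text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DotAllDemo {
    // Hypothetical stand-in for <STRING_PATTERN>: two letters followed by five digits.
    // The lookahead rejects a match if the same text (\1) appears again later on.
    static final String LAST_OCCURRENCE = "\\b([A-Za-z]{2}\\d{5})\\b(?!.*\\b\\1\\b)";

    // Collects Group 1 of every match, compiling the pattern with the given flags.
    public static List<String> matches(String text, int flags) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(LAST_OCCURRENCE, flags).matcher(text);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "AB12345\nCD67890\nAB12345";
        // Without DOTALL, .* stops at the line break, so the duplicate is not seen.
        System.out.println(matches(text, 0));              // [AB12345, CD67890, AB12345]
        // With DOTALL, .* crosses line breaks and only the last AB12345 is kept.
        System.out.println(matches(text, Pattern.DOTALL)); // [CD67890, AB12345]
    }
}
```

Putting (?s) at the start of the pattern string should have the same effect as passing Pattern.DOTALL.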
You seem to need a regex like this:
/\b((?!CR|cr)[A-Za-z]{2}\d{5,6})\b(?![\s\S]*\b\1\b)/
The regex demo is available here
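A minimal sketch of how this pattern might be used from Java, assuming the email text is available as a single String. The /.../ delimiters are dropped and the backslashes are doubled for the Java string literal; the class name LastOccurrenceDemo and the method findUnique are hypothetical names for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LastOccurrenceDemo {
    // Same regex as above, written as a Java string literal.
    private static final Pattern LAST_OCCURRENCE = Pattern.compile(
            "\\b((?!CR|cr)[A-Za-z]{2}\\d{5,6})\\b(?![\\s\\S]*\\b\\1\\b)");

    // Returns each distinct account number once, at the position of its LAST occurrence.
    public static List<String> findUnique(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = LAST_OCCURRENCE.matcher(text);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        String email = "CX12345\nCX14584\nJB145888\nJD748452\nCX12345\nLM45855\nCR12345";
        // The first CX12345 is skipped because the same value appears again later;
        // CR12345 is rejected by the (?!CR|cr) lookahead.
        System.out.println(findUnique(email));
        // prints [CX14584, JB145888, JD748452, CX12345, LM45855]
    }
}
```

Because the lookahead keeps the last occurrence rather than the first, the results come back in the order of each value's final appearance in the text.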
Details:
- \b - a leading word boundary
- ((?!CR|cr)[A-Za-z]{2}\d{5,6}) - Group 1 capturing:
  - (?!CR|cr) - a negative lookahead check: the next two characters cannot be CR or cr
  - [A-Za-z]{2} - 2 ASCII letters
  - \d{5,6} - 5 to 6 digits
- \b - a trailing word boundary
- (?![\s\S]*\b\1\b) - a negative lookahead that fails the match if there are any 0+ chars ([\s\S]*) followed by a word boundary (\b), the same value captured into Group 1 (with the \1 backreference), and a trailing word boundary.

I would use a Map of some sort here, to keep a tally of the strings you encounter. For example:
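import java.util.HashMap;
import java.util.Map;

String ccNumber = "CX12345";
Map<String, Boolean> ccMap = new HashMap<>();
// matches() tests the whole string, so the ^ and $ anchors are optional here
if (ccNumber.matches("^(?!CR)(?!cr)[A-Za-z]{2}[0-9]{5,6}$")) {
    ccMap.put(ccNumber, Boolean.TRUE); // a repeated key overwrites its entry, so each number is stored once
}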
Then just iterate over the keyset of the map to get unique credit card numbers which matched the pattern in your regex:
for (String key : ccMap.keySet()) {
    System.out.println("Found a matching credit card: " + key);
}
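A variation on the same idea, sketched under the assumption that the whole email body is scanned as one String rather than one candidate at a time: Matcher.find() pulls out every candidate using the regex from the question, and a LinkedHashSet keeps each value once, in order of first appearance. The class name UniqueAccountNumbers and the method extract are hypothetical.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UniqueAccountNumbers {
    // The regex from the question, escaped for a Java string literal.
    private static final Pattern ACCOUNT =
            Pattern.compile("\\b(?!CR)(?!cr)[A-Za-z]{2}[0-9]{5,6}\\b");

    // Scans the body and returns every distinct match, keeping first-seen order.
    public static Set<String> extract(String body) {
        Set<String> seen = new LinkedHashSet<>();
        Matcher m = ACCOUNT.matcher(body);
        while (m.find()) {
            seen.add(m.group()); // add() ignores values that are already present
        }
        return seen;
    }

    public static void main(String[] args) {
        String body = "CX12345\nCX14584\nJB145888\nJD748452\nCX12345\nLM45855";
        System.out.println(extract(body));
        // prints [CX12345, CX14584, JB145888, JD748452, LM45855]
    }
}
```

Deduplicating in code keeps the regex simple and avoids the lookahead rescanning the rest of the text for every candidate, which may matter on long reply chains.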