
Only match unique string occurrences

Tags:

regex

We are doing Data Loss Prevention for emails. The issue is that when people reply to an email several times, the same credit card number or account number can appear multiple times in the thread.

How can we get a Java regex to match each string only once?

For example, we use the following regex to catch account numbers made of 2 letters followed by 5 or 6 digits. It also skips numbers that start with CR or cr.

\b(?!CR)(?!cr)[A-Za-z]{2}[0-9]{5,6}\b

How can we have it find:

CX12345
CX14584
JB145888
JD748452
CX12345 (ignore, as it was already found above)
LM45855
Chosenv3 asked Nov 07 '16 15:11


2 Answers

A unique occurrence of a string can be matched with:

<STRING_PATTERN>(?!.*<STRING_PATTERN>)  // Find the last occurrence
(?<!<STRING_PATTERN>.*)<STRING_PATTERN> // Find the first occurrence, only works in regex
                                        // that supports infinite-width lookbehind patterns

where <STRING_PATTERN> is the pattern whose unique occurrence you are searching for. Note that both work with the .NET regex library, but the second is not supported by most other libraries (only the PyPI regex library for Python and JavaScript engines that implement ECMAScript 2018 support it). Also note that . does not match line break characters by default, so you need to enable a DOTALL-style modifier: in most libraries you can add the (?s) inline modifier to the pattern (in Ruby, (?m) does the same), or pass the corresponding flag to the regex compile method. See more about this in "How do I match any character across multiple lines in a regular expression?"
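Since the question is about Java, here is a minimal sketch of why the DOTALL modifier matters for the "last occurrence" form. The class name DotallDemo, the helper findAll, and the simplified pattern (two uppercase letters plus five digits standing in for <STRING_PATTERN>) are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DotallDemo {
    // Simplified stand-in for <STRING_PATTERN>(?!.*<STRING_PATTERN>):
    // group 1 is the value, the lookahead rejects it if it occurs again later.
    static final String LAST_OCCURRENCE = "\\b([A-Z]{2}\\d{5})\\b(?!.*\\b\\1\\b)";

    // Collects every match of the pattern using the given Pattern flags.
    static List<String> findAll(String text, int flags) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(LAST_OCCURRENCE, flags).matcher(text);
        while (m.find()) {
            hits.add(m.group(1));
        }
        return hits;
    }

    public static void main(String[] args) {
        String text = "CX12345\nJB14588\nCX12345";

        // Without DOTALL the lookahead cannot see past the line break,
        // so the duplicate on the first line is not filtered out.
        System.out.println(findAll(text, 0));              // [CX12345, JB14588, CX12345]

        // With DOTALL the lookahead scans the whole input.
        System.out.println(findAll(text, Pattern.DOTALL)); // [JB14588, CX12345]
    }
}
```

Passing Pattern.DOTALL should be equivalent to prefixing the pattern with (?s).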

You seem to need a regex like this:

/\b((?!CR|cr)[A-Za-z]{2}\d{5,6})\b(?![\s\S]*\b\1\b)/

The regex demo is available here

Details:

  • \b - a leading word boundary
  • ((?!CR|cr)[A-Za-z]{2}\d{5,6}) - capturing group 1, which matches:
    • (?!CR|cr) - a negative lookahead: the next two characters cannot be CR or cr
    • [A-Za-z]{2} - 2 ASCII letters
    • \d{5,6} - 5 to 6 digits
  • \b - trailing word boundary
  • (?![\s\S]*\b\1\b) - a negative lookahead that fails the match if, after any 0+ chars ([\s\S]*), there is a word boundary (\b), the same value that was captured into Group 1 (the \1 backreference), and a trailing word boundary.
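
The pattern above can be tried from Java roughly as follows. This is a sketch, not a drop-in DLP rule: the class name UniqueAccountNumbers and the method findUnique are made up for the example, and the regex is written without the / delimiters and with backslashes doubled, as Java string literals require.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UniqueAccountNumbers {
    // Same regex as in the answer, escaped for a Java string literal.
    private static final Pattern LAST_OCCURRENCE = Pattern.compile(
            "\\b((?!CR|cr)[A-Za-z]{2}\\d{5,6})\\b(?![\\s\\S]*\\b\\1\\b)");

    // Returns each distinct account number once, at the position of its LAST occurrence.
    static List<String> findUnique(String text) {
        List<String> result = new ArrayList<>();
        Matcher m = LAST_OCCURRENCE.matcher(text);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "CX12345 CX14584 JB145888 JD748452 CX12345 LM45855 CR12345";
        // The first CX12345 is skipped because the same value appears again later;
        // CR12345 is rejected by the (?!CR|cr) lookahead.
        System.out.println(findUnique(text));
        // [CX14584, JB145888, JD748452, CX12345, LM45855]
    }
}
```

Note that because the lookahead keeps the last occurrence of each value, the results come out in the order of those last occurrences, not in order of first appearance.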
Wiktor Stribiżew answered Oct 08 '22 08:10


I would use a Map of some sort here, to keep tally of the strings which you encounter. For example:

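import java.util.HashMap;
import java.util.Map;

String ccNumber = "CX12345";
// Keys are the matched numbers; a Map holds each key only once
Map<String, Boolean> ccMap = new HashMap<>();

// matches() tests the whole string, so this checks a single candidate value
if (ccNumber.matches("^(?!CR)(?!cr)[A-Za-z]{2}[0-9]{5,6}$")) {
    ccMap.put(ccNumber, Boolean.TRUE);
}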

Then just iterate over the keyset of the map to get unique credit card numbers which matched the pattern in your regex:

for (String key : ccMap.keySet()) {
    System.out.println("Found a matching credit card: " + key);
}
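
To apply the same idea to a whole email body rather than to one pre-extracted string, one option is to scan with Matcher.find() and collect the hits in a set. The sketch below swaps the Map for a LinkedHashSet so that first-seen order is kept; the class name AccountNumberTally and the method collectUnique are hypothetical, and the regex is the one from the question.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AccountNumberTally {
    // Regex from the question: 2 letters (not CR or cr) followed by 5 or 6 digits.
    private static final Pattern ACCOUNT =
            Pattern.compile("\\b(?!CR)(?!cr)[A-Za-z]{2}[0-9]{5,6}\\b");

    // Scans the text and returns each matched number once, in first-seen order.
    static Set<String> collectUnique(String emailBody) {
        Set<String> seen = new LinkedHashSet<>();
        Matcher m = ACCOUNT.matcher(emailBody);
        while (m.find()) {
            seen.add(m.group()); // add() silently ignores values already in the set
        }
        return seen;
    }

    public static void main(String[] args) {
        String body = "CX12345\nCX14584\nJB145888\nJD748452\nCX12345\nLM45855";
        for (String number : collectUnique(body)) {
            System.out.println("Found a matching account number: " + number);
        }
    }
}
```

With this approach the regex stays simple, and the de-duplication happens in code instead of in a lookahead.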
Tim Biegeleisen answered Oct 08 '22 10:10