
Get unique regex matcher results (without using maps or lists)

Is there a way to get only the unique matches without post-processing the results in a list or a map? I want the matcher itself to produce each distinct match only once.

Sample input/output:

String input = "This is a question from [userName] about finding unique regex matches for [inputString] without using any lists or maps. -[userName].";
Pattern pattern = Pattern.compile("\\[[^\\[\\]]*\\]");
Matcher matcher = pattern.matcher(input);
while (matcher.find()) {
    String tokenName = matcher.group(0);
    System.out.println(tokenName);
}

This will output the following:

[userName]
[inputString]
[userName]

But I want it to output the following:

[userName]
[inputString]
Isaac asked Nov 28 '12 at 20:11




1 Answer

Yes there is. You can combine a negative lookahead and a backreference:

"(\\[[^\\[\\]]*\\])(?!.*\\1)"

This matches a token only if the same text does not occur again later in the string. In effect, you always get the last occurrence of each distinct match, so the results come out in a different order:

[inputString]
[userName]
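
As a rough illustration, the pattern above can be dropped into the loop from the question. This is only a sketch: the class name `LastOccurrenceDemo` and the helper `lastOccurrences` are made up for this example, and the results are collected in a `StringBuilder` purely so the output is easy to inspect.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative, not from the original post.
class LastOccurrenceDemo {
    // A bracketed token that is NOT followed, anywhere later on the line, by the same token.
    private static final Pattern LAST_OCCURRENCE =
            Pattern.compile("(\\[[^\\[\\]]*\\])(?!.*\\1)");

    // Returns each distinct token on its own line, ordered by last occurrence.
    public static String lastOccurrences(String input) {
        StringBuilder out = new StringBuilder();
        Matcher matcher = LAST_OCCURRENCE.matcher(input);
        while (matcher.find()) {
            out.append(matcher.group(1)).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String input = "This is a question from [userName] about finding unique regex matches "
                + "for [inputString] without using any lists or maps. -[userName].";
        System.out.print(lastOccurrences(input));
    }
}
```

Because the lookahead rescans the rest of the input at every candidate match, this can get slow on long inputs with many tokens.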

If the order is a problem for you (i.e. if it's crucial to order them by first occurrence), you won't be able to do this using regex only. You would need a variable-length lookbehind for that, and that is not supported by Java.
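
If first-occurrence order matters and a small amount of non-regex code is acceptable, one possible workaround (not part of the regex-only approach above) is to keep the original pattern and check each match against the input with `String.indexOf`. The sketch below assumes that any earlier occurrence of the matched text would itself have been a match, which holds for this bracket pattern but not for arbitrary patterns. The names `FirstOccurrenceDemo` and `firstOccurrences` are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; it uses no lists or maps, only the input string itself.
class FirstOccurrenceDemo {
    private static final Pattern TOKEN = Pattern.compile("\\[[^\\[\\]]*\\]");

    // Returns each distinct token on its own line, ordered by first occurrence.
    public static String firstOccurrences(String input) {
        StringBuilder out = new StringBuilder();
        Matcher matcher = TOKEN.matcher(input);
        while (matcher.find()) {
            String token = matcher.group();
            // Keep the match only if the same text does not appear earlier in the input.
            if (input.indexOf(token) == matcher.start()) {
                out.append(token).append('\n');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String input = "This is a question from [userName] about finding unique regex matches "
                + "for [inputString] without using any lists or maps. -[userName].";
        System.out.print(firstOccurrences(input));
    }
}
```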

Further reading:

  • Lookarounds
  • Backreferences

Some notes on a general solution

Note that this will work with any pattern whose matches are of non-zero width. The general solution is simply:

(yourPatternHere)(?!.*\1)

(I left out the double backslash, because that only applies to a few languages.)

If you want it to work with patterns that have zero-width matches (for example, because you only care about a position and your pattern consists solely of lookarounds), you could do this:

(zeroWidthPatternHere)(?!.+\1)

Also note that, in general, you may need the "singleline" or "dotall" option if your input can contain line breaks; otherwise the lookahead only checks the rest of the current line. If you cannot or do not want to enable that option (because your pattern includes periods that should not match line breaks, or because your regex flavor has no such option, as was the case for JavaScript when this was written), this is the general solution:

(yourPatternHere)(?![\s\S]*\1)
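
To see the effect of line breaks in Java, the following sketch counts matches for three variants of the pattern on a two-line input. The class `MultilineUniqueDemo` and the helper `countMatches` are invented for this example; `Pattern.DOTALL` is Java's name for the "dotall" option.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class comparing the lookahead with and without dotall behavior.
class MultilineUniqueDemo {
    // The bracket-token pattern from the question, wrapped in a capturing group.
    public static final String TOKEN = "(\\[[^\\[\\]]*\\])";

    // Counts how many times the given regex matches in the input.
    public static int countMatches(String regex, int flags, String input) {
        Matcher matcher = Pattern.compile(regex, flags).matcher(input);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String input = "Line one mentions [userName].\nLine two mentions [userName] again.";

        // '.' stops at the line break, so the duplicate on line two goes unnoticed.
        System.out.println(countMatches(TOKEN + "(?!.*\\1)", 0, input));
        // With DOTALL, '.' also matches line breaks and the lookahead sees the whole input.
        System.out.println(countMatches(TOKEN + "(?!.*\\1)", Pattern.DOTALL, input));
        // [\s\S] matches any character without needing a flag.
        System.out.println(countMatches(TOKEN + "(?![\\s\\S]*\\1)", 0, input));
    }
}
```

The first variant reports the token twice, while the other two report it once.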

And to make this answer even more widely applicable, here is how you could match only the first occurrence of every match (in an engine with variable-length lookbehinds, like .NET):

(yourPatternHere)(?<!\1.*\1)
or
(yourPatternHere)(?<!\1[\s\S]*\1)
Martin Ender answered Oct 13 '22 at 00:10