
Java regex: why use character class intersection?

I took the following excerpt from the Oracle tutorial on Java regular expressions:

Intersections

To create a single character class matching only the characters common to all of its nested classes, use &&, as in [0-9&&[345]]. This particular intersection creates a single character class matching only the numbers common to both character classes: 3, 4, and 5.

Enter your regex: [0-9&&[345]]
Enter input string to search: 3
I found the text "3" starting at index 0 and ending at index 1.

Why would this be useful? If one only wants to match 3, 4, or 5, why not simply write [345] instead of the intersection?

Thanks in advance.

Rollerball asked Oct 05 '22 10:10


1 Answer

Let us consider a simple problem: match English consonants in a string. Listing out all consonants (or a list of ranges) would be one way:

[B-DF-HJ-NP-TV-Zb-df-hj-np-tv-z]

Another way is to use look-around:

(?=[A-Za-z])[^AEIOUaeiou]
(?![AEIOUaeiou])[A-Za-z]

I am not sure whether there is any other way to do this without character class intersection.

Character class intersection solution (Java):

[A-Za-z&&[^AEIOUaeiou]]
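
As a rough illustration of the three approaches above, here is a minimal Java sketch that runs each pattern over the same input and collects the matches. The class name ConsonantDemo, the helper findAll, and the sample string are hypothetical and only there for demonstration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ConsonantDemo {
    // Hypothetical helper: returns every match of the regex in the input, in order.
    public static List<String> findAll(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "Hello, World 42!";
        // Explicit list of consonant ranges.
        System.out.println(findAll("[B-DF-HJ-NP-TV-Zb-df-hj-np-tv-z]", input));
        // Negative look-ahead in front of a plain letter class.
        System.out.println(findAll("(?![AEIOUaeiou])[A-Za-z]", input));
        // Intersection of "letters" with "not a vowel".
        System.out.println(findAll("[A-Za-z&&[^AEIOUaeiou]]", input));
    }
}
```

For this input, all three lines should print [H, l, l, W, r, l, d]. The intersection form states the intent (letters that are not vowels) most directly.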

For .NET, there is no intersection, but there is character class subtraction:

[A-Za-z-[AEIOUaeiou]]

I don't know the implementation details, but I wouldn't be surprised if character class intersection/subtraction is faster than look-around, which is the cleanest alternative when character class operations are not available.

Another possible use is when you have a pre-built character class and want to remove some characters from it. One case I have come across where class intersection might apply is matching all whitespace characters except newline.
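
A minimal sketch of that whitespace case, assuming the pattern [\s&&[^\n]] (the predefined class \s intersected with "anything but a line feed"). The class name, method name, and sample string are hypothetical.

```java
public class WhitespaceDemo {
    // Collapses each run of whitespace other than '\n' into a single space,
    // leaving line feeds untouched.
    public static String collapseHorizontalWhitespace(String input) {
        return input.replaceAll("[\\s&&[^\\n]]+", " ");
    }

    public static void main(String[] args) {
        String input = "a \t b\n  c";
        // The space-tab-space run and the two leading spaces on the second
        // line each become one space; the line feed is preserved.
        System.out.println(collapseHorizontalWhitespace(input));
    }
}
```

Note that \s also includes the carriage return, so with Windows-style line endings the \r would be treated as ordinary whitespace here. Excluding it as well would need [^\n\r] on the right-hand side.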

Another possible use case, as @beerbajay commented:

I think the built-in character classes are the main use case, e.g. [\p{InGreek}&&\p{Ll}] for lowercase Greek letters.
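
A small sketch of the pattern from that comment, intersecting the Greek Unicode block with the lowercase-letter category. The class name, helper name, and sample text are hypothetical. The Greek letters are written as Unicode escapes so the source does not depend on file encoding.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreekLowercaseDemo {
    // Characters that are both in the Greek block and lowercase letters.
    private static final Pattern GREEK_LOWER =
            Pattern.compile("[\\p{InGreek}&&\\p{Ll}]");

    // Hypothetical helper: collects every lowercase Greek letter in the input.
    public static List<String> findGreekLowercase(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = GREEK_LOWER.matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // Capital alpha, then theta, eta with tonos, nu, alpha, then Latin text.
        String input = "\u0391\u03B8\u03AE\u03BD\u03B1 abc";
        // The capital alpha and the Latin letters are skipped.
        System.out.println(findGreekLowercase(input));
    }
}
```

Neither \p{InGreek} nor \p{Ll} alone is enough here: the first also matches uppercase Greek letters, and the second also matches lowercase Latin letters.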

nhahtdh answered Oct 13 '22 11:10