Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Regular expression matching any subset of a given set?

Is it possible to write a regular expression which will match any subset of a given set of characters
a1 ... an ?
I.e. it should match any string where any of these characters appears at most once, there are no other characters and the relative order of the characters doesn't matter.

Some approaches that arise at once:
1. [a1,...,an]* or (a1|a2|...|an)*- this allows multiple presence of characters
2. (a1?a2?...an?) - no multiple presence, but relative order is important - this matches any subsequence but not subset.
3. ($|a1|...|an|a1a2|a2a1|...|a1...an|...|an...a1), i.e. write all possible subsequences (just hardcode all matching strings :)) of course, not acceptable.

I also have a guess that it may be theoretically impossible, because during parsing the string we will need to remember which character we have already met before, and as far as I know regular expressions can check out only right-linear languages.

Any help will be appreciated. Thanks in advance.

like image 248
Grigor Gevorgyan Avatar asked Feb 24 '23 04:02

Grigor Gevorgyan


2 Answers

This doesn't really qualify for the language-agnostic tag, but...

^(?:(?!\1)a1()|(?!\2)a2()|...|(?!\n)an())*$

see a demo on ideone.com

The first time an element is matched, it gets "checked off" by the capturing group following it. Because the group has now participated in the match, a negative lookahead for its corresponding backreference (e.g., (?!\1)) will never match again, even though the group only captured an empty string. This is an undocumented feature that is nevertheless supported in many flavors, including Java, .NET, Perl, Python, and Ruby.

This solution also requires support for forward references (i.e., a reference to a given capturing group (\1) appearing in the regex before the group itself). This seems to be a little less widely supported than the empty-groups gimmick.

like image 163
Alan Moore Avatar answered Mar 07 '23 16:03

Alan Moore


Can't think how to do it with a single regex, but this is one way to do it with n regexes: (I will usr 1 2 ... m n etc for your as)

^[23..n]*1?[23..n]*$
^[13..n]*2?[13..n]*$
...
^[12..m]*n?[12..m]*$

If all the above match, your string is a strict subset of 12..mn.

How this works: each line requires the string to consist exactly of:

  • any number of charactersm drawn fromthe set, except a particular one
  • perhaps a particular one
  • any number of charactersm drawn fromthe set, except a particular one

If this passes when every element in turn is considered as a particular one, we know:

  • there is nothing else in the string except the allowed elements
  • there is at most one of each of the allowed elements

as required.


for completeness I should say that I would only do this if I was under orders to "use regex"; if not, I'd track which allowed elements have been seen, and iterate over the characters of the string doing the obvious thing.

like image 24
AakashM Avatar answered Mar 07 '23 16:03

AakashM