python regex where a set of options can occur at most once in a list, in any order

Tags: python, regex, perl

I'm wondering if there's any way in Python or Perl to build a regex in which each of a set of options can appear at most once, in any order. For example, I would like a derivative of foo(?: [abc])*, where a, b, and c could each appear only once. So:

foo a b c
foo b c a
foo a b
foo b

would all be valid, but

foo b b

would not be

asked Oct 07 '21 19:10

HardcoreHenry

6 Answers

You may use a regex with a capture group and a negative lookahead. For Perl, you can use this variant with forward referencing:

^foo((?!.*\1) [abc])+$

RegEx Details:

^: Start
foo: Match foo
(: Start a capture group #1
- (?!.*\1): Negative lookahead asserting that the text captured by group #1 on the previous iteration does not occur again anywhere later in the input
- [abc]: Match a space followed by a or b or c
)+: End capture group #1. Repeat this group 1+ times
$: End

As mentioned earlier, this regex uses a feature called forward referencing: a back-reference to a group that is closed later in the pattern (here, the group that encloses the reference). JGsoft, .NET, Java, Perl, PCRE, PHP, Delphi, and Ruby allow forward references, but Python doesn't.

Here is a workaround for Python that performs the same check without forward referencing:

^foo(?!.* ([abc]).*\1)(?: [abc])+$

Here a negative lookahead placed before the repeated group fails the match if any of the allowed substrings, i.e. [abc], occurs more than once.
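
As a rough check of the two claims above, the following sketch tries to compile the forward-reference pattern with Python's re module and then runs the workaround on the sample lines from the question. The names forward_ref_ok, workaround, samples, and results are illustrative only.

```python
import re

# The forward-reference form refers to group 1 while it is still open;
# Python's re module rejects this when the pattern is compiled.
try:
    re.compile(r"^foo((?!.*\1) [abc])+$")
    forward_ref_ok = True
except re.error:
    forward_ref_ok = False

# The Python workaround quoted above.
workaround = re.compile(r"^foo(?!.* ([abc]).*\1)(?: [abc])+$")

samples = ["foo a b c", "foo b c a", "foo a b", "foo b", "foo b b"]
results = {line: bool(workaround.match(line)) for line in samples}

print(forward_ref_ok)
for line, ok in results.items():
    print(f"{line!r}: {ok}")
```

Only the last sample, which repeats b, should be rejected.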

answered Oct 17 '22 04:10

anubhava

You can assert that no space-and-letter pair occurs a second time further to the right:

foo(?!(?: [abc])*( [abc])(?: [abc])*\1)(?: [abc])*

foo Match literally
(?! Negative lookahead
- (?: [abc])* Match optional repetitions of a space followed by a, b, or c
- ( [abc]) Capture group 1, compared later against the backreference
- (?: [abc])* Match optional repetitions of a space followed by a, b, or c
- \1 Backreference to group 1
) Close lookahead
(?: [abc])* Match optional repetitions of a space followed by a, b, or c

If you don't want to match foo on its own, you can change the quantifier to 1 or more: (?: [abc])+
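
Because the pattern above has no anchors, how it is applied matters. The sketch below assumes Python's re module and a hypothetical helper is_valid; it uses fullmatch so that the whole line must conform, and contrasts that with search, which can accept a valid prefix of an otherwise invalid line.

```python
import re

# Unanchored pattern from this answer.
pattern = re.compile(r"foo(?!(?: [abc])*( [abc])(?: [abc])*\1)(?: [abc])*")

def is_valid(line):
    # fullmatch requires the pattern to cover the entire line.
    return pattern.fullmatch(line) is not None

# search only needs some substring to match, so it stops before the stray "b".
partial = pattern.search("foo a bb")

print(is_valid("foo a b c"), is_valid("foo b b"), is_valid("foo a bb"))
print(partial.group())
```

Adding ^ and $ to the pattern is an alternative to calling fullmatch.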

A variant in Perl reuses the first subpattern with (?1), which refers to capture group 1, ([abc]):

^foo ([abc])(?: (?!\1)((?1))(?: (?!\1|\2)(?1))?)?$

answered Oct 17 '22 02:10

The fourth bird

If it doesn't have to be a regex:

import collections

# python >=3.10
def is_a_match(sentence):
    words = sentence.split()
    return (
      (len(words) > 0)
      and (words[0] == 'foo')
      and (collections.Counter(words) <= collections.Counter(['foo', 'a', 'b', 'c']))
    )

# python <3.10
def is_a_match(sentence):
    words = sentence.split()
    return (
      (len(words) > 0)
      and (words[0] == 'foo')
      and not (collections.Counter(words) - collections.Counter(['foo', 'a', 'b', 'c']))
    )

# TESTING
#foo a b c True
#foo b c a True
#foo a b True
#foo b True
#foo b b False

Or with a set and the walrus operator (Python 3.8 or later):

def is_a_match(sentence):
    words = sentence.split()
    return (
      (len(words) > 0)
      and (words[0] == 'foo')
      and (
        (s := set(words[1:])) <= set(['a', 'b', 'c'])
        and len(s) == len(words) - 1
      )
    )
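
The same idea can be parameterised. This is a minimal sketch, assuming the leading word and the allowed options are passed in as the hypothetical arguments head and options; it is not part of the answer's original code. Note that str.split() accepts any run of whitespace between words, so this is slightly more lenient than the single-space regexes.

```python
from collections import Counter

def is_a_match(sentence, head="foo", options=("a", "b", "c")):
    words = sentence.split()
    # The line must start with the expected leading word.
    if not words or words[0] != head:
        return False
    counts = Counter(words[1:])
    # Every remaining word must be an allowed option that is used exactly once.
    return all(word in options and n == 1 for word, n in counts.items())

for line in ["foo a b c", "foo b c a", "foo a b", "foo b", "foo b b"]:
    print(line, is_a_match(line))
```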

answered Oct 17 '22 02:10

Stef

You can do it using references to previously captured groups.

foo(?: ([abc]))?(?: (?!\1)([abc]))?(?: (?!\1|\2)([abc]))?$

This gets quite long with many options. Such a regex can be generated dynamically, if necessary.

def match_sequence_without_repeats(options, separator):
    def prevent_previous(n):
        # The first group has no earlier captures to exclude.
        if n == 0:
            return ""
        # Join the backreferences with "|" so that any earlier capture is rejected.
        groups = "|".join(rf"\{i}" for i in range(1, n + 1))
        return f"(?!{groups})"

    return "".join(
        f"(?:{separator}{prevent_previous(i)}([{options}]))?"
        for i in range(len(options))
    )


print(f"foo{match_sequence_without_repeats('abc', ' ')}$")
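
The chained-group idea can also be extended to options longer than one character. The sketch below is an assumption-laden variant, not the answer's own code: build_pattern, prefix, and the sample flag names are hypothetical, options are escaped with re.escape, and each lookahead additionally checks that the earlier capture is followed by the separator or the end of the string so that one option being a prefix of another does not cause a false rejection.

```python
import re

def build_pattern(prefix, options, separator=" "):
    sep = re.escape(separator)
    alternatives = "|".join(re.escape(opt) for opt in options)
    parts = []
    for i in range(len(options)):
        guard = ""
        if i > 0:
            # Reject any value already captured by groups 1..i.
            earlier = "|".join(rf"\{n}" for n in range(1, i + 1))
            guard = rf"(?!(?:{earlier})(?:{sep}|$))"
        parts.append(rf"(?:{sep}{guard}({alternatives}))?")
    return re.compile("^" + re.escape(prefix) + "".join(parts) + "$")

letters = build_pattern("foo", ["a", "b", "c"])
flags = build_pattern("cmd", ["--fast", "--force"])

print(bool(letters.match("foo b c a")), bool(letters.match("foo a b a")))
print(bool(flags.match("cmd --force --fast")), bool(flags.match("cmd --fast --fast")))
```

The pattern still grows with the number of options, since it needs one optional group per option.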

answered Oct 17 '22 04:10

LeopardShark

Here is a modified version of anubhava's answer, using a backreference (which works in Python, and is easier to understand at least for me) instead of a forward reference.

Match using [abc] inside a capturing group, then check that the text matched by the capturing group does not appear again anywhere after it:

^foo(?:( [abc])(?!.*\1))+$

^: Start
foo: Match foo
(?:: Start non-capturing group (?:( [abc])(?!.*\1))
- ( [abc]): Capturing Group 1, matching a space followed by either a, b, or c
- (?!.*\1): Negative lookahead, failing to match if the text matched by the first capturing group occurs after zero or more characters matched by .
)+: End non-capturing group and match it 1 or more times
$: End
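
One way to gain confidence in the Python-compatible patterns from this thread is to compare them against a plain-Python reference on every short input. The sketch below does that for all sequences of one to four letters. The second pattern is the space-and-letter lookahead with ^, $, and the + quantifier added (an adaptation made here for comparability), and the names patterns, reference, and disagreements are illustrative.

```python
import re
from itertools import product

patterns = {
    "lookahead before the group": r"^foo(?!.* ([abc]).*\1)(?: [abc])+$",
    "space-and-letter lookahead": r"^foo(?!(?: [abc])*( [abc])(?: [abc])*\1)(?: [abc])+$",
    "backreference after each item": r"^foo(?:( [abc])(?!.*\1))+$",
}

def reference(tokens):
    # Valid when no letter is repeated.
    return len(set(tokens)) == len(tokens)

disagreements = []
for length in range(1, 5):
    for tokens in product("abc", repeat=length):
        line = "foo " + " ".join(tokens)
        for name, pattern in patterns.items():
            if bool(re.match(pattern, line)) != reference(tokens):
                disagreements.append((name, line))

print(disagreements)
```

An empty list means all three patterns agree with the reference on these inputs.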

answered Oct 17 '22 03:10

irregular espresso

I have assumed that the elements of the string can be in any order and that each element of interest may appear at most once. For example, 'a foo' should match and 'a foo b foo' should not.

You can do that with a series of alternations employing lookaheads, one for each substring of interest, but it becomes a bit of a dog's breakfast when there are many strings to consider. Suppose you wanted to allow at most one "foo" and at most one "a". You could use the following regular expression:

^(?:(?!.*\bfoo\b)|(?=(?:(?!\bfoo\b).)*\bfoo\b(?!(.*\bfoo\b))))(?:(?!.*\ba\b)|(?=(?:(?!\ba\b).)*\ba\b(?!(.*\ba\b))))

Start your engine!

This matches, for example, 'foofoo', 'aa' and 'afooa', because none of them contains 'foo' or 'a' as a whole word. If they are not to be matched, remove the word boundaries (\b).

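Notice that this expression begins by asserting the start of the string (^), followed by two non-capturing groups, one for 'foo' and one for 'a'. Each group is an alternation between a negative lookahead (the word is absent) and a positive lookahead (the word occurs once and is not followed by another occurrence). To also check for, say, 'c' one would tack on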

(?:(?!.*\bc\b)|(?=(?:(?!\bc\b).)*\bc\b(?!(.*\bc\b))))

which is the same as

(?:(?!.*\ba\b)|(?=(?:(?!\ba\b).)*\ba\b(?!(.*\ba\b))))

with \ba\b changed to \bc\b.
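
Since each word contributes the same fragment, the expression can be assembled programmatically. This is a minimal Python sketch under that assumption; at_most_once and build are hypothetical helper names, and the redundant capture group inside the final negative lookahead has been dropped.

```python
import re

def at_most_once(word):
    w = rf"\b{re.escape(word)}\b"
    # Either the word is absent, or it occurs once and never again.
    return rf"(?:(?!.*{w})|(?=(?:(?!{w}).)*{w}(?!.*{w})))"

def build(words):
    # The pattern consists only of lookaheads, so a successful match is empty.
    return re.compile("^" + "".join(at_most_once(w) for w in words))

pattern = build(["foo", "a"])

for text in ["a foo", "a foo b foo", "a b a", "foofoo", "aa", "afooa"]:
    print(repr(text), pattern.match(text) is not None)
```

Words that are not listed, such as b here, are not restricted at all.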

It would be nice to be able to use back-references but I don't see how that could be done.

By hovering over the regular expression in the link an explanation is provided for each element of the expression. (If this is not clear I am referring to the cursor.)

Note that

(?!\bfoo\b).

matches a character provided it does not begin the word 'foo'. Therefore

(?:(?!\bfoo\b).)*

matches a substring that does not contain the word 'foo' and cannot extend past the 'f' that begins one.

Would I advocate this approach in practice, as opposed to using simple string methods? Let me ponder that.

answered Oct 17 '22 04:10

Cary Swoveland
