
Python pattern-matching. Match 'c[any number of consecutive a's, b's, or c's or b's, c's, or a's etc.]t'

Sorry about the title, I couldn't come up with a clean way to ask my question.

In Python I would like to match an expression 'c[some stuff]t', where [some stuff] could be any number of consecutive a's, b's, or c's and in any order.

For example, these work: 'ct', 'cat', 'cbbt', 'caaabbct', 'cbbccaat'

but these don't: 'cbcbbaat', 'caaccbabbt'

Edit: a's, b's, and c's are just an example but I would really like to be able to extend this to more letters. I'm interested in regex and non-regex solutions.

Usagi asked Jul 11 '11 17:07




2 Answers

Not thoroughly tested, but I think this should work:

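import re

words = ['ct', 'cat', 'cbbt', 'caaabbct', 'cbbccaat', 'cbcbbaat', 'caaccbabbt']
# ([abc])\1* captures one run of a single letter; (?!.*\1) rejects the word
# if that same letter shows up again anywhere after the run.
pat = re.compile(r'^c(?:([abc])\1*(?!.*\1))*t$')
for w in words:
    print(w, "matches" if pat.match(w) else "doesn't match")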

#ct matches
#cat matches
#cbbt matches
#caaabbct matches
#cbbccaat matches
#cbcbbaat doesn't match
#caaccbabbt doesn't match

This matches runs of a, b or c (that's the ([abc])\1* part), while the negative lookahead (?!.*\1) makes sure no other instance of that character is present after the run.

(edit: fixed a typo in the explanation)
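Since the question asks about extending this to more letters, here is a minimal sketch of how the same pattern could be built for an arbitrary alphabet. The helper name `build_pattern` and its parameters are hypothetical and not part of the original answer. It assumes `letters` is non-empty and that the closing character is not itself one of the allowed middle letters (otherwise the lookahead would always see the final character and reject).

```python
import re

def build_pattern(letters, first="c", last="t"):
    """Hypothetical helper: same regex as above, for any set of middle letters.

    Assumes `letters` is non-empty and does not contain `last`.
    """
    # Escape each letter so characters such as '-' or ']' are safe in the class.
    char_class = "[" + "".join(re.escape(ch) for ch in sorted(set(letters))) + "]"
    return re.compile(
        "^" + re.escape(first)
        + "(?:(" + char_class + r")\1*(?!.*\1))*"
        + re.escape(last) + "$"
    )

pat = build_pattern("abcd")
for w in ["caaddbbt", "cadat"]:
    print(w, "matches" if pat.match(w) else "doesn't match")
```

With letters "abcd", 'caaddbbt' should match (each letter forms a single run) while 'cadat' should not (the a's are split into two runs).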

mhyfritz answered Oct 12 '22 11:10



Not sure how attached you are to regex, but here is a solution using a different method:

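from itertools import groupby

words = ['ct', 'cat', 'cbbt', 'caaabbct', 'cbbccaat', 'cbcbbaat', 'caaccbabbt']
for w in words:
    match = False
    if w.startswith('c') and w.endswith('t'):
        temp = w[1:-1]  # the characters between the leading 'c' and trailing 't'
        s = set(temp)
        # groupby() yields one group per consecutive run, so the counts are
        # equal only when every letter appears in exactly one run.
        match = s <= set('abc') and len(s) == len(list(groupby(temp)))
    print(w, "matches" if match else "doesn't match")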

The string matches if a set of the middle characters is a subset of set('abc') and the number of groups returned by groupby() is the same as the number of elements in the set.
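To reuse this check with a larger alphabet, the same logic could be wrapped in a function. This is only a sketch: the name `is_match` and its parameters are hypothetical, and it assumes `first` and `last` are single characters.

```python
from itertools import groupby

def is_match(word, letters="abc", first="c", last="t"):
    """Hypothetical wrapper around the groupby() check described above."""
    # Need at least the opening and closing characters.
    if len(word) < 2 or not (word.startswith(first) and word.endswith(last)):
        return False
    middle = word[1:-1]
    seen = set(middle)
    # One key per consecutive run of identical characters.
    runs = [key for key, _ in groupby(middle)]
    # Valid when only allowed letters occur and each occurs in exactly one run.
    return seen <= set(letters) and len(runs) == len(seen)

for w in ['ct', 'cat', 'cbbccaat', 'cbcbbaat']:
    print(w, "matches" if is_match(w) else "doesn't match")
```

Calling it with a wider alphabet, for example `is_match('caaddbbt', letters='abcd')`, applies the same rule to more letters without changing the function.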

Andrew Clark answered Oct 12 '22 11:10
