Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How can I use Regex to find a string of characters in alphabetical order using Python?

So I have a challenge I'm working on - find the longest string of alphabetical characters in a string. For example, "abcghiijkyxz" should result in "ghiijk" (Yes the i is doubled).

I've been doing quite a bit with loops to solve the problem - iterating over the entire string, then for each character, starting a second loop using lower and ord. No help needed writing that loop.

However, it was suggested to me that Regex would be great for this sort of thing. My regex is weak (I know how to grab a static set, my look-forwards knowledge extends to knowing they exist). How would I write a Regex to look forward, and check future characters for being next in alphabetical order? Or is the suggestion to use Regex not practical for this type of thing?

Edit: The general consensus seems to be that Regex is indeed terrible for this type of thing.

like image 880
Selkie Avatar asked Feb 02 '18 18:02

Selkie


2 Answers

Just to demonstrate why regex is not practical for this sort of thing, here is a regex that would match ghiijk in your given example of abcghiijkyxz. Note it'll also match abc, y, x, z since they should technically be considered for longest string of alphabetical characters in order. Unfortunately, you can't determine which is the longest with regex alone, but this does give you all the possibilities. Please note that this regex works for PCRE and will not work with python's re module! Also, note that python's regex library does not currently support (*ACCEPT). Although I haven't tested, the pyre2 package (python wrapper for Google's re2 pyre2 using Cython) claims it supports the (*ACCEPT) control verb, so this may currently be possible using python.

See regex in use here

((?:a+(?(?!b)(*ACCEPT))|b+(?(?!c)(*ACCEPT))|c+(?(?!d)(*ACCEPT))|d+(?(?!e)(*ACCEPT))|e+(?(?!f)(*ACCEPT))|f+(?(?!g)(*ACCEPT))|g+(?(?!h)(*ACCEPT))|h+(?(?!i)(*ACCEPT))|i+(?(?!j)(*ACCEPT))|j+(?(?!k)(*ACCEPT))|k+(?(?!l)(*ACCEPT))|l+(?(?!m)(*ACCEPT))|m+(?(?!n)(*ACCEPT))|n+(?(?!o)(*ACCEPT))|o+(?(?!p)(*ACCEPT))|p+(?(?!q)(*ACCEPT))|q+(?(?!r)(*ACCEPT))|r+(?(?!s)(*ACCEPT))|s+(?(?!t)(*ACCEPT))|t+(?(?!u)(*ACCEPT))|u+(?(?!v)(*ACCEPT))|v+(?(?!w)(*ACCEPT))|w+(?(?!x)(*ACCEPT))|x+(?(?!y)(*ACCEPT))|y+(?(?!z)(*ACCEPT))|z+(?(?!$)(*ACCEPT)))+)

Results in:

abc
ghiijk
y
x
z

Explanation of a single option, i.e. a+(?(?!b)(*ACCEPT)):

  • a+ Matches a (literally) one or more times. This catches instances where several of the same characters are in sequence such as aa.
  • (?(?!b)(*ACCEPT)) If clause evaluating the condition.
    • (?!b) Condition for the if clause. Negative lookahead ensuring what follows is not b. This is because if it's not b, we want the following control verb to take effect.
    • (*ACCEPT) If the condition (above) is met, we accept the current solution. This control verb causes the regex to end successfully, skipping the rest of the pattern. Since this token is inside a capturing group, only that capturing group is ended successfully at that particular location, while the parent pattern continues to execute.

So what happens if the condition is not met? Well, that means that (?!b) evaluated to false. This means that the following character is, in fact, b and so we allow the matching (rather capturing in this instance) to continue. Note that the entire pattern is wrapped in (?:)+ which allows us to match consecutive options until the (*ACCEPT) control verb or end of line is met.

The only exception to this whole regular expression is that of z. Being that it's the last character in the English alphabet (which I presume is the target of this question), we don't care what comes after, so we can simply put z+(?(?!$)(*ACCEPT)), which will ensure nothing matches after z. If you, instead, want to match za (circular alphabetical order matching - idk if this is the proper terminology, but it sounds right to me) you can use z+(?(?!a)(*ACCEPT)))+ as seen here.

like image 95
ctwheels Avatar answered Oct 15 '22 21:10

ctwheels


Generate all the regex substrings like ^a+b+c+$ (longest to shortest). Then match each of those regexs against all the substrings (longest to shortest) of "abcghiijkyxz" and stop at the first match.

def all_substrings(s):
    n = len(s)
    for i in xrange(n, 0, -1):
        for j in xrange(n - i + 1):
            yield s[j:j + i]

def longest_alphabetical_substring(s):
    for t in all_substrings("abcdefghijklmnopqrstuvwxyz"):
        r = re.compile("^" + "".join(map(lambda x: x + "+", t)) + "$")
        for u in all_substrings(s):
            if r.match(u):
                return u

print longest_alphabetical_substring("abcghiijkyxz")

That prints "ghiijk".

like image 44
Waxrat Avatar answered Oct 15 '22 23:10

Waxrat