
Parse 4th capital letter of line in Python?

Tags:

python

How can I parse lines of text from the 4th occurrence of a capital letter onward? For example, given the lines:

adsgasdlkgasYasdgjaUUalsdkjgaZsdalkjgalsdkjTlaksdjfgasdkgj
oiwuewHsajlkjfasNasldjgalskjgasdIasdllksjdgaPlsdakjfsldgjQ

I would like to capture:

`ZsdalkjgalsdkjTlaksdjfgasdkgj`
`PlsdakjfsldgjQ`

I'm sure there is probably a better way than regular expressions, but I attempted a non-greedy match, something like this:

match = re.search(r'[A-Z].*?$', line).group()
drbunsen asked Nov 30 '22 06:11



2 Answers

I present two approaches.

Approach 1: all-out regex

In [1]: import re

In [2]: s = 'adsgasdlkgasYasdgjaUUalsdkjgaZsdalkjgalsdkjTlaksdjfgasdkgj'

In [3]: re.match(r'(?:.*?[A-Z]){3}.*?([A-Z].*)', s).group(1)
Out[3]: 'ZsdalkjgalsdkjTlaksdjfgasdkgj'

The .*?[A-Z] consumes characters up to, and including, the next uppercase letter.

The (?:...){3} repeats the above three times without creating any capture groups.

The following .*? matches the remaining characters before the fourth uppercase letter.

Finally, the ([A-Z].*) captures the fourth uppercase letter and everything that follows into a capture group.
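As a minimal sketch, the pattern explained above can be wrapped in a helper that takes the occurrence number as a parameter. The name `from_nth_capital` and the choice to return an empty string when a line has fewer than `n` capitals are assumptions made for illustration; they are not part of the original answer.

```python
import re

def from_nth_capital(line, n=4):
    """Return line from its n-th ASCII capital letter onward, or "" if there is none.

    Hypothetical helper: it reuses the pattern above with n - 1 as the repeat count.
    """
    pattern = r'(?:.*?[A-Z]){%d}.*?([A-Z].*)' % (n - 1)
    m = re.match(pattern, line)
    return m.group(1) if m else ""

lines = [
    'adsgasdlkgasYasdgjaUUalsdkjgaZsdalkjgalsdkjTlaksdjfgasdkgj',
    'oiwuewHsajlkjfasNasldjgalskjgasdIasdllksjdgaPlsdakjfsldgjQ',
]
for line in lines:
    print(from_nth_capital(line))
# ZsdalkjgalsdkjTlaksdjfgasdkgj
# PlsdakjfsldgjQ
```

Checking the match object before calling `.group(1)` avoids an `AttributeError` on lines that contain fewer than four capitals.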

Approach 2: simpler regex

In [1]: import re

In [2]: s = 'adsgasdlkgasYasdgjaUUalsdkjgaZsdalkjgalsdkjTlaksdjfgasdkgj'

In [3]: ''.join(re.findall(r'[A-Z][^A-Z]*', s)[3:])
Out[3]: 'ZsdalkjgalsdkjTlaksdjfgasdkgj'

This attacks the problem directly, and I think is easier to read.
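To see why slicing with `[3:]` works here, it may help to look at the intermediate list that `re.findall` returns. The sketch below only prints that list for the sample string; the variable name `chunks` is chosen for illustration.

```python
import re

s = 'adsgasdlkgasYasdgjaUUalsdkjgaZsdalkjgalsdkjTlaksdjfgasdkgj'

# Each chunk is one capital letter followed by the non-capitals after it.
# Anything before the first capital is not matched and is dropped.
chunks = re.findall(r'[A-Z][^A-Z]*', s)
print(chunks)
# ['Yasdgja', 'U', 'Ualsdkjga', 'Zsdalkjgalsdkj', 'Tlaksdjfgasdkgj']

# Skipping the first three chunks leaves everything from the 4th capital onward.
print(''.join(chunks[3:]))
# ZsdalkjgalsdkjTlaksdjfgasdkgj
```

If the string has fewer than four capitals, the slice is empty and the result is an empty string rather than an error.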

NPE answered Dec 06 '22 10:12



Avoiding regular expressions may seem too verbose, although at the bytecode level it is a very simple algorithm, and therefore lightweight.

It may be that regexps are faster, since they are implemented in native code, but the "one obvious way to do it", though boring, certainly beats any suitable regexp in readability hands down:

def find_capital(string, n=4):
    count = 0
    for index, letter in enumerate(string):
        # The boolean value counts as 0 for False or 1 for True
        count += letter.isupper()  
        if count == n:
            return string[index:]
    return ""
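A short usage sketch, assuming the two sample lines from the question: it runs the loop above, checks that it agrees with the `findall` approach from the other answer, and times both with `timeit`. The helper name `find_capital_re` is hypothetical, and no timing figures are claimed because they depend on the machine and the Python version.

```python
import re
import timeit

def find_capital(string, n=4):
    # Same loop as in the answer above
    count = 0
    for index, letter in enumerate(string):
        count += letter.isupper()
        if count == n:
            return string[index:]
    return ""

def find_capital_re(string):
    # The findall approach from the other answer, wrapped for comparison
    # (hypothetical helper name)
    return ''.join(re.findall(r'[A-Z][^A-Z]*', string)[3:])

lines = [
    'adsgasdlkgasYasdgjaUUalsdkjgaZsdalkjgalsdkjTlaksdjfgasdkgj',
    'oiwuewHsajlkjfasNasldjgalskjgasdIasdllksjdgaPlsdakjfsldgjQ',
]
for line in lines:
    assert find_capital(line) == find_capital_re(line)
    print(find_capital(line))
# ZsdalkjgalsdkjTlaksdjfgasdkgj
# PlsdakjfsldgjQ

# Rough timing only; results vary, so none are shown here.
print(timeit.timeit(lambda: find_capital(lines[0]), number=10000))
print(timeit.timeit(lambda: find_capital_re(lines[0]), number=10000))
```

One difference to keep in mind: `str.isupper()` is also true for non-ASCII uppercase letters such as `É`, while `[A-Z]` matches ASCII capitals only, so the two approaches can disagree on such input.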
jsbueno answered Dec 06 '22 08:12
