regex - How to capture a pattern only when a different pattern does not precede it?

Tags: python, regex

I'm trying to parse out prices but ignore two kinds of values that also look like prices. The first exclusion is the total price at the end of the line, which I am using a lookahead to ignore. The second exclusion is any price preceded by a variation of the letter Q, for example Q10.00 or Q AWSMSN11.32. However, I do want to include the price when it follows a three-letter code that happens to end in Q, such as YMQ234.03.

I've added a negative lookbehind but can't seem to get what I want.

This is the pattern I've tried: (?<![Q\d]) ?M?(\d+\.\d{2})(?=.*\d+\.\d{2}END)

Test strings

ABC WS YMQ234.03WS TOY234.03USD468.06END
FUR BB LAB Q10.00 199.00USD209.00END
YAS DG TYY Q AWSMSN11.32 2503.08LD VET Q JKLOLE11.32 2503.08USD5028.80END
PPP VP LAP Q10.00 M342.41EE SFD Q10.00 282.24USD644.65END

Expected output

+---------------------------------------------------------------------------+---------+---------+
| ABC WS YMQ234.03WS TOY234.03USD468.06END                                  | 234.03  | 234.03  |
| FUR BB LAB Q10.00 199.00USD209.00END                                      | 199.00  |         |
| YAS DG TYY Q AWSMSN11.32 2503.08LD VET Q JKLOLE11.32 2503.08USD5028.80END | 2503.08 | 2503.08 |
| PPP VP LAP Q10.00 M342.41EE SFD Q10.00 282.24USD644.65END                 | 342.41  | 282.24  |
+---------------------------------------------------------------------------+---------+---------+
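The cases above can also be written down as a small check, so that any candidate pattern can be compared against the expected output. This is only a sketch: EXPECTED, extract_with and mismatches are hypothetical helper names introduced here, and it assumes the wanted price always ends up in capture group 1.

```python
import re

# Expected prices per test string, copied from the table above.
EXPECTED = {
    "ABC WS YMQ234.03WS TOY234.03USD468.06END": ["234.03", "234.03"],
    "FUR BB LAB Q10.00 199.00USD209.00END": ["199.00"],
    "YAS DG TYY Q AWSMSN11.32 2503.08LD VET Q JKLOLE11.32 2503.08USD5028.80END": ["2503.08", "2503.08"],
    "PPP VP LAP Q10.00 M342.41EE SFD Q10.00 282.24USD644.65END": ["342.41", "282.24"],
}


def extract_with(pattern):
    """Return a function that lists every non-empty group 1 match in a line."""
    compiled = re.compile(pattern)

    def extract(line):
        return [m.group(1) for m in compiled.finditer(line) if m.group(1)]

    return extract


def mismatches(extract):
    """Return (line, got, expected) for every test string that is not parsed as expected."""
    return [
        (line, extract(line), expected)
        for line, expected in EXPECTED.items()
        if extract(line) != expected
    ]


if __name__ == "__main__":
    # The pattern tried in the question; each mismatch is printed for inspection.
    attempt = r"(?<![Q\d]) ?M?(\d+\.\d{2})(?=.*\d+\.\d{2}END)"
    for line, got, expected in mismatches(extract_with(attempt)):
        print(f"{line}\n  got {got}, expected {expected}")
```

A pattern is correct for these cases when mismatches(...) returns an empty list.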
asked Jan 24 '23 21:01 by nobody


2 Answers

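You could use the third-party regex module instead of re, because it supports the backtracking control verbs (*SKIP) and (*F). The first alternative matches a Q-prefixed price (unless the Q ends a three-letter code) and then discards it with (*SKIP)(*F); the second alternative matches the remaining prices, and the negative lookahead rejects the total before END. The pattern: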

Q[A-Z ]*(?<!\b[A-Z]{2}Q)[\d.]+(*SKIP)(*F)|\d+(?:\.\d+)(?!\d*END$)


In Python this could look like:

import regex  # third-party module (pip install regex); the standard re lacks (*SKIP)(*F)
arr = ['ABC WS YMQ234.03WS TOY234.03USD468.06END', 'FUR BB LAB Q10.00 199.00USD209.00END',
       'YAS DG TYY Q AWSMSN11.32 2503.08LD VET Q JKLOLE11.32 2503.08USD5028.80END',
       'PPP VP LAP Q10.00 M342.41EE SFD Q10.00 282.24USD644.65END']
pattern = r'Q[A-Z ]*(?<!\b[A-Z]{2}Q)[\d.]+(*SKIP)(*F)|\d+(?:\.\d+)(?!\d*END$)'
res = [regex.findall(pattern, x) for x in arr]
print(res)

Prints:

[['234.03', '234.03'], ['199.00'], ['2503.08', '2503.08'], ['342.41', '282.24']]
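The reason for switching modules is that the standard library re does not understand the (*SKIP)(*F) verbs. A small check, as a sketch (the helper name re_accepts is made up for illustration), shows re rejecting the pattern at compile time:

```python
import re


def re_accepts(pattern):
    """Return True if the standard-library re module can compile the pattern."""
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True


skip_fail = r'Q[A-Z ]*(?<!\b[A-Z]{2}Q)[\d.]+(*SKIP)(*F)|\d+(?:\.\d+)(?!\d*END$)'
print(re_accepts(skip_fail))      # False: the verbs are not valid re syntax
print(re_accepts(r'\d+\.\d{2}'))  # True: an ordinary pattern compiles fine
```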
answered Jan 27 '23 10:01 by JvdV


You might also match what you don't want, and capture what you do want.

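The first alternative matches a run of optional spaces and uppercase characters that contains a Q, together with the decimal value that follows it. This is the part to discard, so it is matched but not captured.

The exception is handled by a negative lookbehind: the first alternative is rejected when the decimal value is directly preceded by two uppercase characters A-Z followed by Q, so a value such as YMQ234.03 is left for the second alternative.

After the alternation, capture the decimal value in group 1, asserting that it is not directly followed by END.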

\b[A-Z ]*Q[A-Z ]*(?<![A-Z][A-Z]Q)\d+\.\d+|(\d+\.\d{2})(?!END)

Explanation

  • \b[A-Z ]*Q[A-Z ]* Word boundary, match a Q between optional spaces and uppercase chars
  • (?<![A-Z][A-Z]Q) Negative lookbehind, assert not 2 uppercase chars A-Z followed by Q directly to the left
  • \d+\.\d+ Match a decimal value
  • | Or
  • ( Capture group 1
    • \d+\.\d{2} Match 1+ digits followed by a dot and 2 digits
  • ) Close group 1
  • (?!END) Negative lookahead, assert what is directly to the right is not END
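The breakdown above maps directly onto a commented pattern. As a sketch, the same regex can be compiled with re.VERBOSE so that the explanation sits next to each part; PRICE_RE and kept_prices are names chosen here for illustration and are not part of the original answer.

```python
import re

# Same pattern as above, spread out with re.VERBOSE. Whitespace outside
# character classes is ignored, so the space inside [A-Z ] is still literal.
PRICE_RE = re.compile(
    r"""
    \b[A-Z ]*Q[A-Z ]*     # word boundary, a Q between optional spaces and uppercase chars
    (?<![A-Z][A-Z]Q)      # not directly preceded by two uppercase chars followed by Q
    \d+\.\d+              # decimal value to discard (matched, not captured)
    |                     # or
    (\d+\.\d{2})          # group 1: decimal value to keep
    (?!END)               # not directly followed by END (the total)
    """,
    re.VERBOSE,
)


def kept_prices(line):
    """Return the prices captured in group 1, skipping matches of the first alternative."""
    return [m.group(1) for m in PRICE_RE.finditer(line) if m.group(1)]


print(kept_prices("PPP VP LAP Q10.00 M342.41EE SFD Q10.00 282.24USD644.65END"))
# ['342.41', '282.24']
```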

For example

import re

# The first alternative matches the Q-prefixed prices to discard;
# group 1 captures the prices to keep.
pattern = r"\b[A-Z ]*Q[A-Z ]*(?<![A-Z][A-Z]Q)\d+\.\d+|(\d+\.\d{2})(?!END)"
strings = [
    "ABC WS YMQ234.03WS TOY234.03USD468.06END",
    "FUR BB LAB Q10.00 199.00USD209.00END",
    "YAS DG TYY Q AWSMSN11.32 2503.08LD VET Q JKLOLE11.32 2503.08USD5028.80END",
    "PPP VP LAP Q10.00 M342.41EE SFD Q10.00 282.24USD644.65END"
]

for s in strings:
    # Keep only the matches in which group 1 took part.
    kept = [m.group(1) for m in re.finditer(pattern, s) if m.group(1)]
    print('{}: {}'.format(s, kept))

Output

ABC WS YMQ234.03WS TOY234.03USD468.06END: ['234.03', '234.03']
FUR BB LAB Q10.00 199.00USD209.00END: ['199.00']
YAS DG TYY Q AWSMSN11.32 2503.08LD VET Q JKLOLE11.32 2503.08USD5028.80END: ['2503.08', '2503.08']
PPP VP LAP Q10.00 M342.41EE SFD Q10.00 282.24USD644.65END: ['342.41', '282.24']
answered Jan 27 '23 11:01 by The fourth bird