I'm trying to parse prices out of a string while ignoring two patterns that are also prices. The first exclusion is the total price at the end of the string, which I skip with a lookahead. The second exclusion is a price preceded by some variation of the letter Q, for example Q10.00 or Q AWSMSN11.32. However, I do want to include a price preceded by a three-letter alpha code that happens to end in Q, such as YMQ234.03.
I've added a negative lookbehind but can't seem to get what I want.
This is the pattern I've tried: (?<![Q\d]) ?M?(\d+\.\d{2})(?=.*\d+\.\d{2}END)
Test strings:
ABC WS YMQ234.03WS TOY234.03USD468.06END
FUR BB LAB Q10.00 199.00USD209.00END
YAS DG TYY Q AWSMSN11.32 2503.08LD VET Q JKLOLE11.32 2503.08USD5028.80END
PPP VP LAP Q10.00 M342.41EE SFD Q10.00 282.24USD644.65END
Expected output
+---------------------------------------------------------------------------+---------+---------+
| ABC WS YMQ234.03WS TOY234.03USD468.06END | 234.03 | 234.03 |
| FUR BB LAB Q10.00 199.00USD209.00END | 199.00 | |
| YAS DG TYY Q AWSMSN11.32 2503.08LD VET Q JKLOLE11.32 2503.08USD5028.80END | 2503.08 | 2503.08 |
| PPP VP LAP Q10.00 M342.41EE SFD Q10.00 282.24USD644.65END | 342.41 | 282.24 |
+---------------------------------------------------------------------------+---------+---------+
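The expected output above can be turned into a small check, so that any candidate pattern is compared against all four rows at once. This is only a sketch: the names EXPECTED and check_pattern are made up for illustration, and it assumes the wanted price is always in capture group 1 of the candidate pattern.

```python
import re

# Expected prices per test string, copied from the table above.
EXPECTED = {
    "ABC WS YMQ234.03WS TOY234.03USD468.06END": ["234.03", "234.03"],
    "FUR BB LAB Q10.00 199.00USD209.00END": ["199.00"],
    "YAS DG TYY Q AWSMSN11.32 2503.08LD VET Q JKLOLE11.32 2503.08USD5028.80END": ["2503.08", "2503.08"],
    "PPP VP LAP Q10.00 M342.41EE SFD Q10.00 282.24USD644.65END": ["342.41", "282.24"],
}

def check_pattern(pattern):
    """Return {string: (expected, actual)} for every row the pattern gets wrong."""
    compiled = re.compile(pattern)
    failures = {}
    for text, expected in EXPECTED.items():
        # Assumption: group 1 holds the price; skip matches where it did not participate.
        actual = [m.group(1) for m in compiled.finditer(text) if m.group(1)]
        if actual != expected:
            failures[text] = (expected, actual)
    return failures

# The pattern from the question; every row it gets wrong is printed.
attempt = r"(?<![Q\d]) ?M?(\d+\.\d{2})(?=.*\d+\.\d{2}END)"
for text, (expected, actual) in check_pattern(attempt).items():
    print(text, "expected", expected, "got", actual)
```

An empty dict from check_pattern means the pattern reproduces every row of the table.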
You could use the third-party regex module instead of re (the built-in re module does not support the (*SKIP)(*F) backtracking verbs) with the pattern:
Q[A-Z ]*(?<!\b[A-Z]{2}Q)[\d.]+(*SKIP)(*F)|\d+(?:\.\d+)(?!\d*END$)
In Python this could look like:
import regex
arr = ['ABC WS YMQ234.03WS TOY234.03USD468.06END', 'FUR BB LAB Q10.00 199.00USD209.00END', 'YAS DG TYY Q AWSMSN11.32 2503.08LD VET Q JKLOLE11.32 2503.08USD5028.80END', 'PPP VP LAP Q10.00 M342.41EE SFD Q10.00 282.24USD644.65END']
res = [regex.findall(r'Q[A-Z ]*(?<!\b[A-Z]{2}Q)[\d.]+(*SKIP)(*F)|\d+(?:\.\d+)(?!\d*END$)',x) for x in arr]
print(res)
Prints:
[['234.03', '234.03'], ['199.00'], ['2503.08', '2503.08'], ['342.41', '282.24']]
You might also match what you don't want and capture what you do want.
Match optional whitespace and uppercase chars around a Q, followed by the decimal value.
Make an exception so that this unwanted match fails when the digits are directly preceded by 2 uppercase chars A-Z followed by Q.
After the alternation, capture the decimal value in group 1, asserting that it is not directly followed by END.
\b[A-Z ]*Q[A-Z ]*(?<![A-Z][A-Z]Q)\d+\.\d+|(\d+\.\d{2})(?!END)
Explanation
\b[A-Z ]*Q[A-Z ]*    Word boundary, match a Q between optional spaces and uppercase chars
(?<![A-Z][A-Z]Q)     Negative lookbehind, assert not 2 uppercase chars A-Z followed by Q directly to the left
\d+\.\d+             Match a decimal value
|                    Or
(                    Capture group 1
\d+\.\d{2}           Match 1+ digits followed by a dot and 2 digits
)                    Close group 1
(?!END)              Negative lookahead, assert what is directly to the right is not END
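As a sketch, the same breakdown can be kept next to the pattern itself by compiling it with re.VERBOSE. The names PRICE_RE and extract_prices are hypothetical, chosen for this illustration only. The pattern is unchanged apart from layout.

```python
import re

# Same pattern as above, laid out with re.VERBOSE so each part carries its own comment.
# Whitespace inside a character class such as [A-Z ] is still significant in verbose mode.
PRICE_RE = re.compile(r"""
    \b[A-Z ]*Q[A-Z ]*      # word boundary, then a Q between optional spaces and uppercase chars
    (?<![A-Z][A-Z]Q)       # fail here if 2 uppercase chars plus Q sit directly to the left
    \d+\.\d+               # the unwanted decimal value (matched, not captured)
    |                      # or
    (\d+\.\d{2})           # group 1: 1+ digits, a dot and 2 digits
    (?!END)                # not directly followed by END (that would be the total)
""", re.VERBOSE)

def extract_prices(text):
    """Return the group 1 captures, skipping matches of the unwanted alternative."""
    return [m.group(1) for m in PRICE_RE.finditer(text) if m.group(1)]

print(extract_prices("PPP VP LAP Q10.00 M342.41EE SFD Q10.00 282.24USD644.65END"))
# ['342.41', '282.24']
```

Matches of the first alternative leave group 1 unset, which is why the list comprehension filters on m.group(1).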
For example:
import re

# Alternative 1 consumes the unwanted Q-prefixed prices; alternative 2 captures wanted prices in group 1
pattern = r"\b[A-Z ]*Q[A-Z ]*(?<![A-Z][A-Z]Q)\d+\.\d+|(\d+\.\d{2})(?!END)"
strings = [
    "ABC WS YMQ234.03WS TOY234.03USD468.06END",
    "FUR BB LAB Q10.00 199.00USD209.00END",
    "YAS DG TYY Q AWSMSN11.32 2503.08LD VET Q JKLOLE11.32 2503.08USD5028.80END",
    "PPP VP LAP Q10.00 M342.41EE SFD Q10.00 282.24USD644.65END"
]
for s in strings:
    # Keep only the matches where group 1 took part, i.e. the wanted prices
    prices = [m.group(1) for m in re.finditer(pattern, s) if m.group(1)]
    print('{}: {}'.format(s, prices))
Output
ABC WS YMQ234.03WS TOY234.03USD468.06END: ['234.03', '234.03']
FUR BB LAB Q10.00 199.00USD209.00END: ['199.00']
YAS DG TYY Q AWSMSN11.32 2503.08LD VET Q JKLOLE11.32 2503.08USD5028.80END: ['2503.08', '2503.08']
PPP VP LAP Q10.00 M342.41EE SFD Q10.00 282.24USD644.65END: ['342.41', '282.24']