
Regex lookahead AND lookbehind

Tags:

python

regex

I have the following 2 variations of scraped data:

   txt =  '''Käuferprovision: 3 % zzgl. gesetzl. MwSt.''' # variation 1

and

    txt = '''Käuferprovision: Die Courtage i.H.v. % 3,57 inkl. MwSt. ist''' # variation 2

I'd like to write one regular expression that extracts the percentage as a float: 3.0 in the first case and 3.57 in the second.

I've tried this so far:

import re

m = re.search(r'.{3}.%.{5}', txt)
txt = m.group().split("%")[1:]
txt = ("".join(txt)).replace(",", ".")
print(txt)

This works for variation 2 but not for variation 1.

Dr Pi asked Mar 24 '21 19:03


2 Answers

You may try this code to grab your percent values and convert them into float:

>>> import re
>>> arr = ['Käuferprovision: 3 % zzgl. gesetzl. MwSt.', 'Käuferprovision: Die Courtage i.H.v. % 3,57 inkl. MwSt. ist']
>>> rx = re.compile(r'\d+(?:[.,]\d+)*(?=\s*%)|(?<=%)\s*\d+(?:[.,]\d+)*')
>>> for s in arr:
...     for m in rx.finditer(s):
...         print(float(m.group().replace(',', '.')))
...
3.0
3.57


anubhava answered Oct 03 '22 06:10


You might use an alternation with 2 capture groups, and check which group exists.

\b(\d+(?:\,\d+)?)\s*%|%\s*(\d+(?:\,\d+)?)\b

The pattern matches:

  • \b A word boundary
  • (\d+(?:\,\d+)?) Capture group 1: one or more digits with an optional comma-separated decimal part
  • \s*% Optional whitespace chars followed by %
  • | Or
  • %\s* A % followed by optional whitespace chars
  • (\d+(?:\,\d+)?) Capture group 2: the same number pattern, this time after the %
  • \b A word boundary

For example

import re

regex = r"\b(\d+(?:\,\d+)?)\s*%|%\s*(\d+(?:\,\d+)?)\b"
test_str = ("Käuferprovision: 3 % zzgl. gesetzl. MwSt.\n"
            "Käuferprovision: Die Courtage i.H.v. % 3,57 inkl. MwSt. ist")

# Exactly one of the two groups is set for each match.
for match in re.finditer(regex, test_str):
    if match.group(1):
        print(match.group(1).replace(',', '.'))
    else:
        print(match.group(2).replace(',', '.'))

Output

3
3.57
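
Since the question asks for a float rather than a string, the same two-group pattern can be wrapped in a small helper. This is only a minimal sketch: the name `extract_percentage` is hypothetical, and it assumes the comma is the only decimal separator, as in the sample data.

```python
import re

# Same two-group alternation as above (the comma needs no escaping).
PERCENT_RE = re.compile(r"\b(\d+(?:,\d+)?)\s*%|%\s*(\d+(?:,\d+)?)\b")


def extract_percentage(text):
    """Return the first percentage found in text as a float, or None.

    Hypothetical helper; assumes a comma decimal separator.
    """
    match = PERCENT_RE.search(text)
    if match is None:
        return None
    # Only one of the two groups takes part in any given match.
    number = match.group(1) or match.group(2)
    return float(number.replace(",", "."))


print(extract_percentage("Käuferprovision: 3 % zzgl. gesetzl. MwSt."))
print(extract_percentage("Käuferprovision: Die Courtage i.H.v. % 3,57 inkl. MwSt. ist"))
```

Returning None when nothing matches avoids the AttributeError that calling .group() on a failed search would raise.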

If there is always exactly one space between the number and the percent sign, you could also use lookarounds to get the number as the full match, without capture groups.

(?<=% )\b\d+(?:,\d+)?\b|\b\d+(?:,\d+)?(?= %)

Example

import re

# The decimal part is optional in both alternatives.
pattern = r"(?<=% )\b\d+(?:,\d+)?\b|\b\d+(?:,\d+)?(?= %)"
test_str = ("Käuferprovision: 3 % zzgl. gesetzl. MwSt.\n"
            "Käuferprovision: Die Courtage i.H.v. % 3,57 inkl. MwSt. ist")

for s in re.findall(pattern, test_str):
    print(s.replace(",", "."))

Output

3
3.57
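
The fixed-space restriction comes from Python's built-in re module, which only accepts lookbehinds of fixed width. A short sketch of the difference, and of one possible workaround that keeps the optional whitespace outside the lookbehind (the variable names and the extra sample strings are illustrative, not from the question):

```python
import re

# A variable-width lookbehind such as (?<=%\s*) is rejected by the re module.
try:
    re.compile(r"(?<=%\s*)\d+(?:,\d+)?")
    variable_width_ok = True
except re.error as exc:
    variable_width_ok = False
    print("rejected:", exc)

# Workaround: keep the lookbehind fixed-width, consume the optional
# whitespace as part of the match, and strip it afterwards.
pattern = re.compile(r"(?<=%)\s*\d+(?:,\d+)?|\d+(?:,\d+)?(?=\s*%)")
samples = ("3 % zzgl.", "i.H.v. %   3,57 inkl.", "4,5% inkl.")
results = [
    pattern.search(s).group().strip().replace(",", ".")
    for s in samples
]
print(results)
```

The lookahead has no such limit, so (?=\s*%) is fine as written.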

The fourth bird answered Oct 03 '22 07:10