I have the following 2 variations of scraped data:
txt = '''Käuferprovision: 3 % zzgl. gesetzl. MwSt.''' # variation 1
and
txt = '''Käuferprovision: Die Courtage i.H.v. % 3,57 inkl. MwSt. ist''' # variation 2
I'd like a single regular expression that extracts the percentage as a float: 3.0 in the first case and 3.57 in the second.
I've tried this so far:
m = re.search(r'.{3}.%.{5}',txt)
txt = m.group().split("%")[1:]
txt = ("".join(txt)).replace(",",".")
print(txt)
This works for variation 2 but not for variation 1.
You may try this code to grab your percentage values and convert them to floats:
>>> import re
>>> arr = ['Käuferprovision: 3 % zzgl. gesetzl. MwSt.', 'Käuferprovision: Die Courtage i.H.v. % 3,57 inkl. MwSt. ist']
>>> rx = re.compile(r'\d+(?:[.,]\d+)*(?=\s*%)|(?<=%)\s*\d+(?:[.,]\d+)*')
>>> for s in arr:
...     for m in rx.finditer(s):
...         print(float(m.group().replace(',', '.')))
...
3.0
3.57
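The same pattern can be wrapped in a small helper that returns a float directly. This is only a minimal sketch, assuming each string holds at most one percentage of interest; the name `extract_percent` is made up for illustration and is not part of the answer above. Note that the second alternative keeps any whitespace after `%` in the match, which is harmless because `float()` ignores surrounding whitespace.

```python
import re

# Same pattern as above: a number before "%" or a number after "%".
RX = re.compile(r'\d+(?:[.,]\d+)*(?=\s*%)|(?<=%)\s*\d+(?:[.,]\d+)*')

def extract_percent(text):
    """Return the first percentage in text as a float, or None if absent."""
    m = RX.search(text)
    if m is None:
        return None
    # float() tolerates the leading whitespace the second alternative may capture.
    return float(m.group().replace(',', '.'))

print(extract_percent('Käuferprovision: 3 % zzgl. gesetzl. MwSt.'))
print(extract_percent('Käuferprovision: Die Courtage i.H.v. % 3,57 inkl. MwSt. ist'))
```

Returning `None` for strings without a percentage lets the caller decide how to handle scraped entries that lack the value.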
You might use an alternation with 2 capture groups, and check which group exists.
\b(\d+(?:\,\d+)?)\s*%|%\s*(\d+(?:\,\d+)?)\b
The pattern matches:
\b - a word boundary
(\d+(?:\,\d+)?)\s*% - capture group 1 matches one or more digits with an optional decimal part, followed by optional whitespace chars and %
| - or
%\s*(\d+(?:\,\d+)?) - the other way around: %, optional whitespace chars, then capture group 2 with the number
\b - a word boundary
For example
import re

regex = r"\b(\d+(?:\,\d+)?)\s*%|%\s*(\d+(?:\,\d+)?)\b"
test_str = ("Käuferprovision: 3 % zzgl. gesetzl. MwSt.\n"
            "Käuferprovision: Die Courtage i.H.v. % 3,57 inkl. MwSt. ist")
matches = re.finditer(regex, test_str, re.MULTILINE)
for matchNum, match in enumerate(matches, start=1):
    # Only one of the two groups is set, depending on which alternative matched.
    if match.group(1):
        print(match.group(1).replace(',', '.'))
    else:
        print(match.group(2).replace(',', '.'))
Output
3
3.57
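Since the question asks for floats, the matched group can be converted once it is known which alternative matched. A minimal sketch of that step; `percent_values` is a hypothetical helper name, and the code relies on exactly one of the two groups being set for each match.

```python
import re

REGEX = re.compile(r"\b(\d+(?:,\d+)?)\s*%|%\s*(\d+(?:,\d+)?)\b")

def percent_values(text):
    """Return every percentage found in text as a float."""
    values = []
    for match in REGEX.finditer(text):
        # Exactly one group is set per match; the other one is None.
        number = match.group(1) or match.group(2)
        values.append(float(number.replace(",", ".")))
    return values

test_str = ("Käuferprovision: 3 % zzgl. gesetzl. MwSt.\n"
            "Käuferprovision: Die Courtage i.H.v. % 3,57 inkl. MwSt. ist")
print(percent_values(test_str))  # [3.0, 3.57]
```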
If the spacing around the percent sign is fixed (exactly one space), you could also use lookarounds to get the number as the full match, without capture groups.
(?<=% )\b\d+(?:,\d+)?\b|\b\d+(?:,\d+)?(?= %)
Example
import re

# The decimal part is optional in both alternatives.
pattern = r"(?<=% )\b\d+(?:,\d+)?\b|\b\d+(?:,\d+)?(?= %)"
test_str = ("Käuferprovision: 3 % zzgl. gesetzl. MwSt.\n"
            "Käuferprovision: Die Courtage i.H.v. % 3,57 inkl. MwSt. ist")
for s in re.findall(pattern, test_str):
    print(s.replace(",", "."))
Output
3
3.57
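The fixed-spacing requirement comes from Python's `re` module, which only accepts look-behind assertions of fixed width. A short sketch illustrating this; the variable-width pattern below exists only to trigger the error and is not part of the answer, and the names `fixed` and `variable_width_ok` are made up for the demonstration.

```python
import re

# A fixed-width look-behind ("% " is always two characters) compiles fine.
fixed = re.compile(r"(?<=% )\d+(?:,\d+)?")
print(fixed.findall("i.H.v. % 3,57 inkl. MwSt."))

# A variable-width look-behind such as "%\s*" is rejected by the re module.
try:
    re.compile(r"(?<=%\s*)\d+(?:,\d+)?")
    variable_width_ok = True
except re.error as exc:
    variable_width_ok = False
    print("re.error:", exc)
```

If the amount of whitespace varies in the scraped data, the capture-group version above is the safer choice.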