I have strings that contain different values (numeric and non-numeric mixed), and I want to extract the values from the text. I could not work out how to cover all (or most) of the possible cases. I have a partially working solution like this,
import re

def extract_values(sentence):
    #sentence = normalizeString(sentence)
    matches = re.findall(r'((\d*\.?\d+(?:\/\d*\.?\d+)?)(?:\s+and\s+(\d*\.?\d+(?:\/\d*\.?\d+)?))?)', sentence)
    # (\d\sto\s\d\s(and\s\d\/\d)*) << for adding 9 to 11, couldn't fix
    result = []
    for x,y,z in matches:
        if '/' in x:
            result.append(x)
        else:
            result.extend(filter(lambda x: x!="", [y,z]))
    return result
Driver code,
extract_values("He is 1 and 1/2 years old. He is .5 years old and he is 5 years old. He is between 9 to 11 or 9 to 9 and 1/2. He was born 11/12/20")
Incorrect answer:
['1 and 1/2', '5', '5', '9', '11', '9', '9 and 1/2', '11/12', '20']
Expected answer:
['1 and 1/2', '.5', '5', '9 to 11', '9 to 9 and 1/2', '11/12/20']
Please note the difference between 5 and .5, and 'x to y' and 'x to y and z'
I would appreciate any help. Thank you.
You can use
import re

def extract_values(sentence):
    num = r'\d*\.?\d+(?:/\d*\.?\d+)*'
    return re.findall(fr'{num}(?:\s+(?:and|to)\s+{num})*', sentence)

print(extract_values("He is 1 and 1/2 years old. He is .5 years old and he is 5 years old. He is between 9 to 11 or 9 to 9 and 1/2. He was born 11/12/20"))
# => ['1 and 1/2', '.5', '5', '9 to 11', '9 to 9 and 1/2', '11/12/20']
See the Python demo, and the regex demo.
Details:
\d*\.?\d+(?:/\d*\.?\d+)* - a float/int number, and then zero or more occurrences of / and a float/int number
(?:\s+(?:and|to)\s+\d*\.?\d+(?:/\d*\.?\d+)*)* - zero or more occurrences of
    \s+(?:and|to)\s+ - and or to enclosed with one or more whitespace characters
    \d*\.?\d+(?:/\d*\.?\d+)* - a float/int number, and then zero or more occurrences of / and a float/int number.
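The breakdown above can also be written into the pattern itself with re.VERBOSE, which may make it easier to maintain. This is only a sketch of the same regex: the names NUM and VALUE_RE are made up for illustration and are not part of the answer's code.

```python
import re

# Hypothetical names (NUM, VALUE_RE) chosen for this sketch only.
# In re.VERBOSE mode, whitespace and '#' comments inside the pattern are ignored.
NUM = r"""
    \d*\.?\d+            # an int or float such as 5, .5 or 1.5
    (?:/\d*\.?\d+)*      # zero or more '/number' parts: 1/2, 11/12/20
"""

VALUE_RE = re.compile(
    rf"""
    {NUM}                    # first number
    (?:                      # zero or more continuations, each made of
        \s+(?:and|to)\s+     #   'and' or 'to' surrounded by whitespace
        {NUM}                #   followed by another number
    )*
    """,
    re.VERBOSE,
)

def extract_values(sentence):
    # The pattern has no capturing groups, so findall returns whole matches.
    return VALUE_RE.findall(sentence)

print(extract_values("He is between 9 to 11 or 9 to 9 and 1/2."))
# => ['9 to 11', '9 to 9 and 1/2']
```

Because every group is non-capturing, re.findall returns the full matched text rather than tuples of groups, which is what the original attempt had to work around.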