How to extract all numeric-like values from a string?

I have strings that contain a mix of numeric and non-numeric values, and I want to extract the numeric values from the text. I could not work out how to cover all (or most) of the possible cases. I have a partially working solution:

import re

def extract_values(sentence):
    #sentence = normalizeString(sentence)
    matches = re.findall(r'((\d*\.?\d+(?:\/\d*\.?\d+)?)(?:\s+and\s+(\d*\.?\d+(?:\/\d*\.?\d+)?))?)', sentence)    
    # (\d\sto\s\d\s(and\s\d\/\d)*) << for adding 9 to 11, couldn't fix

    result = []
    for x,y,z in matches:
        if '/' in x:
            result.append(x)
        else:
            result.extend(v for v in (y, z) if v != "")
    return result

Driver code:

extract_values("He is 1 and 1/2 years old. He is .5 years old and he is 5 years old. He is between 9 to 11 or 9 to 9 and 1/2. He was born 11/12/20")

Incorrect answer:

['1 and 1/2', '5', '5', '9', '11', '9', '9 and 1/2', '11/12', '20']

Expected answer:

['1 and 1/2', '.5', '5', '9 to 11', '9 to 9 and 1/2', '11/12/20']

Please note the difference between 5 and .5, and between 'x to y' and 'x to y and z'.

I would appreciate any help. Thank you.

Asked by Droid-Bird, Jun 26 '26 10:06


1 Answer

You can use:

import re

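def extract_values(sentence):
    # A number, optionally followed by "/number" parts (fractions, dates)
    num = r'\d*\.?\d+(?:/\d*\.?\d+)*'
    # A number, then any further numbers joined by "and" or "to"
    return re.findall(fr'{num}(?:\s+(?:and|to)\s+{num})*', sentence)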

print(extract_values("He is 1 and 1/2 years old. He is .5 years old and he is 5 years old. He is between 9 to 11 or 9 to 9 and 1/2. He was born 11/12/20"))
# => ['1 and 1/2', '.5', '5', '9 to 11', '9 to 9 and 1/2', '11/12/20']

Details:

  • \d*\.?\d+(?:/\d*\.?\d+)* - a float/int number, followed by zero or more occurrences of / and another float/int number (this covers both 1/2 and 11/12/20)
  • (?:\s+(?:and|to)\s+\d*\.?\d+(?:/\d*\.?\d+)*)* - zero or more occurrences of
    • \s+(?:and|to)\s+ - the word and or the word to, surrounded by one or more whitespace characters
    • \d*\.?\d+(?:/\d*\.?\d+)* - a float/int number, followed by zero or more occurrences of / and another float/int number.
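
For readability, the same pattern can also be written with re.VERBOSE so that each part described in the list sits next to its explanation. This is only a sketch of an equivalent layout. The names NUM and PATTERN are illustrative and not part of the answer above.

```python
import re

# One number: optional leading digits, optional dot, required digits,
# then zero or more "/number" parts (fractions such as 1/2, dates such as 11/12/20).
NUM = r"\d*\.?\d+(?:/\d*\.?\d+)*"

# Verbose layout of the same regex; whitespace and "#" comments inside
# the pattern are ignored by the regex engine.
PATTERN = re.compile(
    rf"""
    {NUM}                 # first number
    (?:                   # zero or more continuations:
        \s+(?:and|to)\s+  #   'and' or 'to' surrounded by whitespace
        {NUM}             #   followed by another number
    )*
    """,
    re.VERBOSE,
)

def extract_values(sentence):
    # No capturing groups are used, so findall returns the full matches.
    return PATTERN.findall(sentence)

print(extract_values("He is between 9 to 11 or 9 to 9 and 1/2."))
# => ['9 to 11', '9 to 9 and 1/2']
```

Compiling the pattern once at module level also avoids rebuilding the pattern string on every call, although re caches compiled patterns internally, so the difference is mostly one of readability.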

Answered by Wiktor Stribiżew, Jun 29 '26 01:06


