
regex to match age written in textual format

Tags: python, regex

I am trying to extract a person's age from a sentence; this is a bit simplified, but it's all for a research project. I know that in the sentence the age is always preceded by a colon followed either by zero or more spaces, or by spaces, a few words, and more spaces (example: "character: a lovely eighty year old grandma"; I want a regex that lets me extract 'eighty' from one of the groups). I am using Python's re library, and my code hangs on the example below:

import re

regex_age_string = r'([:]*[ ]*)?((([a-z]*)([ -]*))+)([ -]+)(year)'
regex_age_string = re.compile(regex_age_string, re.DOTALL)
sentence = 'history:   four year-old boy was really sad when he found out the toy was broken'
# This search is the call that hangs on the full sentence.
age_extract_string = re.search(regex_age_string, sentence)
print(age_extract_string.group())
print(age_extract_string.group(2))

However, the code works when I shorten the sentence by cutting a few words off the end. I have read that regex searches can hang because of catastrophic backtracking, but I am not sure how that applies here or how to fix it.

Maria asked Dec 18 '25 22:12


1 Answer

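The reason your regex hangs is catastrophic backtracking. It is caused by a sequence of optional patterns inside a quantified group, (([a-z]*)([ -]*))+: both inner patterns can match an empty string, so the engine can split the same run of letters across the group's iterations in exponentially many ways before it gives up.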
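To see the growth on a small scale, here is a rough sketch that times the pattern from the question on short inputs that cannot match. The helper name time_search and the test strings are assumptions made for illustration, not part of the question. The inputs are kept short on purpose so the script finishes quickly.

```python
import re
import time

# Pattern from the question, unchanged.
slow = re.compile(r'([:]*[ ]*)?((([a-z]*)([ -]*))+)([ -]+)(year)')

def time_search(n):
    """Hypothetical helper: time a failing search on n lowercase letters."""
    text = 'a' * n  # contains no "year", so the search can never succeed
    start = time.perf_counter()
    result = slow.search(text)
    return result, time.perf_counter() - start

for n in (6, 10, 14):
    result, elapsed = time_search(n)
    # result is always None; elapsed should grow quickly as n increases
    print(n, result, f'{elapsed:.4f}s')
```

The exact timings depend on the machine, but each extra letter adds more ways to split the run between iterations, which is why the full sentence appears to hang.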

Instead, you can simply match any letters, whitespace or hyphens between the : and year:

r':\s*([a-z\s-]*?)\s*-*year'

See the regex demo.

Details

  • : - a :
  • \s* - 0+ whitespaces
  • ([a-z\s-]*?) - Group 1: 0+ lowercase ASCII letters, whitespaces or hyphens, as few as possible
  • \s* - 0+ whitespaces
  • -* - 0+ - chars
  • year - the literal substring year.
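
A minimal sketch of how this pattern might be used with the two sentences from the question. The variable names age_pattern, sentences and ages are illustrative assumptions:

```python
import re

# Pattern from the answer; group 1 holds the text between ':' and 'year'.
age_pattern = re.compile(r':\s*([a-z\s-]*?)\s*-*year')

sentences = [
    'history:   four year-old boy was really sad when he found out the toy was broken',
    'character: a lovely eighty year old grandma',
]

ages = []
for sentence in sentences:
    match = age_pattern.search(sentence)
    # search() returns None when nothing matches, so guard before .group()
    ages.append(match.group(1) if match else None)

print(ages)  # ['four', 'a lovely eighty']
```

Note that for the second sentence Group 1 contains all the words before year, not just the age. If the age is always the last word before year, taking the last whitespace-separated token of Group 1 is one possible follow-up step, but that is an assumption about the data.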
Wiktor Stribiżew answered Dec 21 '25 14:12


