
Regular expression to extract number before/after word

Tags: regex, stata

I have 10,000 descriptions and I want to use regular expressions to extract the number associated with the word "arrested" (or "arrests").

For example:

"police arrests 4 people"
"7 people were arrested". 

The numbers range from 1-99.

I have tried the following code:

gen arrest= regexm(description, "(^[1-9][0-9]$)[ ]*(arrests|arrested)")

I cannot simply extract just the number, because the descriptions also mention numbers that have nothing to do with arrests.

serpentina asked Mar 06 '23 02:03


1 Answer

You can use this regex:

(?:([1-9]?[0-9])[a-zA-Z ]{0,20}(?:arrests|arrested))|(?:(?:arrests|arrested)[a-zA-Z ]{0,20}([1-9]?[0-9]))

It splits the search into two alternatives, depending on whether the number comes before or after 'arrests' or 'arrested'.

In the first alternative, a capturing group matches an optional digit from 1-9 followed by a digit from 0-9, that is, a one- or two-digit number. This is followed by 0 to 20 letters or spaces (the other words) and then 'arrests' or 'arrested'. The second alternative handles the opposite situation, where the number comes last, and captures it in a second group. The outer (?:...) groups are non-capturing and only serve to group each alternative.

This matches only if the number is within 20 characters of 'arrests' or 'arrested' and nothing but letters and spaces lies in between.
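As a quick check of the pattern, here is a minimal sketch in Python's re module (not Stata, which the question asks about). The helper name arrest_count and the sample sentences beyond the two in the question are made up for illustration. In Stata, the pattern would presumably need the Unicode regex functions such as ustrregexm, since it uses non-capturing groups; that is an assumption and is not tested here.

```python
import re

# Pattern copied from the answer: group 1 holds a number that precedes the
# keyword, group 2 holds a number that follows it.
PATTERN = re.compile(
    r"(?:([1-9]?[0-9])[a-zA-Z ]{0,20}(?:arrests|arrested))"
    r"|(?:(?:arrests|arrested)[a-zA-Z ]{0,20}([1-9]?[0-9]))"
)


def arrest_count(description):
    """Return the number tied to 'arrests'/'arrested', or None if no match."""
    match = PATTERN.search(description)
    if match is None:
        return None
    # Only one alternative can match, so exactly one group is set.
    return int(match.group(1) or match.group(2))


if __name__ == "__main__":
    for text in [
        "police arrests 4 people",
        "7 people were arrested",
        "2 shops looted, police arrested 12 people",  # hypothetical sample
        "3 cars damaged, nobody hurt",                # hypothetical sample
    ]:
        print(text, "->", arrest_count(text))
```

Because only one alternative can match, the result is read from whichever of the two capturing groups is set. Note that the pattern also accepts 0 as a number, which is harmless if the values really range from 1 to 99.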

Poul Bak answered Mar 13 '23 03:03