I am using regex to extract acronyms (only specific types) from text in Python.
So far I am using:
import re
text = "My name is STEVE. My friend works at (I.A.). Indian Army(IA). B&W also B&&W Also I...A"
re.findall('\\b[A-Z][A-Z.&]{2,7}\\b', text)
The output is: ['STEVE', 'I.A', 'B&W', 'B&&W', 'I...A']
I want to exclude B&&W and I...A, but include (IA).
I am aware of the links below, but I have been unable to apply them correctly. Kindly help.
Extract acronyms patterns from string using regex
Finding Acronyms Using Regex In Python
RegEx to match acronyms
What you want is a capital followed by a bunch of capitals, with optional dots or ampersands in between.
re.findall('\\b[A-Z](?:[\\.&]?[A-Z]){1,7}\\b', text)
Breaking it down:
\b        word boundary
[A-Z]     capital
(?:       opening a non-capturing group
[\.&]     character class containing . and &
?         optional
[A-Z]     followed by another capital
)         closing non-capturing group of an optional . or &, followed by a capital
{1,7}     repeating that group 1 - 7 times
\b        word boundary

We want a non-capturing group since re.findall returns groups (if present).
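
To make the point about groups concrete, here is a small sketch that runs the pattern above on the sample text from the question, once with a non-capturing group and once with a capturing one. The variable names are only for illustration.

```python
import re

text = "My name is STEVE. My friend works at (I.A.). Indian Army(IA). B&W also B&&W Also I...A"

# Non-capturing group: findall returns each full match.
non_capturing = re.findall(r"\b[A-Z](?:[.&]?[A-Z]){1,7}\b", text)

# Capturing group: findall returns only the group's content,
# which for a repeated group is its last repetition.
capturing = re.findall(r"\b[A-Z]([.&]?[A-Z]){1,7}\b", text)

print(non_capturing)  # ['STEVE', 'I.A', 'IA', 'B&W']
print(capturing)      # ['E', '.A', 'A', '&W']
```

With the capturing version you only get the tail of each acronym back, which is why the non-capturing form is used here.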
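Note that [A-Z] only covers ASCII capitals. There are better ways of matching capitals that work across all Unicode characters, for example the \p{Lu} property class offered by the third-party regex module (the built-in re module does not support \p{...}).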
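
If you would rather stay within the standard library, one possible workaround is to build the character class yourself from the Unicode category "Lu" (uppercase letter). This is only a sketch: the names UPPER, UPPER_CLASS and UNICODE_ACRONYM are made up for this example, and building the class scans every code point once at import time.

```python
import re
import sys
import unicodedata

# Collect every code point whose Unicode category is "Lu" (uppercase letter).
UPPER = "".join(
    chr(cp)
    for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)) == "Lu"
)
UPPER_CLASS = "[" + re.escape(UPPER) + "]"

# Same shape as the ASCII pattern, with [A-Z] swapped for the Unicode class.
UNICODE_ACRONYM = re.compile(
    r"\b" + UPPER_CLASS + r"(?:[.&]?" + UPPER_CLASS + r"){1,7}\b"
)

print(UNICODE_ACRONYM.findall("Visit the ÉCOLE or the Ö.B.B today"))
```

The resulting class is long (well over a thousand characters), so compile it once and reuse the compiled pattern rather than rebuilding it per call.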
The expression from this answer does match B&WW and B&W.W, since we do not enforce the use of the (same) character every time. If you want that, the expression gets a bit more complex (though not much).
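
As a rough sketch of what that stricter expression could look like: the version below assumes "strict" means either no separators at all, or the same separator between every pair of capitals. It uses a named group with a backreference for the separator, and replaces the leading \b with a lookbehind so that leftovers such as the WW in B&WW are not picked up on their own. STRICT_ACRONYM and strict_acronyms are names invented for this example.

```python
import re

STRICT_ACRONYM = re.compile(
    r"""
    (?<![\w.&])                   # not preceded by a word character or separator
    [A-Z]                         # first capital
    (?:
        (?P<sep>[.&])[A-Z]        # a separator and a capital ...
        (?:(?P=sep)[A-Z]){0,6}    # ... then the same separator every time
      |
        [A-Z]{1,7}                # or plain capitals without any separator
    )
    \b
    (?![.&][A-Z])                 # not continued by another separator + capital
    """,
    re.VERBOSE,
)

def strict_acronyms(text):
    # The pattern contains a capturing (named) group, so findall would
    # return the separators; finditer with group(0) gives the full matches.
    return [m.group(0) for m in STRICT_ACRONYM.finditer(text)]

text = "My name is STEVE. My friend works at (I.A.). Indian Army(IA). B&W also B&&W Also I...A"
print(strict_acronyms(text))                # ['STEVE', 'I.A', 'IA', 'B&W']
print(strict_acronyms("B&WW B&W.W A.B.C"))  # ['A.B.C']
```

Whether B&WW should be rejected outright or partly matched is a design choice; this sketch rejects it entirely.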