
Regex to extract acronyms

I am using a regex to extract acronyms (only specific types) from text in Python.

  • ABC (all caps within round brackets or square brackets or between word endings)
  • A.B.C (same as above but having only one '.' in between)
  • A&B&C (same as above but having only one '&' in between)

So far I am using

import re

text = "My name is STEVE. My friend works at (I.A.). Indian Army(IA). B&W also B&&W Also I...A"
re.findall('\\b[A-Z][A-Z.&]{2,7}\\b', text)

Output is: ['STEVE', 'I.A', 'B&W', 'B&&W', 'I...A']
I want to exclude B&&W and I...A, but include (IA).

I am aware of the below links but I am unable to use them correctly. Kindly help.

Extract acronyms patterns from string using regex

Finding Acronyms Using Regex In Python

RegEx to match acronyms

Prince asked Nov 05 '18 06:11


1 Answer

What you want is a capital letter followed by one or more further capitals, each optionally preceded by a single dot or ampersand.

re.findall('\\b[A-Z](?:[\\.&]?[A-Z]){1,7}\\b', text)

Breaking it down:

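  • All backslashes are doubled because the pattern is written as a regular (non-raw) Python string literal, where a backslash has to be escaped; a raw string such as r'\b' avoids the doubling
  • \b word boundary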
  • [A-Z] capital
  • (?: opening a non-capturing group
  • [\.&] character class containing . and &
  • ? optional
  • [A-Z] followed by another capital
  • ) closing non-capturing group of an optional . or &, followed by a capital
  • {1,7} repeating that group 1 to 7 times
  • \b word boundary
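
To check the breakdown against the sample text from the question, here is a minimal sketch. It assumes the same pattern rewritten as a raw string (so the backslashes are not doubled); the names ACRONYM and matches are only illustrative.

```python
import re

# Sample text from the question.
text = ("My name is STEVE. My friend works at (I.A.). Indian Army(IA). "
        "B&W also B&&W Also I...A")

# Same pattern as above, written as a raw string so backslashes stay single.
ACRONYM = re.compile(r"\b[A-Z](?:[.&]?[A-Z]){1,7}\b")

matches = ACRONYM.findall(text)
print(matches)  # ['STEVE', 'I.A', 'IA', 'B&W']
```

B&&W and I...A are no longer matched, and IA is now picked up. The trailing dot of (I.A.) is not part of the match, because the group always ends on a capital.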

We want a non-capturing group because re.findall returns the contents of the capturing groups, rather than the whole match, whenever the pattern contains any.
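
The difference is easy to see on a short string. This is a small sketch with a made-up input; the variable names captured and whole are illustrative only.

```python
import re

text = "B&W and I.A"

# Capturing group: findall returns only what the group captured last.
captured = re.findall(r"\b[A-Z]([.&]?[A-Z]){1,7}\b", text)
print(captured)  # ['&W', '.A']

# Non-capturing group: findall returns the whole match.
whole = re.findall(r"\b[A-Z](?:[.&]?[A-Z]){1,7}\b", text)
print(whole)     # ['B&W', 'I.A']
```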

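There are better ways of matching capitals that work across all Unicode characters, not just A-Z; for example, the third-party regex module supports the Unicode property \p{Lu} (uppercase letter).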

Note that this still matches B&WW and B&W.W, since nothing requires a separator between every pair of letters, or the same separator throughout. If you want that, the expression gets a bit more complex (though not much).
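
One possible way to enforce this is sketched below, not necessarily the expression the answer has in mind. It assumes that "enforcing" means either no separators at all or the same separator between every pair of capitals, and it works in two steps: find candidates with the loose pattern, then keep only those that fully match a stricter pattern using a backreference. The names LOOSE, STRICT and strict_acronyms are hypothetical.

```python
import re

# Loose pattern from the answer: finds candidates, mixed separators included.
LOOSE = re.compile(r"\b[A-Z](?:[.&]?[A-Z]){1,7}\b")

# Stricter check (an assumption about what "enforcing" should mean): either
# no separators at all, or the same separator between every pair of capitals.
# \1 refers back to whichever separator the first group matched.
STRICT = re.compile(r"[A-Z]{2,}|[A-Z]([.&])[A-Z](?:\1[A-Z])*")

def strict_acronyms(text):
    """Return loose matches that also pass the stricter separator check."""
    return [m.group(0) for m in LOOSE.finditer(text)
            if STRICT.fullmatch(m.group(0))]

sample = "ABC A.B.C A&B&C B&W B&WW B&W.W"
print(LOOSE.findall(sample))    # ['ABC', 'A.B.C', 'A&B&C', 'B&W', 'B&WW', 'B&W.W']
print(strict_acronyms(sample))  # ['ABC', 'A.B.C', 'A&B&C', 'B&W']
```

Filtering whole candidates, instead of tightening the search pattern itself, avoids returning a fragment such as B&W out of B&W.W.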

SQB answered Oct 09 '22 07:10