
Regular Expression for Acronyms

Tags:

regex

I am currently trying to extract acronyms from a bunch of documents.

Say a document contains "Static application security testing (SAST)"

So I am trying to create a regex that picks out this kind of string. It should probably express something like

"a number of words whose initial letters are later repeated in the parentheses."

Unfortunately my regex skills are not good enough to formulate this. Do you folks think it can be done with a regex at all, or do I need something more powerful, like a CFG-based parser?

er4z0r asked Jan 04 '11 12:01


2 Answers

Try this (for 2 letter acronyms):

\b(\w)\w+\s+\b(\w)\w+\s+\(\1\2\)

This for 3 letter acronyms:

\b(\w)\w+\s+\b(\w)\w+\s+\b(\w)\w+\s+\(\1\2\3\)

This for 4 letter acronyms:

\b(\w)\w+\s+\b(\w)\w+\s+\b(\w)\w+\s+\b(\w)\w+\s+\(\1\2\3\4\)

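Please note that the regex needs to be case insensitive, because the captured initials are usually lowercase in the text while the acronym in the parentheses is uppercase.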
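As a rough illustration of the patterns above, here is a minimal Python sketch that builds the same regex for an n-letter acronym and applies it case-insensitively. The helper name `acronym_pattern` and the use of Python's `re` module are assumptions made for this example; they are not part of the original answer.

```python
import re

def acronym_pattern(n):
    """Build the pattern shown above for an n-letter acronym (hypothetical helper)."""
    # One "\b(\w)\w+\s+" per word: capture the initial, skip the rest of the word.
    words = r"\b(\w)\w+\s+" * n
    # Backreferences \1\2...\n: the captured initials, in order.
    backrefs = "".join("\\%d" % i for i in range(1, n + 1))
    # IGNORECASE lets a lowercase initial match its uppercase letter in the acronym.
    return re.compile(words + r"\(" + backrefs + r"\)", re.IGNORECASE)

text = "Static application security testing (SAST)"
match = acronym_pattern(4).search(text)
if match:
    print(match.group(0))
```

Because the number of words is fixed by the pattern, a document with acronyms of unknown length would need one attempt per candidate length.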

BTW the Regex Coach is a nice tool for trying out stuff like this.

Helge Klein answered Nov 25 '22 12:11


Here are two Perl solutions. The first one goes word by word, building an array from the first letter of every word, then removes the acronym formed by those letters. It's fairly fragile and should fail if the line contains anything besides the words and their acronym. It also uses the (??{}) construct to insert the acronym into the regex, which makes me queasy:

use strict;
use warnings;
use 5.010;

$_ = "Static application security testing (SAST)";

my @first;
s/
   \b
    (?<first>\p{L})\p{L}*
   \b
(?{ push @first, $+{first} })
  \K \s+ \(
    (??{ join '', map { uc } @first; })
    \)
//gx;

say;

Meanwhile, this solution first checks for something that looks like an acronym, then constructs a regex that matches as many words as necessary:

$_ = "Static application security testing (SAST)";

my ($possible_acronym) = /\((\p{Lu}+)\)/;
my $regex = join '', map({ qr/\b(?i:$_)\p{L}*\b\s*?/ } split //, $possible_acronym), qr/\K\Q($possible_acronym)/;
s/$regex//;

say;
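
For comparison, the same two-step idea (find a candidate acronym first, then build a regex from its letters) might look like this in Python. This is only a sketch under a few assumptions: the function name `find_expansion` is made up for the example, `[A-Z]` stands in for Perl's `\p{Lu}`, and it returns the acronym with its expansion instead of deleting the acronym from the string as the Perl code does.

```python
import re

def find_expansion(text):
    """Return (acronym, expansion) for the first parenthesised acronym, or None."""
    # Step 1: look for something that looks like an acronym, e.g. "(SAST)".
    candidate = re.search(r"\(([A-Z]+)\)", text)
    if candidate is None:
        return None
    acronym = candidate.group(1)

    # Step 2: build one "word starting with this letter" piece per acronym letter.
    words = r"\s+".join(r"\b" + re.escape(c) + r"\w*" for c in acronym)
    pattern = re.compile(
        "(" + words + r")\s*\(" + re.escape(acronym) + r"\)",
        re.IGNORECASE,
    )
    full = pattern.search(text)
    if full is None:
        return None
    return acronym, full.group(1)

print(find_expansion("We use Static application security testing (SAST) daily."))
```

Unlike a fixed-length pattern, this version adapts to however many letters the acronym has, and it tolerates other text around the phrase.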

(I tried making a solution using (?(DEFINE)) patterns, such as tchrist's answer here, but failed miserably. Oh well.)

For more about (?:), named captures (?<name>...), \K, and a whole bunch of swell stuff, perlre is the answer.

Hugmeir answered Nov 25 '22 10:11