I am currently trying to extract acronyms from a bunch of documents.
Say a document contains "Static application security testing (SAST)".
So I am trying to create a regex that matches this kind of string. It should probably be something like
"a number of words whose initial letters are later repeated in the parentheses."
Unfortunately my regex skills are not good enough to formulate this. Do you folks think it can be done via regex at all, or do I need something more powerful like a CFG-based parser?
Try this (for 2 letter acronyms):
\b(\w)\w+\s+\b(\w)\w+\s+\(\1\2\)
This for 3 letter acronyms:
\b(\w)\w+\s+\b(\w)\w+\s+\b(\w)\w+\s+\(\1\2\3\)
This for 4 letter acronyms:
\b(\w)\w+\s+\b(\w)\w+\s+\b(\w)\w+\s+\b(\w)\w+\s+\(\1\2\3\4\)
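Please note that the regex needs to be case insensitive, because the captured initials ("S", "a", "s", "t") have to match the upper-case letters in the parentheses.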
BTW the Regex Coach is a nice tool for trying out stuff like this.
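If you want to try the patterns above outside a regex tester, here is a minimal Python sketch. It is not part of the original answer. The list name PATTERNS and the function find_acronym_phrases are made up for illustration, and it assumes Python's re module, where backreferences are compared case-insensitively under re.IGNORECASE.

```python
import re

# The two-, three- and four-letter patterns from the answer, unchanged.
PATTERNS = [
    r"\b(\w)\w+\s+\b(\w)\w+\s+\(\1\2\)",
    r"\b(\w)\w+\s+\b(\w)\w+\s+\b(\w)\w+\s+\(\1\2\3\)",
    r"\b(\w)\w+\s+\b(\w)\w+\s+\b(\w)\w+\s+\b(\w)\w+\s+\(\1\2\3\4\)",
]

def find_acronym_phrases(text):
    """Return every substring matched by one of the fixed-length patterns.

    Hypothetical helper: tries each pattern in turn, case-insensitively.
    """
    found = []
    for pattern in PATTERNS:
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            found.append(match.group(0))
    return found

if __name__ == "__main__":
    sample = "We use Static application security testing (SAST) in CI."
    print(find_acronym_phrases(sample))
```

Note that each pattern uses \w+ after the captured initial, so every word has to be at least two characters long, and you need one pattern per acronym length.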
Here are two Perl solutions. The first one goes word by word, building an array from the first letter of every word, then removes the acronym formed by those letters. It's fairly weak, and should fail if a line contains anything more than the phrase and its acronym, since every word on the line contributes a letter. It also makes use of the (??{}) pattern to insert the acronym into the regex, which makes me queasy:
use strict;
use warnings;
use 5.010;
$_ = "Static application security testing (SAST)";
my @first;
s/
\b
(?<first>\p{L})\p{L}*
\b
(?{ push @first, $+{first} })
\K \s+ \(
(??{ join '', map { uc } @first; })
\)
//gx;
say;
Meanwhile, this solution first checks for something that looks like an acronym, then constructs a regex to match as many words as necessary:
$_ = "Static application security testing (SAST)";
my ($possible_acronym) = /\((\p{Lu}+)\)/;
my $regex = join '', map({ qr/\b(?i:$_)\p{L}*\b\s*?/ } split //, $possible_acronym), qr/\K\Q($possible_acronym)/;
s/$regex//;
say;
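For readers who don't use Perl, the same idea (find the candidate in parentheses first, then build one word pattern per letter) can be sketched in Python. This is a rough translation, not the code above: the function name find_expansions is made up, it returns the expansion instead of deleting the acronym, it requires whitespace between the words, and it assumes ASCII upper-case acronyms ([A-Z]) where the Perl version uses \p{Lu}.

```python
import re

def find_expansions(text):
    """Return (acronym, phrase) pairs for each "(ACRONYM)" whose letters
    match the initials of a run of words directly before it.

    Hypothetical helper for illustration only.
    """
    results = []
    for candidate in re.finditer(r"\(([A-Z]+)\)", text):
        acronym = candidate.group(1)
        # One word per acronym letter: the letter in either case,
        # followed by any further letters ([^\W\d_] = Unicode letters).
        words = r"\s+".join(
            r"\b(?i:" + letter + r")[^\W\d_]*" for letter in acronym
        )
        # The word run must be followed by the literal "(ACRONYM)".
        pattern = "(" + words + r")\s*" + re.escape(candidate.group(0))
        match = re.search(pattern, text)
        if match:
            results.append((acronym, match.group(1)))
    return results

if __name__ == "__main__":
    print(find_expansions("Static application security testing (SAST)"))
```

Because the pattern is generated from the acronym itself, this handles any acronym length without writing a separate regex for each one.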
(I tried making a solution using (?(DEFINE)) patterns, such as tchrist's answer here, but failed miserably. Oh well.)
For more about (?:), named captures (?<name>...), \K, and a whole bunch of swell stuff, perlre is the answer.