What perl regex matches a "word" in the following filename?
I have a series of file names in which some words appear more than once:
john_smith_on_alaska_trip_john_smith_0001.jpg
His wife's name is Olga, with an umlaut over the o, and there are a few other names with diacritics; all lower case, in my situation, but not simply English a-z. The .jpg has been temporarily stripped off for other reasons, and may be ignored for this discussion.
I want to remove the duplicate names/words. Something like this works fine in emacs:
s/(\b\w{3,}\b)(.*)(\b\1\b)/\1\2/
Run it once, the above turns to: john_smith_on_alaska_trip__smith_0001.jpg
Again: john_smith_on_alaska_trip___0001.jpg
In Perl, this does not work because \w includes the _ as a word character.
Worse yet - the anchor, \b is anything other than those characters, and therefore doesn't separate on _.
My current solution is to replace all _ with , do the deed, and revert. But, this seems such a fundamental requirement, I feel I must be missing something.
Thank you.
Use the Character Class \p{Alpha} and Lookbehind and Lookahead assertions in place of word boundaries to ensure that the each word is a whole word instead of a substring:
use strict;
use warnings;
my $file = "john_smith_on_alaska_trip_john_smith_0001_johnsmith.jpg";
1 while $file =~ s{
(?<!\p{Alpha}) ( \p{Alpha}++ ) # Word surrounded by non-word chars
.* \K # Keep everything before this point
(?<!\p{Alpha}) \1 (?!\p{Alpha}) # Strip duplicate word
}{}x;
print "$file\n";
Outputs:
john_smith_on_alaska_trip___0001_johnsmith.jpg
Live Demo
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With