Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How do I recognize a "text" word via regex?

Tags:

regex

perl

What perl regex matches a "word" in the following filename?

I have a series of file names in which some words appear more than once:

john_smith_on_alaska_trip_john_smith_0001.jpg

His wife's name is Olga, with an umlaut over the o, and there are a few other names with diacritics; all lower case, in my situation, but not simply English a-z. The .jpg has been temporarily stripped off for other reasons, and may be ignored for this discussion.

I want to remove the duplicate names/words. Something like this works fine in emacs:

s/(\b\w{3,}\b)(.*)(\b\1\b)/\1\2/

Run it once, the above turns to: john_smith_on_alaska_trip__smith_0001.jpg

Again: john_smith_on_alaska_trip___0001.jpg

In Perl, this does not work because \w includes the _ as a word character. Worse yet - the anchor, \b is anything other than those characters, and therefore doesn't separate on _.

My current solution is to replace all _ with , do the deed, and revert. But, this seems such a fundamental requirement, I feel I must be missing something.

Thank you.

like image 702
Matthew Miner Avatar asked Nov 27 '25 04:11

Matthew Miner


1 Answers

Use the Character Class \p{Alpha} and Lookbehind and Lookahead assertions in place of word boundaries to ensure that the each word is a whole word instead of a substring:

use strict;
use warnings;

my $file = "john_smith_on_alaska_trip_john_smith_0001_johnsmith.jpg";

1 while $file =~ s{
    (?<!\p{Alpha}) ( \p{Alpha}++ )     # Word surrounded by non-word chars
    .* \K                              # Keep everything before this point
    (?<!\p{Alpha}) \1 (?!\p{Alpha})    # Strip duplicate word 
}{}x;

print "$file\n";

Outputs:

john_smith_on_alaska_trip___0001_johnsmith.jpg

Live Demo

like image 91
Miller Avatar answered Nov 28 '25 20:11

Miller