How do I recognize a "text" word via regex?

Question

What perl regex matches a "word" in the following filename?

I have a series of file names in which some words appear more than once:

john_smith_on_alaska_trip_john_smith_0001.jpg

His wife's name is Olga, with an umlaut over the o, and there are a few other names with diacritics; all lower case, in my situation, but not simply English a-z. The .jpg has been temporarily stripped off for other reasons, and may be ignored for this discussion.

I want to remove the duplicate names/words. Something like this works fine in emacs:

s/(\b\w{3,}\b)(.*)(\b\1\b)/\1\2/

Run it once, the above turns to: john_smith_on_alaska_trip__smith_0001.jpg

Again: john_smith_on_alaska_trip___0001.jpg

In Perl, this does not work because \w includes the _ as a word character. Worse yet - the anchor, \b is anything other than those characters, and therefore doesn't separate on _.

My current solution is to replace all _ with , do the deed, and revert. But, this seems such a fundamental requirement, I feel I must be missing something.

Thank you.

Miller · Accepted Answer

Use the Character Class \p{Alpha} and Lookbehind and Lookahead assertions in place of word boundaries to ensure that the each word is a whole word instead of a substring:

use strict;
use warnings;

my $file = "john_smith_on_alaska_trip_john_smith_0001_johnsmith.jpg";

1 while $file =~ s{
    (?<!\p{Alpha}) ( \p{Alpha}++ )     # Word surrounded by non-word chars
    .* \K                              # Keep everything before this point
    (?<!\p{Alpha}) \1 (?!\p{Alpha})    # Strip duplicate word 
}{}x;

print "$file
";

Outputs:

john_smith_on_alaska_trip___0001_johnsmith.jpg

Live Demo

How do I recognize a "text" word via regex?

Tags:

regex

perl

Matthew Miner

1 Answers

Miller

Recent Activity

Donate For Us

How do I recognize a "text" word via regex?

Tags:

regex

perl

Matthew Miner

1 Answers

Miller

Related questions

Recent Activity

Donate For Us