
Can I universally eliminate the underscore from the set of word characters in Perl regexes?

Tags: regex, perl

I know that I can use [a-zA-Z0-9] or the POSIX class [[:alnum:]]. But I want to parse a lot of LaTeX macros, whose names do not allow '_' (or digits), and spelling the class out everywhere becomes tedious very quickly, especially because I want to use the \b assertion a lot. The question title mentions only the underscore, but this is really a more general question.

For example:

my $FOUNDNUM=(s/\\$known\b/\\$xltd{$known}/g);

Is it possible to change the set of characters in the word class once and for all?

I think the answer is no (I could not find a pragma or special variable), but I wanted to double check.

EDIT: Clarification:

my $b=qr/(?<![^a-zA-Z])/;

my $v= "Hi 1 Hi aHi Hia Hi123 Hi_3 _Hi_";

print "     In:\t'$v'\n";
print "Desired:\t'** 1 ** aHi Hia **123 **_3 _**_\n\n";
$_ = $v; print "".(s/([^a-zA-Z])Hi([^a-zA-Z])/$1**$2/g)." times to:\t'$_'\n";
$_ = $v; print "".(s/\bHi\b/**/g)." times to:\t'$_'\n";
$_ = $v; print "".(s/${b}Hi${b}/**/g)." times to:\t'$_'\n";

yields

     In:        'Hi 1 Hi aHi Hia Hi123 Hi_3 _Hi_'
Desired:        '** 1 ** aHi Hia **123 **_3 _**_

4 times to:     'Hi 1 ** aHi Hia **123 **_3 _**_'
2 times to:     '** 1 ** aHi Hia Hi123 Hi_3 _Hi_'
2 times to:     '** 1 Hi a** Hia Hi123 Hi_3 _Hi_'

The first pattern almost works, but it misses the match at the start of the string, and it requires me to use $1 and $2 and to spell out the character class each time.

The second pattern would have worked, except that \b treats the underscore (and digits) as word characters. Nicely, it does work at the start of the string.

The third pattern was an attempt to store a regex in a variable as an abbreviation, but it obviously failed.
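
For what it's worth, the third pattern seems to fail because of a double negative: (?<![^a-zA-Z]) asserts "not preceded by a non-letter", which holds at the start of the string or right after a letter. A minimal sketch comparing it with the single-negative form (the variable names and test strings are made up for illustration):

```perl
use strict;
use warnings;

# "Not preceded by a non-letter": true at string start or after a letter.
my $double_negative = qr/(?<![^a-zA-Z])/;

# "Not preceded by a letter": the letter-only analogue of the left side of \b.
my $single_negative = qr/(?<![a-zA-Z])/;

my %seen;   # test string => [double-negative result, single-negative result]
for my $s ('Hi', ' Hi', 'aHi', '_Hi', '1Hi') {
    my $d = ($s =~ /${double_negative}Hi/) ? 1 : 0;
    my $n = ($s =~ /${single_negative}Hi/) ? 1 : 0;
    $seen{$s} = [$d, $n];
    printf "%-6s double negative: %d   single negative: %d\n", "'$s'", $d, $n;
}
```

Only the bare 'Hi' passes both checks. 'aHi' passes only the double-negative form, which is where the a** in the output above comes from.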

ivo Welch asked Nov 11 '22 14:11


1 Answer

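The best solution comes from CasimiretHippolyte (thank you!). While it is not possible to redefine '\b', we can define regexes up front for zero-length assertions, one anchoring at the start and one anchoring at the end. The class [^\W_\d] matches a word character that is neither an underscore nor a digit, i.e. a letter, so the two assertions mean "not preceded by a letter" and "not followed by a letter".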

my $b1=qr/(?<![^\W_\d])/;
my $b2=qr/(?![^\W_\d])/;

my $v= "Hi 1 Hi aHi Hia Hi123 Hi_3 _Hi_ 3Hi";

print "     In:\t'$v'\n";
print "Desired:\t'** 1 ** aHi Hia **123 **_3 _**_ 3**'\n\n";
$_ = $v; print "".(s/${b1}Hi${b2}/**/g)." times to:\t'$_'\n";
$_ = $v; print "".(s/(?<![^\W_\d])Hi(?![^\W_\d])/**/g)." times to:\t'$_'\n";
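
Applied to the LaTeX case from the question, the same idea might look like the sketch below. The %xltd contents, the translate_macros helper, and the sample input are hypothetical. It also assumes macro names are plain ASCII letters, so (?![A-Za-z]) stands in for (?![^\W_\d]). Only a right-hand boundary is used, because the leading backslash already marks where the macro name starts.

```perl
use strict;
use warnings;

# Right-hand "letter boundary": the next character must not be a letter.
# Assumption: macro names consist of ASCII letters only.
my $rb = qr/(?![A-Za-z])/;

# Hypothetical translation table: old macro name => new macro name.
my %xltd = (alpha => 'Alpha', beta => 'Beta');

# Hypothetical helper: rewrites every known macro in $text and
# returns the new text together with the number of substitutions.
sub translate_macros {
    my ($text) = @_;
    my $count = 0;
    for my $known (sort keys %xltd) {
        my $name = quotemeta $known;   # harmless for letter-only names
        $count += ($text =~ s/\\$name$rb/\\$xltd{$known}/g);
    }
    return ($text, $count);
}

my ($out, $n) = translate_macros('\alpha_1 + \alphabet + \beta2 + \beta');
print "$n replacements: $out\n";
```

With this input, \alpha_1, \beta2 and the final \beta are rewritten while \alphabet is left alone. A plain \b after the name would have skipped \alpha_1 and \beta2, since '_' and digits count as word characters.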
ivo Welch answered Nov 15 '22 12:11