
$ and Perl's global regular expression modifier

Tags:

regex

perl

I finally figured out how to append text to the end of each line in a file:

perl -pe 's/$/addthis/' myfile.txt

However, as I'm trying to learn Perl for frequent regex use, I can't figure out why the following perl command adds the text 'addthis' to both the end and the start of each line:

perl -pe 's/$/addthis/g' myfile.txt

I thought that '$' matched the end of a line no matter what modifier was used for the regex match, but I guess this is wrong?

drapkin11 asked Feb 18 '13 16:02



2 Answers

Summary: For what you're doing, drop the /g so it only matches before the newline. The /g is telling it to match before the newline and at the end of the string (after the newline).

Without the /m modifier, $ will match either before a newline (if it occurs at the end of the string) or at the end of the string. For instance, with both "foo" and "foo\n", the $ would match after foo. With "foo\nbar", though, it would match after bar, because the embedded newline isn't at the end of the string.

With the /g modifier, you're getting all the places that $ would match -- so

s/$/X/g;

would take a line like "foo\n" and turn it into "fooX\nX".

Sidebar: The /m modifier will also allow $ to match just before newlines that occur earlier in the string, not only at its end, so that

s/$/X/mg;

would convert "foo\nbar\n" into "fooX\nbarX\nX".

Jim Davis answered Nov 03 '22 01:11


As Jim Davis pointed out, $ matches either at the end of the string or just before a newline at the end of the string (and, with the /m option, before any embedded newline as well). (See the Regular Expressions section of the perlre Perldoc page.) Using the g modifier allowed it to continue matching after the first match.

Multi-line Perl regular expressions (i.e., Perl regular expressions applied to text with the newline character in it, even if it only occurs once at the end of the line) cause all sorts of complications that most Perl programmers have trouble handling.

  • If you're reading in a file one line at a time, always use chomp before doing ANYTHING with that line. This would have solved your issue when using the g modifier.

  • Further issues can happen if you're reading files on Linux/Mac which came from Windows. In that case, each line ends with both the \r and \n characters. As I found out recently while debugging a program, the \r character isn't removed by chomp. I now make sure I always open my text files for reading like this:

open my $file_handle, "<:crlf", $file...

This will automatically substitute the \r\n characters with just \n if this is in fact a Windows file on a Linux/Mac system. If this is a regular Linux/Mac text file, it will do nothing. The other obvious solution is not to use Windows (rim shot!).

Of course, in your case, using chomp first would have done the following:

$ cat file
line one
line two
line three
line four
$ perl -pe 'chomp;s/$/addthis::/g' file
line oneaddthis::line twoaddthis::line threeaddthis::line fouraddthis::

The chomp removed the \n, so now you don't see it when the lines print out. Hmm...

$ perl -pe 'chomp; s/$/addthis/g; $_ .= "\n"' file
line oneaddthis
line twoaddthis
line threeaddthis
line fouraddthis

That works! And your one-liner is only mildly incomprehensible.


The other thing is to take a more modern approach that Damian Conway recommends in Chapter 12 of his book Perl Best Practices:

Use \A and \z as string boundary anchors.

Even if you don’t adopt the previous practice of always using /m, using ^ and $ with their default meanings is a bad idea. Sure, you know what ^ and $ actually mean in a Perl regex[1]. But will those who read or maintain your code know? Or is it more likely that they will misinterpret those metacharacters in the ways described earlier? Perl provides markers that always—and unambiguously—mean “start of string” and “end of string”: \A and \z (capital A, but lowercase z). They mean “start/end of string” regardless of whether /m is active. They mean “start/end of string” regardless of what the reader thinks ^ and $ mean.

If you followed Conway's advice, and did this:

perl -pe 's/\z/addthis/mg' myfile.txt

You would see that your phrase addthis got added only to the end of each and every line:

$ cat file
line one
line two
line three
line four
$ perl -pe 's/\z/addthis/mg' file
line one
addthisline two
addthisline three
addthisline four
addthis

See how well that works. That addthis was added to the very end of each line! ...Right after the \n character on that line.

Enough fun and back to work. (Wait, it's President's Day. It's a paid holiday. No work today except of course all that stuff I promised to have done by Tuesday morning).

Hope this helped you understand how much fun regular expressions are and why so many people have decided to learn Python.


1. Know what ^ and $ really mean in Perl? Uh, yes of course I do. I've been programming in Perl for a few decades. Yup, I know all this stuff. (Note to self: $ apparently doesn't mean what I always thought it meant.)

David W. answered Nov 02 '22 23:11