
What is this regex substitution "$content =~ s/\n-- \n.*?$//s" actually doing?

Tags: regex, perl, rt

I am working through some Perl code in Request Tracker 4.0 and have encountered an error where a ticket requestor's message is cut off. I am new to Perl. I have done some work with regular expressions, but I'm having trouble with this one even after reading quite a bit.

I have narrowed my problem down to this line of code:

$content =~ s/\n-- \n.*?$//s

I don't fully understand what it is doing and would like a better explanation.

I understand that s/ / is matching the pattern \n-- \n.*?$ and replacing it with nothing.

I don't understand what .*?$ does. Here is my basic understanding:

  • . is any character except \n
  • * is 0 or more times of the preceding character
  • ? is 0 or 1 times of the preceding character
  • $ is the end of the string

Then, from what I understand, the final s makes the . match new lines

So, roughly, we're replacing any text beginning with \n-- \n. This line of code is causing some questionable behavior that I'd love to get sorted out if someone can explain what's going on here.

Can someone explain what this line is doing? Is it just removing all text after the first \n-- \n or is there more to it?

Long winded part / real-life issue (you don't need to read this to answer the question)

My exact problem is that it is cutting the quoted content at the signature.

So if email A from a customer says:

What is going on with order ABCD?
-- Some Customer

The staff reply says (note the loss of the customer's signature)

It is shipping today

What is going on with order ABCD?

The customer replies

I did not get it, it did not ship!!!
-- Some Customer

It is shipping today

What is going on with order ABCD?

When we reply, their message will be cut at the --, which kills all the context.

It shipped today, tracking number 12345

I did not get it, it did not ship!!!

And leads to more work explaining what order it is, etc.

candyman asked Aug 07 '13 19:08


2 Answers

You're almost correct: it removes everything from the first occurrence of "\n-- \n" to the end of the string, leaving only a trailing newline if there is one. The regex engine always takes the leftmost match, so the non-greediness operator ? does not move the starting point. It only tells the engine to match the shortest possible form of the preceding pattern (.*), which here means stopping at the first place $ can match.

What this does: In email communication the signature is usually separated from the message body by exactly this pattern: a line consisting of exactly two dashes and a single trailing space. Therefore what the regex does is remove everything beginning with the signature separator to the end.

Now what your customer does (either manually or via their email client) is add the quoted reply of the email after the signature separator. This is highly unusual: the quoted reply must be located before the signature separator. I don't know of a single email client that does this on purpose, but alas there are tons of programs out there that simply get email wrong (from charset issues over quoting to SMTP non-conformance you can make an incredible number of mistakes), so I wouldn't be surprised to learn that there are indeed such clients.

Another possibility is that this is an affectation of the client -- like signing his own name after --. However, I suspect this is not done manually as humans seldom insert a trailing space after two dashes followed by a line break.
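To make the effect concrete, here is a minimal sketch that runs the substitution from the question on a message shaped like the customer's second email. The helper name strip_signature is hypothetical (it is not RT's actual function), and the message layout, with "-- " on its own line followed by the name and then the quoted thread, is an assumption based on the description above.

```perl
use strict;
use warnings;

# Hypothetical helper, not RT's actual code: applies the substitution
# from the question to a copy of the message and returns the result.
sub strip_signature {
    my ($content) = @_;
    $content =~ s/\n-- \n.*?$//s;
    return $content;
}

# Assumed layout: new text, signature separator, name, then the quoted thread.
my $reply = join "\n",
    'I did not get it, it did not ship!!!',
    '-- ',
    'Some Customer',
    '',
    'It is shipping today',
    '',
    'What is going on with order ABCD?',
    '';

# prints: I did not get it, it did not ship!!!
print strip_signature($reply);
```

Everything after the separator goes, including the quoted thread, which matches the loss of context described in the question.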

Moritz Bunkus answered Sep 28 '22 08:09


When ? follows a quantifier (?, *, + or {m,n}), it modifies the greediness of that quantifier[1]. Normally, these quantifiers match as many characters as possible, but with ?, they match as few as possible.

use feature 'say';   # or run the snippet with perl -E
say "Greedy:     ", "abc1234" =~ /\w(.*)\d/;
say "Non-greedy: ", "abc1234" =~ /\w(.*?)\d/;

Output:

Greedy:     bc123
Non-greedy: bc

Since there are two places $ can match (before a trailing newline or at the end of the string), this has the following effect:

# The /r flag (return the modified copy) requires Perl 5.14.
$_ = "abc\n-- \ndef\n";
say "Greedy:     <<" . s/\n-- \n.*$//sr  . ">>";
say "Non-greedy: <<" . s/\n-- \n.*?$//sr . ">>";

Output:

Greedy:     <<abc>>
Non-greedy: <<abc
>>

The non-greedy quantifier ensures the newline terminating the last line isn't removed. For text that ends in a newline, the following are more straightforward equivalents:

s/\n-- \n.*/\n/s

s/(?<=\n)-- \n.*//s   # Slow

s/\n\K-- \n.*//s      # Requires 5.10
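A quick way to compare these variants is to run each one on the same input. The sketch below is only an illustration: the hash keys are names made up for this example, and the second input (no trailing newline) is included to show the one case where the original differs from the three rewrites.

```perl
use strict;
use warnings;

# Each sub applies one variant to a copy of the input and returns the result.
# The key names are illustrative labels, nothing more.
my %variant = (
    original   => sub { my $s = shift; $s =~ s/\n-- \n.*?$//s;    $s },
    replace_nl => sub { my $s = shift; $s =~ s/\n-- \n.*/\n/s;    $s },
    lookbehind => sub { my $s = shift; $s =~ s/(?<=\n)-- \n.*//s; $s },
    keep       => sub { my $s = shift; $s =~ s/\n\K-- \n.*//s;    $s },  # needs Perl 5.10+
);

for my $input ("abc\n-- \ndef\n", "abc\n-- \ndef") {
    for my $name (sort keys %variant) {
        my $out = $variant{$name}->($input);
        $out =~ s/\n/\\n/g;    # make newlines visible in the output
        printf "%-10s <<%s>>\n", $name, $out;
    }
}
```

For the first input all four print <<abc\n>>. For the second, the original prints <<abc>> while the other three still print <<abc\n>>, because they always keep the newline that precedes the separator.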

Note that the substitution removes everything starting with the first signature separator.

$ perl -E'say "abc\n-- \ndef\n-- \nghi\n" =~ s/\n-- \n.*?$//sr'
abc

If you want to start removing from the last, you'll have to replace .* with something guaranteed not to match --.

$ perl -E'say "abc\n-- \ndef\n-- \nghi\n" =~ s/\n-- \n(?:(?!-- \n).)*?$//sr'
abc
-- 
def

Notes:

  1. It also has the same meaning if it follows another quantifier modifier (e.g. /.*+?/).

ikegami answered Sep 28 '22 09:09