I scratched my head for an hour over a Perl one-liner that failed because the file had CRLF line endings. The regex has a capture group at the end of the line, and the CR got included in the match, which broke the replacement that uses the backreference.
I ended up specifying the CRLF manually in the regex, but is there a way to get Perl to handle line endings automatically, whatever they are?
Original command is
perl -pe 's/foo bar(.*)$/foo $1 bar/g' file.txt
"Correct" command is
perl -pe 's/foo bar(.*)\r\n/foo $1 bar\r\n/g' file.txt
I know I can also convert line endings before processing, but I'm interested in how to get Perl to handle this case gracefully.
Example file (save with CRLF line endings!)
[19:06:57.033] foo barmy
[19:06:57.033] foo baryour
Expected output
[19:06:57.033] foo my bar
[19:06:57.033] foo your bar
Output with original command (bar goes at line beginning because it's matched together with carriage return):
bar:06:57.033] foo my
bar:06:57.033] foo your
Is there a way to get Perl to automatically handle platform-specific line endings?
Yes. It's actually the default.
The issue is that you're trying to handle Windows line endings on a unix platform.
This will definitely do it:
perl -pe'
BEGIN {
binmode STDIN, ":crlf";
binmode STDOUT, ":crlf";
}
s/foo bar(.*)$/foo $1 bar/g;
' <file.txt
Might I suggest you keep doing it manually?
Alternatively, you could convert the file to Unix line endings, process it, and convert it back.
<file.orig dos2unix | perl -pe'...' | unix2dos >file.new
In newer perls, you can use \R in your regex to strip off all end-of-line characters (it includes both \n and \r). See perldoc perlre.
The \R escape sequence (Perl v5.10+; see perldoc perlrebackslash or the documentation online), which matches "generic newlines" (platform-agnostically), can be made to work here (example uses Bash to create the multi-line input string):
$ printf 'foo barmy\r\nfoo baryour\r\n' | perl -pe 's/foo bar(.*?)\R/foo $1 bar\n/gm'
foo my bar
foo your bar
Note that the only difference to Ether's answer is use of a non-greedy construct (.*? rather than just .*), which makes all the difference here.
Read on, if you want to know more.
Background:
It is an example of a pitfall associated with \R, which stems from the fact that it can match one or two characters: either \r\n or, typically, \n.[1]
With the greedy (.*) construct, "my\r" (including the \r) is captured: . matches any character except \n, so .* consumes the \r as well, and the remaining \n by itself also satisfies \R.
By contrast, using the non-greedy (.*?) construct causes \R to match the \r\n sequence, as intended.
[1] \R matches more than just \r\n and \n: besides the \r\n sequence, it matches any single character that is classified as vertical whitespace in Unicode terms (the set that \v matches in a Perl regex), which also includes vertical tab (\x0B), form feed (\f), \r by itself, and the following Unicode characters: U+0085 (NEXT LINE), U+2028 (LINE SEPARATOR), and U+2029 (PARAGRAPH SEPARATOR).
You can say:
perl -pe 's/foo bar([^\015]*)(\015?\012)/foo $1 bar$2/g' *.txt
The line endings would be preserved, i.e. would be the same as the input file.
You might also want to refer to perldoc perlport.