
How to make a perl one-liner "line-endings agnostic"

I spent an hour scratching my head over a Perl one-liner that failed because the file had CRLF line endings. The regex has a capture group at the end of the line, and the CR was included in the match, which broke the replacement that uses the backreference.

I ended up specifying the CRLF manually in the regex, but is there a way to get Perl to handle line endings automatically, whatever they are?

Original command is

perl -pe  's/foo bar(.*)$/foo $1 bar/g' file.txt

"Correct" command is

perl -pe  's/foo bar(.*)\r\n/foo $1 bar\r\n/g' file.txt

I know I can also convert line endings before processing; I'm interested in how to get Perl to handle this case gracefully.

Example file (save with CRLF line endings!)

[19:06:57.033] foo barmy
[19:06:57.033] foo baryour

Expected output

[19:06:57.033] foo my bar
[19:06:57.033] foo your bar

Output with original command (bar ends up at the beginning of the line because the carriage return is captured in $1 and printed before it):

bar:06:57.033] foo my
bar:06:57.033] foo your
CharlesB asked Oct 30 '13 12:10


4 Answers

Is there a way to get Perl to handle platform-specific line endings automatically?

Yes. It's actually the default.

The issue is that you're trying to handle Windows line endings on a unix platform.

This will definitely do it:

perl -pe'
    BEGIN {
       binmode STDIN,  ":crlf";
       binmode STDOUT, ":crlf";
    }
    s/foo bar(.*)$/foo $1 bar/g;
' <file.txt

Might I suggest you keep doing it manually?

Alternatively, you could convert the file to Unix line endings, process it, and convert it back.

<file.orig dos2unix | perl -pe'...' | unix2dos >file.new
ikegami answered Nov 11 '22 19:11


In Perl 5.10 and later, you can use \R in your regex to match any end-of-line sequence (it covers \n, \r, and \r\n). See perldoc perlre.

Ether answered Nov 11 '22 20:11


The \R escape sequence (Perl v5.10+; see perldoc perlrebackslash or the online documentation), which matches "generic newlines" platform-agnostically, can be made to work here (the example uses Bash to create the multi-line input string):

$ printf 'foo barmy\r\nfoo baryour\r\n' | perl -pe 's/foo bar(.*?)\R/foo $1 bar\n/gm'
foo my bar
foo your bar

Note that the only difference to Ether's answer is use of a non-greedy construct (.*? rather than just .*), which makes all the difference here.

Read on, if you want to know more.


Background:

It is an example of a pitfall associated with \R, which stems from the fact that it can match one or two characters - either \r\n or, typically, \n:[1]

With the greedy (.*) construct, "my\r" (including the \r) is captured: . matches \r but not \n, so .* runs right up to the \n, and that remaining \n by itself also satisfies \R.

By contrast, using the non-greedy (.*?) construct causes \R to match the \r\n sequence, as intended.

[1] \R matches MORE than just \r\n and \n: it also matches any single character that is classified as vertical whitespace in Unicode terms, which includes \x0B (vertical tab), \f (form feed), \r (by itself), and the following Unicode chars: 0x85 (NEXT LINE), 0x2028 (LINE SEPARATOR) and 0x2029 (PARAGRAPH SEPARATOR).

mklement0 answered Nov 11 '22 19:11


You can say:

perl -pe 's/foo bar([^\015]*)(\015?\012)/foo $1 bar$2/g' *.txt

The line endings would be preserved, i.e. would be the same as the input file.


You might also want to refer to perldoc perlport.

devnull answered Nov 11 '22 21:11