 

Does awk CR LF handling break on Cygwin?

Tags: linux, bash, awk

On Linux, this runs as expected:

$ echo -e "line1\r\nline2"|awk -v RS="\r\n" '/^line/ {print "awk: "$0}'
awk: line1
awk: line2

But under Windows (Cygwin) the \r is dropped, so RS never matches and awk treats the whole input as one record:

$ echo -e "line1\r\nline2"|awk -v RS="\r\n" '/^line/ {print "awk: "$0}'
awk: line1
line2

Versions: GNU Awk 4.0.1 on Windows, GNU Awk 3.1.8 on Linux.

EDIT from @EdMorton (apologies if this addition is unwanted, but I think it helps demonstrate the issue):

Consider this RS setting and input (on Cygwin):

$ awk 'BEGIN{printf "\"%s\"\n", RS}' | cat -v
"
"
$ echo -e "line1\r\nline2" | cat -v
line1^M
line2

This is Solaris with gawk:

$ echo -e "line1\r\nline2" | awk '1' | cat -v
line1^M
line2

and this is Cygwin with gawk:

$ echo -e "line1\r\nline2" | awk '1' | cat -v
line1
line2

RS was just its default newline, so where did the control-M go on Cygwin?

jcalfee314 asked Jun 16 '14 20:06


2 Answers

The issue seems to be specific to awk under Cygwin.
I tried a few different things, and it looks like awk is silently replacing \r\n with \n in the input data.

If we simply ask awk to repeat the text unmodified, it will "sanitize" the carriage returns without asking:

$ echo -e "line1\r\nline2" | od -a
0000000   l   i   n   e   1  cr  nl   l   i   n   e   2  nl
0000015

$ echo -e "line1\r\nline2" | awk '{ print $0; }' | od -a
0000000   l   i   n   e   1  nl   l   i   n   e   2  nl
0000014

It will, however, leave other carriage returns intact:

$ echo -e "Test\rTesting\r\nTester\rTested" | awk '{ print $0; }' | od -a
0000000   T   e   s   t  cr   T   e   s   t   i   n   g  nl   T   e   s
0000020   t   e   r  cr   T   e   s   t   e   d  nl
0000033

With a custom record separator of _, a carriage return that is not directly followed by a newline in the input is also left intact:

$ echo -e "Testing\r_Tested" | awk -v RS="_" '{ print $0; }' | od -a
0000000   T   e   s   t   i   n   g  cr  nl   T   e   s   t   e   d  nl
0000020  nl
0000021

The most telling example involves having \r\n in the data, but not as a record separator:

$ echo -e "Testing\r\nTested_Hello_World" | awk -v RS="_" '{ print $0; }' | od -a
0000000   T   e   s   t   i   n   g  nl   T   e   s   t   e   d  nl   H
0000020   e   l   l   o  nl   W   o   r   l   d  nl  nl
0000034

awk is blindly converting \r\n to \n in the input data even though we didn't ask it to.
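
A quick way to check whether a particular awk build does this is to compare byte counts before and after a pass-through. This is only a rough sketch: the sample function and the variable names are my own, and it assumes printf, wc and awk are on the PATH.

```shell
# Emit a known CR LF sample: "line1", CR, LF, "line2", LF (13 bytes).
sample() { printf 'line1\r\nline2\n'; }

# $(( ... )) normalizes any padding that wc may put around the number.
in_bytes=$(( $(sample | wc -c) ))
out_bytes=$(( $(sample | awk '1' | wc -c) ))

if [ "$in_bytes" -eq "$out_bytes" ]; then
    echo "awk passed the input through unchanged ($out_bytes bytes)"
else
    echo "awk dropped $(( in_bytes - out_bytes )) byte(s), most likely the CR"
fi
```

If the two counts differ by one, the awk on that machine is stripping the carriage return in the same way as the od output above.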

This substitution seems to be happening before applying record separation, which explains why RS="\r\n" never matches anything. By the time awk is looking for \r\n, it's already substituted it with \n in the input data.
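
If the same script has to run on both platforms, one possible workaround (my own suggestion, not something I have tested on every awk) is to make the carriage return optional in RS. gawk treats a multi-character RS as a regular expression, so the pattern below should match whether or not the \r has already been stripped:

```shell
# RS is a regex here: an optional CR followed by LF.
# Works on input that still has CR LF and on input already reduced to LF.
out=$(printf 'line1\r\nline2\n' |
      awk -v RS='\r?\n' '/^line/ { print "awk: " $0 }')
echo "$out"
```

This sidesteps the conversion rather than disabling it, so any \r\n that appears inside a record would still be rewritten on an affected build.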

Mr. Llama answered Nov 15 '22 09:11


I just checked with Arnold Robbins (the maintainer of gawk). The answer is that the conversion is done by the C libraries, and to stop it you should set the awk BINMODE variable to 3 (binary mode for both reading and writing):

$ echo -e "line1\r\nline2" | awk '1' | cat -v
line1
line2

$ echo -e "line1\r\nline2" | awk -v BINMODE=3 '1' | cat -v
line1^M
line2

See the gawk man page for more information on BINMODE.
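
Applied to the command from the question, that would look something like the sketch below. As far as I know, BINMODE is only meaningful on non-POSIX platforms and is harmless elsewhere, so the same line should also run on Linux. Putting -v BINMODE=3 first is just my habit, so that it is set before any input is read.

```shell
# BINMODE=3 asks gawk for binary mode on input and output,
# so the CR survives and RS="\r\n" has something to match.
out=$(printf 'line1\r\nline2\n' |
      awk -v BINMODE=3 -v RS='\r\n' '/^line/ { print "awk: " $0 }')
echo "$out"
```

Note that the input ends with a bare newline, so the last record keeps it and awk prints one extra blank line; the command substitution here strips it.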

Ed Morton answered Nov 15 '22 09:11