The intent of this question is to provide an answer to the daily questions whose answer is "you have DOS line endings" so we can simply close them as duplicates of this one without repeating the same answers ad nauseam.
NOTE: This is NOT a duplicate of any existing question. The intent of this Q&A is not just to provide a "run this tool" answer but also to explain the issue, so that we can point anyone with a related question here and they will find a clear explanation of why they were pointed here as well as the tool to run to solve their problem. I spent hours reading all of the existing Q&A and they are all lacking in the explanation of the issue, alternative tools that can be used to solve it, and/or the pros/cons/caveats of the possible solutions. Also, some of them have accepted answers that are just plain dangerous and should never be used.
Now back to the typical question that would result in a referral here:
I have a file containing 1 line:
what isgoingon
and when I print it using this awk script to reverse the order of the fields:
awk '{print $2, $1}' file
instead of seeing the output I expect:
isgoingon what
the field that should be at the end of the line appears at the start of the line, overwriting some of the text there:
whatngon
or I get the output split onto 2 lines:
isgoingon
what
What could the problem be and how do I fix it?
The problem is that your input file uses DOS line endings of CRLF instead of UNIX line endings of just LF, and you are running a UNIX tool on it, so the CR remains part of the data being operated on by the UNIX tool. CR is commonly denoted by \r and can be seen as a control-M (^M) when you run cat -vE on the file, while LF is \n and appears as $ with cat -vE.
So your input file wasn't really just:
what isgoingon
it was actually:
what isgoingon\r\n
as you can see with cat -vE:
$ cat -vE file
what isgoingon^M$
and od -c:
$ od -c file
0000000   w   h   a   t       i   s   g   o   i   n   g   o   n  \r  \n
0000020
so when you run a UNIX tool like awk (which treats \n as the line ending) on the file, the \n is consumed by the act of reading the line, but that leaves the 2 fields as:
<what> <isgoingon\r>
Note the \r at the end of the second field. \r means Carriage Return, which is literally an instruction to return the cursor to the start of the line, so when you do:
print $2, $1
awk will print isgoingon and then will return the cursor to the start of the line before printing what, which is why the what appears to overwrite the start of isgoingon.
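To confirm that stray CRs are the cause before changing anything, a check along these lines may help. It is only a minimal sketch: has_crlf is a hypothetical helper name (not a standard tool), and it assumes an awk that understands \r in regexes, as the awk fix in this answer also does.

```shell
# has_crlf FILE: succeed (exit status 0) if any line of FILE ends in CR.
# "has_crlf" is a made-up name for illustration only.
has_crlf() {
    # Stop at the first line ending in CR; END turns the flag into the exit status.
    awk '/\r$/ { found = 1; exit } END { exit !found }' "$1"
}
```

It could be used as, for example: has_crlf file && echo "file has DOS line endings"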
To fix the problem, do any of these:
dos2unix file
sed 's/\r$//' file
awk '{sub(/\r$/,"")}1' file
perl -pe 's/\r$//' file
Apparently dos2unix is also known as fromdos (from the tofrodos package) in some UNIX variants (e.g. Ubuntu).
Be careful if you decide to use tr -d '\r' as is often suggested, as that will delete all \rs in your file, not just those at the end of each line.
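Of the commands above, only dos2unix rewrites the file by default; the sed, awk and perl commands as shown write the converted text to standard output and leave the file untouched. If you want to update the file itself with one of those, a sketch like the following may do. strip_cr_inplace is a hypothetical helper name, and it assumes mktemp is available (it is common but not required by POSIX).

```shell
# strip_cr_inplace FILE: remove a trailing CR from every line of FILE.
# Hypothetical helper; assumes mktemp exists on your system.
strip_cr_inplace() {
    file=$1
    tmp=$(mktemp) || return 1
    # Convert into a temp file first so a failure cannot truncate the original.
    if awk '{ sub(/\r$/, "") } 1' "$file" > "$tmp"; then
        # Copy the contents back so the original's permissions are kept.
        cat "$tmp" > "$file"
        rm -f "$tmp"
    else
        rm -f "$tmp"
        return 1
    fi
}
```

Writing to a temporary file first is the important part: redirecting a command's output straight onto its own input file would empty the file before it is read.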
Note that GNU awk will let you parse files that have DOS line endings by simply setting RS appropriately:
gawk -v RS='\r\n' '...' file
but other awks will not allow that, as POSIX only requires awks to support a single-character RS, and most other awks will quietly truncate RS='\r\n' to RS='\r'. You may need to add -v BINMODE=3 for gawk to even see the \rs, though, as the underlying C primitives will strip them on some platforms, e.g. cygwin.
One thing to watch out for is that CSVs created by Windows tools like Excel will use CRLF as the line endings but can have LFs embedded inside a specific field of the CSV, e.g.:
"field1","field2.1
field2.2","field3"
is really:
"field1","field2.1\nfield2.2","field3"\r\n
so if you just convert \r\ns to \ns then you can no longer tell linefeeds within fields from linefeeds as line endings. If you want to do that, I recommend converting all of the intra-field linefeeds to something else first, e.g. this would convert all intra-field LFs to tabs and convert all line-ending CRLFs to LFs:
gawk -v RS='\r\n' '{gsub(/\n/,"\t")}1' file
Doing similar without GNU awk is left as an exercise, but with other awks it involves combining lines that do not end in CR as they're read.
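One possible way to do that exercise is sketched below. It assumes that every real record ends in CRLF and that a bare LF only ever occurs inside a field; csv_crlf_to_lf is a hypothetical name chosen for illustration.

```shell
# csv_crlf_to_lf [FILE...]: join physical lines until one ends in CR,
# turning the embedded LFs into tabs and the CRLF record endings into LF.
# Hypothetical helper; assumes bare LFs only occur inside fields.
csv_crlf_to_lf() {
    awk '
        # Append the current physical line to the record being built.
        { buf = (have ? buf "\t" : "") $0; have = 1 }
        # A line ending in CR completes the record: drop the CR and print.
        /\r$/ { sub(/\r$/, "", buf); print buf; buf = ""; have = 0 }
        # Flush any final record that did not end in CRLF.
        END { if (have) print buf }
    ' "$@"
}
```

For the sample record above, this should print the three fields on a single line with a tab where the embedded linefeed was.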
Also note that though CR is part of the [[:space:]] POSIX character class, it is not one of the whitespace characters that separate fields when the default FS of " " is used; those whitespace characters are only tab, blank, and newline. This can lead to confusing results if your input can have blanks before CRLF:
$ printf 'x y \n'
x y
$ printf 'x y \n' | awk '{print $NF}'
y
$
$ printf 'x y \r\n'
x y
$ printf 'x y \r\n' | awk '{print $NF}'
$
That's because field-separator white space at the beginning or end of a line is ignored when the line has LF line endings, but \r is the final field on a line with CRLF line endings if the character before it was whitespace:
$ printf 'x y \r\n' | awk '{print $NF}' | cat -Ev
^M$
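If converting the file first is not an option, stripping the CR inside the awk script before touching any fields should avoid this, because changing $0 with sub() makes awk split the record into fields again. A minimal sketch, with last_field as a hypothetical wrapper name:

```shell
# last_field [FILE...]: print the last field of each line, ignoring a trailing CR.
# Hypothetical wrapper for illustration.
last_field() {
    # sub() modifies $0, so the fields are re-split before $NF is evaluated.
    awk '{ sub(/\r$/, ""); print $NF }' "$@"
}
```

With this, printf 'x y \r\n' | last_field gives the same last field as the LF-only input does.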
You can use the \R shorthand character class in PCRE for files with unknown line endings. There are even more line endings to consider with Unicode or other platforms. The \R form is a recommended character class from the Unicode consortium to represent all forms of a generic newline.
So if you have an 'extra' line-ending character, the regex s/\R$/\n/ will find it and normalize any combination of line endings into \n. Alternatively, you can use s/\R/\n/g to capture any notion of 'line ending' and standardize it into a \n character.
Given:
$ printf "what\risgoingon\r\n" > file
$ od -c file
0000000 w h a t \r i s g o i n g o n \r \n
0000020
Perl and Ruby and most flavors of PCRE implement \R, which can be combined with the end-of-string assertion $ (end of line in multi-line mode):
$ perl -pe 's/\R$/\n/' file | od -c
0000000 w h a t \r i s g o i n g o n \n
0000017
$ ruby -pe '$_.sub!(/\R$/,"\n")' file | od -c
0000000 w h a t \r i s g o i n g o n \n
0000017
(Note that the \r between the two words is correctly left alone.)
If you do not have \R you can use the equivalent of (?>\r\n|\v) in PCRE.
With straight POSIX tools, your best bet is likely awk, like so:
$ awk '{sub(/\r$/,"")} 1' file | od -c
0000000 w h a t \r i s g o i n g o n \n
0000017
Things that kinda work (but know your limitations):
tr deletes all \r even if used in another context (granted, the use of \r is rare, and XML processing requires that \r be deleted, so tr is a great solution):
$ tr -d "\r" < file | od -c
0000000 w h a t i s g o i n g o n \n
0000016
GNU sed works, but not POSIX sed, since \r and \x0D are not supported on POSIX.
GNU sed only:
$ sed 's/\x0D$//' file | od -c # also sed 's/\r$//'
0000000 w h a t \r i s g o i n g o n \n
0000017
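If you need this with a sed that does not understand \r or \x0D, one workaround is to let the shell put a literal CR into the sed script. This is a sketch only; strip_cr_sed is a hypothetical name, and it relies on printf producing a CR for '\r'.

```shell
# strip_cr_sed [FILE...]: remove a trailing CR from each line using any sed.
# Hypothetical helper for illustration.
strip_cr_sed() {
    # Command substitution only strips trailing newlines, so the CR survives.
    cr=$(printf '\r')
    # \$ keeps the shell from treating $ specially; sed sees CR followed by $.
    sed "s/${cr}\$//" "$@"
}
```

Run as strip_cr_sed file | od -c, this should leave the \r between the two words alone, as the awk version does.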
The Unicode Regular Expression Guide is probably the best bet for a definitive treatment of what a "newline" is.