I'd like to modify a file by adding line numbers to the beginning of each line. I've found that the following command does this:
cat file | perl -pe '$_ = "$. $_"' > file_with_line_numbers
This seems to work; however, when I open the output in vim it's full of ^@ and ^M characters. Further investigation shows that the encoding has changed:
> file -bi file
text/plain; charset=utf-16le
> file -bi file_with_line_numbers
application/octet-stream; charset=binary
What am I missing here?
Because you're not decoding your input and you're not encoding your output. By concatenating $. with $_, you're mixing data in two different encodings (more precisely, you're mixing a byte string with a character string; Perl implicitly converts the byte string to a character string, and does so in a way that is very wrong for your data).
One fix would be:
perl -pe 'BEGIN { binmode STDIN, ":encoding(utf16le)"; binmode STDOUT, ":encoding(utf16le)" } $_ = "$. $_";' < input > output
You need to decode your program's input and encode your program's output.
As ysth points out, this will do the trick (except on Windows, though it will likely work under Cygwin):
perl -Mopen=:std,':encoding(utf-16le)' -pe'$_="$. $_";' file.in >file.out
Rest of original answer:
This is easiest if you have UTF-8, since you can then use -CSDA (UTF-8 on the standard handles, the default I/O layers, and @ARGV).
<file.in iconv -f UTF-16le -t UTF-8 \
| perl -CSDA -pe'$_="$. $_";' \
| iconv -f UTF-8 -t UTF-16le \
>file.out
Because ASCII bytes never appear inside a multi-byte UTF-8 sequence, you can skip the decoding/encoding entirely in this case and use either of the following:
<file.in iconv -f UTF-16le -t UTF-8 \
| perl -pe'$_="$. $_";' \
| iconv -f UTF-8 -t UTF-16le \
>file.out
or
<file.in iconv -f UTF-16le -t UTF-8 \
| nl \
| iconv -f UTF-8 -t UTF-16le \
>file.out