
Why does this line numbering command mangle the character encoding?

I'd like to modify a file by adding line numbers to the beginning of each line. I've found that the following command does this:

cat file | perl -pe '$_ = "$. $_"' > file_with_line_numbers

This seems to work; however, when I open the resulting file in vim it's full of ^@ and ^M characters. Further investigation shows that the encoding has changed:

> file -bi file
text/plain; charset=utf-16le

> file -bi file_with_line_numbers
application/octet-stream; charset=binary

What am I missing here?

cachance7 asked Dec 08 '22 18:12


2 Answers

Because you're neither decoding your input nor encoding your output. Perl reads the file as raw bytes, so $_ holds undecoded UTF-16LE data, while the "$. " prefix is added as plain single-byte characters. Concatenating the two mixes data in two different encodings, and the result is no longer valid UTF-16LE.

One fix would be:

perl -pe 'BEGIN { binmode STDIN, ":encoding(utf16le)"; binmode STDOUT, ":encoding(utf16le)" } $_ = "$. $_";' < input > output
hobbs answered Dec 11 '22 08:12


You need to decode your program's input and encode your program's output.

As ysth points out, the following will do the trick (except on Windows, though it will probably work there under Cygwin):

perl -Mopen=:std,':encoding(utf-16le)' -pe'$_="$. $_";' file.in >file.out

Rest of original answer:

This is easiest if the data is UTF-8, since you can then use -CSDA.

<file.in iconv -f UTF-16le -t UTF-8 \
   | perl -CSDA -pe'$_="$. $_";' \
     | iconv -f UTF-8 -t UTF-16le \
       >file.out

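Because UTF-8 is ASCII-compatible (ASCII characters such as digits, spaces, and newlines are single bytes that never occur inside a multi-byte sequence), you can get away without decoding or encoding at all in this case, which allows either of the following: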

<file.in iconv -f UTF-16le -t UTF-8 \
   | perl -pe'$_="$. $_";' \
     | iconv -f UTF-8 -t UTF-16le \
       >file.out

or

<file.in iconv -f UTF-16le -t UTF-8 \
   | nl \
     | iconv -f UTF-8 -t UTF-16le \
       >file.out
ikegami answered Dec 11 '22 08:12