
Special characters become question marks after Command line find and replace

I have a text file input.xlf

  <trans-unit id="loco:5e7257a0c38e0f5b456bae94">
    &lt;source&gt;Login</source>
    <target>登入</target>
    <note>Login Header</note>
  </trans-unit>

Basically, I need to replace &lt; with < and &gt; with >, so I run the script below:

runner.bat

powershell -Command "(gc input.xlf) -replace '&lt;', '<' | Out-File -encoding ASCII output.xlf";
powershell -Command "(gc output.xlf) -replace '&gt;', '>' | Out-File -encoding ASCII  output.xlf";

The above seemed to work until I noticed the following output:

  <trans-unit id="loco:5e7257a0c38e0f5b456bae94">
    <source>Login</source>
    <target>??????</target>
    <note>Login Header</note>
  </trans-unit>

I tried removing the -encoding argument, but now I get:

 <trans-unit id="loco:5e7257a0c38e0f5b456bae94">
   <source>Login</source>
   <target>登入</target>
   <note>Login Header</note>  
 </trans-unit>

Below is my desired output

  <trans-unit id="loco:5e7257a0c38e0f5b456bae94">
    <source>Login</source>
    <target>登入</target>
    <note>Login Header</note>
  </trans-unit>

Owen Kelvin asked Mar 02 '23 09:03


1 Answer

There are (potentially) two character-encoding problems:

  • On output, -Encoding Ascii lossily replaces every non-ASCII character with a literal ? character.

    • To preserve all characters, you must choose a Unicode encoding, such as -Encoding Utf8.
  • On input, you must ensure that the input file is correctly read by PowerShell.

    • Specifically, Windows PowerShell misinterprets BOM-less UTF-8 files as ANSI-encoded, so you need to use -Encoding Utf8 with Get-Content too.
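
Both problems can be reproduced outside PowerShell. The following is a minimal Python sketch, used only to illustrate the byte-level mechanism and not part of the original script. It assumes cp1252 as the ANSI code page, which in practice depends on the system locale:

```python
# The translated text from the sample file.
text = "登入"

# UTF-8 stores these two characters as six bytes.
utf8_bytes = text.encode("utf-8")

# Input problem: BOM-less UTF-8 bytes decoded as an ANSI code page.
# ASSUMPTION: cp1252 stands in for the ANSI code page here.
misread = utf8_bytes.decode("cp1252")

# Output problem: ASCII cannot represent these characters,
# so every one of them is replaced with "?".
ascii_out = misread.encode("ascii", errors="replace").decode("ascii")

# Correct handling: decode as UTF-8 and encode as UTF-8.
round_trip = utf8_bytes.decode("utf-8").encode("utf-8").decode("utf-8")

print(len(utf8_bytes))     # 6
print(ascii_out)           # ??????
print(round_trip == text)  # True
```

The six question marks match the six UTF-8 bytes of 登入, which suggests that the output in the question went through both problems: the file was misread as ANSI, and the result was then written as ASCII.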

Additionally, a single powershell.exe call is enough, and that call can be optimized further:

powershell -Command "(gc -Raw -Encoding utf8 input.xlf) -replace '&lt;', '<' -replace '&gt;', '>' | Set-Content -NoNewLine -Encoding Utf8 output.xlf"

  • Using -Raw with gc (Get-Content) reads the file as a single string instead of an array of lines, which speeds up the -replace operations.

  • You can chain -replace operations.

  • With input that is already text (strings), Set-Content is generally faster than Out-File.[1]
    -NoNewLine prevents an extra trailing newline from being appended.
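
To cross-check the result of the command above, the same read, replace, and write steps can be sketched in Python. This is an illustrative equivalent, not the answer's method, and the function names unescape_markup and convert_file are hypothetical. Note that str.replace is case-sensitive and literal, whereas PowerShell's -replace is a case-insensitive regex match:

```python
from pathlib import Path


def unescape_markup(raw: bytes) -> bytes:
    """Decode UTF-8 bytes, undo the &lt; and &gt; escapes, re-encode as UTF-8."""
    # Decode the whole input at once (comparable to gc -Raw -Encoding utf8).
    text = raw.decode("utf-8")
    # Chain both replacements, like the chained -replace operators.
    text = text.replace("&lt;", "<").replace("&gt;", ">")
    # Encode as UTF-8; nothing is appended (comparable to -NoNewLine).
    return text.encode("utf-8")


def convert_file(src: str, dst: str) -> None:
    """Apply unescape_markup to the file at src and write the result to dst."""
    # Working on bytes keeps the original line endings untouched.
    Path(dst).write_bytes(unescape_markup(Path(src).read_bytes()))


if __name__ == "__main__":
    sample = "&lt;source&gt;Login</source>\r\n<target>登入</target>".encode("utf-8")
    print(unescape_markup(sample).decode("utf-8"))
```

One difference to keep in mind when comparing files byte for byte: in Windows PowerShell, -Encoding Utf8 writes a BOM at the start of the file, while this sketch does not add one.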


[1] It makes virtually no difference here, given that only a single string is written, but with many input strings (line-by-line output) it may; see this answer.

mklement0 answered Mar 10 '23 12:03