
Why does PowerShell file concatenation convert UTF-8 to UTF-16?

I am running the following PowerShell script to concatenate a series of output files into a single CSV file. The input files are named whidataXX.htm (where XX is a two-digit sequential number), and the number of files created varies from run to run.

$metadataPath = "\\ServerPath\foo" 

function concatenateMetadata {
    $cFile = $metadataPath + "whiconcat.csv"
    Clear-Content $cFile
    $metadataFiles = gci $metadataPath
    $iterations = $metadataFiles.Count
    for ($i=0;$i -le $iterations-1;$i++) {
        $iFile = "whidata"+$i+".htm"
        $FileExists = (Test-Path $metadataPath$iFile -PathType Leaf)
        if (!($FileExists))
        {
            break
        }
        elseif ($FileExists)
        {
            Write-Host "Adding " $metadataPath$iFile
            Get-Content $metadataPath$iFile | Out-File $cFile -append
            Write-Host "to" $cfile
        }
    }
} 

The whidataXX.htm files are encoded as UTF-8, but my output file is encoded as UTF-16. When I view the file in Notepad, it appears correct, but when I view it in a hex editor, the byte 00 appears between each character. When I pull the file into a Java program for processing, the file prints to the console with extra spaces between c h a r a c t e r s.

First, is this normal for PowerShell, or is there something in the source files that would cause this?

Second, how would I fix this encoding problem in the code noted above?

dwwilson66 asked Oct 15 '13 18:10



1 Answer

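The Out-* cmdlets (like Out-File) format the data before writing it, and in Windows PowerShell the default output encoding is "Unicode", which means UTF-16LE. So this is normal behavior and is not caused by anything in your source files.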
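To see where the 00 bytes in the hex editor come from, here is a minimal sketch that encodes the same text both ways and prints the bytes. The sample string "abc" is just an illustration and does not come from the question's files.

```powershell
# Hypothetical sample text; any ASCII string shows the same pattern.
$sample = "abc"

# [System.Text.Encoding]::Unicode is .NET's name for UTF-16LE.
$utf16Bytes = [System.Text.Encoding]::Unicode.GetBytes($sample)
$utf8Bytes  = [System.Text.Encoding]::UTF8.GetBytes($sample)

# Render each byte as two hex digits, separated by spaces.
$utf16Hex = ($utf16Bytes | ForEach-Object { $_.ToString("X2") }) -join " "
$utf8Hex  = ($utf8Bytes  | ForEach-Object { $_.ToString("X2") }) -join " "

Write-Host "UTF-16LE: $utf16Hex"   # 61 00 62 00 63 00
Write-Host "UTF-8:    $utf8Hex"    # 61 62 63
```

Each ASCII character takes two bytes in UTF-16LE, the second of which is 00. A program that reads those bytes as a single-byte encoding shows the 00 as a gap between characters.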

You can add an -Encoding parameter to Out-File:

Get-Content $metadataPath$iFile | Out-File $cFile -Encoding UTF8 -append

Or switch to Add-Content, which doesn't reformat the data:

Get-Content $metadataPath$iFile | Add-Content $cFile 
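Put together, the function from the question might look roughly like the sketch below. It uses the Out-File -Encoding UTF8 variant. The use of Join-Path, New-Item -Force in place of Clear-Content, and -Encoding UTF8 on Get-Content are assumptions added for illustration and are not part of the original script or answer.

```powershell
$metadataPath = "\\ServerPath\foo"   # placeholder path from the question

function concatenateMetadata {
    # Assumption: Join-Path is used so a separator is placed between folder and file name.
    $cFile = Join-Path $metadataPath "whiconcat.csv"

    # Assumption: create or truncate the output file, so it need not exist beforehand.
    New-Item -Path $cFile -ItemType File -Force | Out-Null

    $i = 0
    while ($true) {
        # File naming follows the question's code ("whidata" + index + ".htm").
        $iFile = Join-Path $metadataPath ("whidata" + $i + ".htm")

        # Stop at the first missing file, as the original loop does.
        if (-not (Test-Path $iFile -PathType Leaf)) { break }

        Write-Host "Adding $iFile to $cFile"

        # Read as UTF-8 and write as UTF-8 instead of the UTF-16LE default.
        Get-Content $iFile -Encoding UTF8 | Out-File $cFile -Encoding UTF8 -Append

        $i++
    }
}
```

Call it as concatenateMetadata once $metadataPath points at a real folder. Note that in Windows PowerShell, -Encoding UTF8 writes a byte order mark at the start of the file, which a downstream reader may need to skip.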
mjolinor answered Oct 21 '22 03:10