I am running the following PowerShell script to concatenate a series of output files into a single CSV file. The files are named whidataXX.htm (where XX is a two-digit sequential number), and the number of files created varies from run to run.
$metadataPath = "\\ServerPath\foo"
function concatenateMetadata {
    $cFile = $metadataPath + "whiconcat.csv"
    Clear-Content $cFile
    $metadataFiles = gci $metadataPath
    $iterations = $metadataFiles.Count
    for ($i=0;$i -le $iterations-1;$i++) {
        $iFile = "whidata"+$i+".htm"
        $FileExists = (Test-Path $metadataPath$iFile -PathType Leaf)
        if (!($FileExists))
        {
            break
        }
        elseif ($FileExists)
        {
            Write-Host "Adding " $metadataPath$iFile
            Get-Content $metadataPath$iFile | Out-File $cFile -append
            Write-Host "to" $cfile
        }
    }
}
The whidataXX.htm files are encoded UTF8, but my output file is encoded UTF16. When I view the file in Notepad, it appears correct, but when I view it in a hex editor, the hex value 00 appears between each character, and when I pull the file into a Java program for processing, the file prints to the console with extra spaces between c h a r a c t e r s.
First, is this normal for PowerShell, or is there something in the source files that would cause this?
Second, how would I fix this encoding problem in the code noted above?
Among the values accepted by -Encoding, UTF7 uses UTF-7 and UTF8 uses UTF-8 (with BOM). In general, Windows PowerShell uses the Unicode UTF-16LE encoding by default. However, the default encoding used by cmdlets in Windows PowerShell is not consistent. Using any Unicode encoding, except UTF7, always creates a BOM.
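To see which encoding a given PowerShell installation actually uses, one option is to write a short string with Out-File and look at the raw bytes. The following is only a sketch: the scratch file name encoding-demo.txt is hypothetical, and the result depends on the PowerShell version in use.

```powershell
# Hypothetical scratch file in the temp folder; any writable path would do.
$tmp = Join-Path ([System.IO.Path]::GetTempPath()) 'encoding-demo.txt'

# Write the text with the cmdlet's default encoding and capture the raw bytes.
'abc' | Out-File $tmp
$defaultBytes = [System.IO.File]::ReadAllBytes($tmp)

# Write the same text again, this time asking for UTF-8 explicitly.
'abc' | Out-File $tmp -Encoding UTF8
$utf8Bytes = [System.IO.File]::ReadAllBytes($tmp)

# Print each byte as two hex digits so BOMs and 00 bytes are visible.
'Default: ' + (($defaultBytes | ForEach-Object { $_.ToString('X2') }) -join ' ')
'UTF8:    ' + (($utf8Bytes | ForEach-Object { $_.ToString('X2') }) -join ' ')

Remove-Item $tmp
```

If the "Default" line starts with FF FE and shows 00 after every letter, the default is UTF-16LE, which matches the symptom described in the question. Newer PowerShell releases may show plain UTF-8 bytes on both lines instead.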
UTF8 and UTF16 are two different encodings. UTF8 uses a variable-length encoding scheme that encodes each Unicode code point using one to four bytes, while UTF16 encodes each code point using either two or four bytes.
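The difference in size can be checked directly with the .NET encoding classes. This is a small illustration, with the four sample code points chosen here purely as examples (an ASCII letter, an accented letter, the euro sign, and an emoji outside the Basic Multilingual Plane).

```powershell
$utf8  = [System.Text.Encoding]::UTF8
$utf16 = [System.Text.Encoding]::Unicode   # .NET's name for UTF-16LE

# Example code points only: U+0041, U+00E9, U+20AC, U+1F600.
$codePoints = 0x41, 0xE9, 0x20AC, 0x1F600

foreach ($cp in $codePoints) {
    # Build the string from the code point so the script file's own
    # encoding cannot affect the result.
    $s = [char]::ConvertFromUtf32($cp)
    'U+{0:X4}: UTF-8 = {1} bytes, UTF-16 = {2} bytes' -f $cp, $utf8.GetByteCount($s), $utf16.GetByteCount($s)
}
# U+0041: UTF-8 = 1 bytes, UTF-16 = 2 bytes
# U+00E9: UTF-8 = 2 bytes, UTF-16 = 2 bytes
# U+20AC: UTF-8 = 3 bytes, UTF-16 = 2 bytes
# U+1F600: UTF-8 = 4 bytes, UTF-16 = 4 bytes
```

For plain ASCII text, the second byte of every UTF-16LE character is 00, which is why a hex editor shows 00 between the characters.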
Certainly, the easiest way is to load the script into Notepad, then save it again with the UTF-8 encoding. It's an option in the Save As dialog box. Perhaps with iconv?
Without the UTF-8 part, the encoding of that text file (how you even tell what characters you have) is unspecified. UTF-8 specifies, down at the level of bits in a stream, what those bits mean and how they translate to characters. CSV is an approach to storing structured data in a plain text file. The two are independent things, but they can interact.
The Out-* cmdlets (like Out-File) format the data, and in Windows PowerShell their default encoding is Unicode (UTF-16LE).
You can add an -Encoding parameter to Out-File:
Get-Content $metadataPath$iFile | Out-File $cFile -Encoding UTF8 -append
or switch to Add-Content, which doesn't re-format the data:
Get-Content $metadataPath$iFile | Add-Content $cFile
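Putting this together, the whole concatenation could be written as one pipeline with the encoding stated on both the read and the write. This is a minimal sketch, not the original poster's function: the name Join-MetadataFiles and its parameters are hypothetical, it selects files by the wildcard whidata*.htm instead of counting upward until a number is missing, and it assumes the two-digit suffixes are zero-padded so that sorting by name gives the intended order.

```powershell
# Hypothetical replacement for concatenateMetadata.
function Join-MetadataFiles {
    param(
        [Parameter(Mandatory)] [string]$Path,      # folder holding the whidataXX.htm files
        [string]$OutputName = 'whiconcat.csv'      # output file name, as in the question
    )

    # Join-Path inserts the path separator, so $Path needs no trailing backslash.
    $cFile = Join-Path $Path $OutputName

    # Take every whidata*.htm file in name order, read each one as UTF-8,
    # and write all lines out as UTF-8, replacing any previous output file.
    Get-ChildItem -Path $Path -Filter 'whidata*.htm' |
        Sort-Object Name |
        Get-Content -Encoding UTF8 |
        Set-Content -Path $cFile -Encoding UTF8
}

# Usage, with the share from the question:
# Join-MetadataFiles -Path "\\ServerPath\foo"
```

In Windows PowerShell, -Encoding UTF8 writes a BOM at the start of the file, so a downstream reader may need to skip it.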