
Import-Csv / Export-Csv with German umlauts (ä, ö, ü)

I came across a little issue when dealing with CSV exports that contain mutated vowels like ä, ö, ü (German umlauts).

I simply export with:

Get-WinEvent -FilterHashtable @{Path=$_; ID=4627} -ErrorAction SilentlyContinue | Export-Csv -NoTypeInformation -Encoding Default -Force ("c:\temp\CSV_temp\" + $_.basename + ".csv")

which works fine: the ä, ö, ü appear in my CSV file correctly.

After that I do a little sorting with:

Get-ChildItem 'C:\temp\*.csv' |
    ForEach-Object { Import-Csv $_.FullName } |
    Sort-Object { [DateTime]::ParseExact($_.TimeCreated, $pattern, $culture) } |
    Export-Csv 'C:\temp\merged.csv' -Encoding Default -NoTypeInformation -Force

I played around with all the encodings (ASCII, BigEndianUnicode, the Unicode variants) with no success.

How can I preserve the special characters ä, ö, ü (and others) when exporting and sorting?

asked Feb 23 '18 11:02 by Peter Core



1 Answer

Mathias R. Jessen provides the crucial pointer in a comment on the question:

It is the Import-Csv call, not Export-Csv, that is the cause of the problem in your case:

Like Export-Csv, Import-Csv needs to be passed -Encoding Default in order to properly process text files encoded with the system's active legacy "ANSI" code page, which is an 8-bit, single-byte character encoding such as Windows-1252.

In Windows PowerShell, even though the generic text-file processing Get-Content / Set-Content cmdlet pair defaults to Default encoding (as the name suggests), regrettably and surprisingly, Import-Csv and Export-Csv do not.

Note that, on reading, the Default encoding is only assumed if the input file has no BOM (byte-order mark, a.k.a. Unicode signature: a magic byte sequence at the start of a file that unambiguously identifies its encoding).
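As a quick sanity check of the BOM behavior (a small sketch; the file name t-bom.csv is made up here), a file written as UTF-8 *with* BOM is decoded correctly by Import-Csv even without -Encoding, in both PowerShell editions:

```powershell
# Write a CSV file as UTF-8 *with* BOM; thanks to the BOM, Import-Csv
# identifies the encoding by itself, no -Encoding argument needed.
$path = Join-Path ([IO.Path]::GetTempPath()) 't-bom.csv'
$lines = 'Column1', 'aäöü'
[IO.File]::WriteAllLines($path, $lines, [Text.UTF8Encoding]::new($true))

(Import-Csv $path).Column1   # -> aäöü
```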

Not only do Import-Csv and Export-Csv have defaults that differ from Get-Content / Set-Content, they individually have different defaults:

  • Import-Csv defaults to UTF-8.
  • Export-Csv defaults to ASCII(!), which means that any non-ASCII characters, such as ä, ö, ü, are transliterated to literal ? characters, resulting in loss of data.

By contrast, in PowerShell Core, the cross-platform edition built on .NET Core, the default encoding is (BOM-less) UTF-8, consistently, across all cmdlets, which greatly simplifies matters and makes it much easier to determine when you do need to use the -Encoding parameter.
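Applied to the pipeline from the question, the fix is to pass the same -Encoding to Import-Csv that Export-Csv used when the files were written. The following is a self-contained sketch: the sample data, the timestamp format assigned to $pattern, and the culture assigned to $culture are assumptions standing in for the OP's actual event-log exports, and because the encoding name 'Default' only exists in Windows PowerShell, the sketch picks it dynamically and falls back to UTF-8 elsewhere.

```powershell
# Pick the encoding name per edition: 'Default' (ANSI) in Windows PowerShell,
# 'utf8' in PowerShell (Core), where 'Default' is not a valid name.
$enc = if ($PSVersionTable.PSEdition -eq 'Desktop') { 'Default' } else { 'utf8' }

$dir = Join-Path ([IO.Path]::GetTempPath()) 'csv-merge-demo'
$null = New-Item -ItemType Directory -Path $dir -Force

# Two sample per-file exports, standing in for the Get-WinEvent output.
@'
"TimeCreated","Message"
"23.02.2018 11:02:00","zweiter Eintrag mit ä,ö,ü"
'@ | Set-Content (Join-Path $dir 'a.csv') -Encoding $enc

@'
"TimeCreated","Message"
"22.02.2018 09:00:00","erster Eintrag"
'@ | Set-Content (Join-Path $dir 'b.csv') -Encoding $enc

$pattern = 'dd.MM.yyyy HH:mm:ss'            # assumed timestamp format
$culture = [cultureinfo]::InvariantCulture  # assumed culture

Get-ChildItem (Join-Path $dir '*.csv') |
    Where-Object { $_.Name -ne 'merged.csv' } |
    ForEach-Object { Import-Csv $_.FullName -Encoding $enc } |   # <- the fix
    Sort-Object { [DateTime]::ParseExact($_.TimeCreated, $pattern, $culture) } |
    Export-Csv (Join-Path $dir 'merged.csv') -Encoding $enc -NoTypeInformation -Force
```

With matching encodings on both ends, the merged file comes out sorted by timestamp with the umlauts intact.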


Demonstration of the Windows PowerShell Import-Csv / Export-Csv behavior

Import-Csv - defaults to UTF-8:

# Sample CSV content.
$str = @'
Column1
aäöü
'@

# Write sample CSV file 't.csv' using UTF-8 encoding *without a BOM*
# (Note that this cannot be done with standard PowerShell cmdlets.)
$null = New-Item -ItemType File t.csv -Force
[IO.File]::WriteAllLines((Convert-Path t.csv), $str)

# Use Import-Csv to read the file, which correctly preserves the UTF-8-encoded
# umlauts
Import-Csv .\t.csv

The above yields:

Column1
-------
aäöü

As you can see, the umlauts were correctly preserved.

By contrast, had the file been "ANSI"-encoded ($str | Set-Content t.csv; -Encoding Default implied), the umlauts would have gotten corrupted.


Export-Csv - defaults to ASCII - risk of data loss:

Building on the above example:

Import-Csv .\t.csv | Export-Csv .\t.new.csv
Get-Content .\t.new.csv

yields:

"Column1"
"a???"

As you can see, the umlauts were replaced by literal question marks (?).
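The remedy is to be explicit about -Encoding on Export-Csv too. A minimal self-contained sketch (the temp-file paths are made up), using UTF-8 for the whole round trip:

```powershell
# Re-exporting with an explicit -Encoding preserves the umlauts.
$src = Join-Path ([IO.Path]::GetTempPath()) 't.csv'
$dst = Join-Path ([IO.Path]::GetTempPath()) 't.new.csv'

[IO.File]::WriteAllLines($src, @('Column1', 'aäöü'))  # BOM-less UTF-8

Import-Csv $src -Encoding UTF8 |
    Export-Csv $dst -Encoding UTF8 -NoTypeInformation

Get-Content $dst -Encoding UTF8
# "Column1"
# "aäöü"
```

On Windows PowerShell, -Encoding Default works equally well here, as long as every consumer of the output file expects the system's ANSI code page.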

answered Oct 11 '22 20:10 by mklement0