Read UTF-8 files correctly with PowerShell

I have the following situation:

  • A PowerShell script creates a file with UTF-8 encoding
  • The user may or may not edit the file. An edit might drop the BOM or change the line separators, but should keep the encoding as UTF-8
  • The same PowerShell script reads the file, adds some more content and writes it all as UTF-8 back to the same file
  • This can be iterated many times

With Get-Content and Out-File -Encoding UTF8 I have problems reading the file correctly. The script stumbles over the BOM it wrote earlier (putting it into the content, which breaks my parsing regex), does not use UTF-8 encoding, and even deletes line breaks in the original content.

I need a function that can read any file with UTF-8 encoding, ignore and delete the BOM and not modify the content. What should I use?

Update

I have added a little test script that shows what I'm trying to do and what happens instead.

# Read data if exists
$data = ""
$startRev = 1;
if (Test-Path test.txt)
{
    $data = Get-Content -Path test.txt
    if ($data -match "^[0-9-]{10} - r([0-9]+)")
    {
        $startRev = [int]$matches[1] + 1
    }
}
Write-Host Next revision is $startRev

# Define example data to add
$startRev = $startRev + 10
$newMsgs = "2014-04-01 - r" + $startRev + "`r`n`r`n" + `
    "Line 1`r`n" + `
    "Line 2`r`n`r`n"

# Write new data back
$data = $newMsgs + $data
$data | Out-File test.txt -Encoding UTF8

After running it a few times, new sections should be added to the beginning of the file, the existing content should not be altered in any way (currently loses line breaks) and no additional new lines should be added at the end of the file (seems to happen sometimes).

Instead, the second run gives me an error.

asked Apr 01 '14 14:04 by ygoe


2 Answers

If the file is supposed to be UTF-8, why not read it with UTF-8 decoding:

Get-Content -Path test.txt -Encoding UTF8
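
If the parsing regex needs the file as one string with its original line breaks, a minimal sketch along these lines may help. It assumes PowerShell 3.0 or later, where Get-Content has a -Raw switch. The function name Get-NextRevision is made up for this illustration, and the (?m) regex option is added so that ^ matches at the start of each line.

```powershell
# Hypothetical helper; the name Get-NextRevision is invented for this sketch.
function Get-NextRevision([string]$Path) {
    if (-not (Test-Path $Path)) { return 1 }
    # -Raw (PowerShell 3.0+) returns one string with the original line breaks;
    # -Encoding UTF8 decodes the bytes as UTF-8 with or without a BOM.
    $text = Get-Content -Path $Path -Raw -Encoding UTF8
    # (?m) lets ^ match at the start of any line, not only at the start of the string
    if ($text -match '(?m)^[0-9-]{10} - r([0-9]+)') {
        return [int]$Matches[1] + 1
    }
    return 1
}

Write-Host "Next revision is $(Get-NextRevision test.txt)"
```

Because -match is applied to a single string here rather than to an array of lines, $Matches is populated when the pattern is found.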
answered Dec 06 '22 11:12 by JPBlanc


Really JPBlanc is right. If you want it read as UTF8 then specify that when the file is read.

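On a side note, you're losing formatting with the [String]+[String] concatenation: Get-Content returns an array of lines, and adding that array to a string joins the lines with spaces instead of line breaks. Your regex match doesn't work either, because -match against an array does not fill $matches. Check out the changes to the regex search and to $newMsgs, and the way I'm writing the data to the file.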

# Read data if exists
$data = ""
$startRev = 1;
if (Test-Path test.txt)
{
    $data = Get-Content -Path test.txt -Encoding UTF8
    if($data -match "\br([0-9]+)\b"){
        $startRev = [int]([regex]::Match($data,"\br([0-9]+)\b")).groups[1].value + 1
    }
}
Write-Host Next revision is $startRev

# Define example data to add
$startRev = $startRev + 10
# The here-string already contains real line breaks, so no `r`n escapes or indentation are needed
$newMsgs = @"
2014-04-01 - r$startRev

Line 1
Line 2

"@

# Write new data back
$newmsgs,$data | Out-File test.txt -Encoding UTF8
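
For a round trip that leaves the existing content untouched, the .NET file APIs are an alternative to Get-Content and Out-File. The sketch below is an illustration under stated assumptions, not part of the answer above: the function name Add-RevisionEntry is hypothetical, and it writes the file without a BOM. ReadAllText skips a leading BOM if one is present and returns the text with its line breaks intact; WriteAllText writes exactly the string it is given, with no newline appended.

```powershell
# Hypothetical helper (name invented for this sketch): prepend text to a UTF-8 file.
function Add-RevisionEntry([string]$Path, [string]$NewText) {
    # .NET resolves relative paths against the process working directory, which can
    # differ from PowerShell's current location, so build a full path first.
    $fullPath = [System.IO.Path]::Combine((Get-Location).ProviderPath, $Path)
    $utf8 = New-Object System.Text.UTF8Encoding $false   # $false = write no BOM
    $old = ''
    if ([System.IO.File]::Exists($fullPath)) {
        # Decodes as UTF-8, skips a leading BOM, keeps line breaks unchanged
        $old = [System.IO.File]::ReadAllText($fullPath, $utf8)
    }
    # Writes the string as-is; nothing is appended at the end
    [System.IO.File]::WriteAllText($fullPath, $NewText + $old, $utf8)
}

Add-RevisionEntry test.txt "2014-04-01 - r11`r`n`r`nLine 1`r`nLine 2`r`n`r`n"
```

Passing $true to the UTF8Encoding constructor instead would write a BOM on every save, if other tools expect one.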
answered Dec 06 '22 13:12 by TheMadTechnician