Using PowerShell to replace an extended ASCII character in a text file

I need to replace the hex 93 character with a "" string in several CSV files. Below is the code that I'm using, but it does not work. I think the reason is that the hex value is greater than 7F (decimal 127). I've tried several other methods, to no avail. Any help would be appreciated.

$q1 = [String](0x93 -as [char])
Get-ChildItem ".\*.csv" -Recurse | ForEach {
(Get-Content $_ | ForEach  { $_.replace($q1, '""') }) |
     Set-Content $_
}

Note: Attached is an image of the Format-Hex dump of my test file. The first character is the one that I need to replace: [image: Format-Hex dump of the test file]

HockChai Lim asked Sep 04 '18 21:09



1 Answer

In Windows PowerShell, the default character encoding when reading from / writing to files[1] is "ANSI", i.e., the legacy 8-bit code page implied by the active system locale.
(By contrast, PowerShell Core defaults to UTF-8.)

For instance, the code page associated with the system locale on a US-English system is 1252, i.e., Windows-1252, where code point 0x93 is the non-ASCII “ quotation mark (LEFT DOUBLE QUOTATION MARK).

However, once a text file's content has been read into memory, its characters are represented as UTF-16LE code units, i.e., as .NET [string] instances.

As a Unicode character, “ has code point U+201C, which UTF-16 represents as the single code unit 0x201c.

Therefore - because in memory all strings are UTF-16LE code units - what you need to replace is [char] 0x201c:

$q1 = [char] 0x201c  # “
Get-ChildItem *.csv -Recurse | ForEach-Object {
  (Get-Content $_.FullName) -replace $q1, '""' | Set-Content $_.FullName
}

Note that Set-Content too uses the default character encoding, so the rewritten files will use "ANSI" encoding too - use the -Encoding parameter to change the output encoding, if desired.
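
If you would rather not depend on either edition's default encoding, one option is to name both encodings explicitly. The following is only a sketch: the function name Convert-SmartQuote is hypothetical, and it assumes the input files really are Windows-1252 and that BOM-less UTF-8 output is wanted. It uses the .NET file APIs directly, so it avoids the cmdlets' edition-specific -Encoding values.

```powershell
# Hypothetical helper (not part of the original answer): read a file as
# Windows-1252, replace U+201C with "", and write the result back as UTF-8.
function Convert-SmartQuote {
  param([Parameter(Mandatory)] [string] $Path)

  # Assumption: the source files are Windows-1252 encoded.
  $inputEncoding = [Text.Encoding]::GetEncoding(1252)
  # Assumption: UTF-8 without a BOM is the desired output encoding.
  $outputEncoding = New-Object System.Text.UTF8Encoding $false

  # .NET methods resolve relative paths against the process's working
  # directory, not PowerShell's current location, so pass a full path.
  $fullPath = (Resolve-Path -LiteralPath $Path).ProviderPath

  # Byte 0x93 decodes to U+201C here, which is then replaced in memory.
  $text = [IO.File]::ReadAllText($fullPath, $inputEncoding)
  $text = $text.Replace([string] [char] 0x201c, '""')
  [IO.File]::WriteAllText($fullPath, $text, $outputEncoding)
}
```

It could then be applied to each file with Get-ChildItem *.csv -Recurse | ForEach-Object { Convert-SmartQuote $_.FullName }.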

Also note the (...) around the Get-Content call, which ensures that the input file is read into memory in full up front, which enables writing back to the same file in the same pipeline.
While this approach is convenient, note that it bears a slight risk of data loss if writing back to the input file is interrupted before completion.
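
If that risk matters, a possible variation is to write to a temporary file first and overwrite the original only once the write has completed. This is a minimal sketch rather than a hardened implementation: the function name Update-CsvQuote and the ".tmp" suffix are assumptions, and it relies on the default encoding for both reading and writing, as the command above does.

```powershell
# Hypothetical helper (not part of the original answer): replace U+201C in
# one file via a temporary file, so that an interrupted write cannot
# truncate the original.
function Update-CsvQuote {
  param([Parameter(Mandatory)] [string] $Path)

  $q1 = [char] 0x201c
  $fullPath = (Resolve-Path -LiteralPath $Path).ProviderPath
  $tempPath = "$fullPath.tmp"   # assumed name; should not already exist

  # Input and output are different files, so the lines can be streamed
  # and the (...) around Get-Content is not needed.
  Get-Content -LiteralPath $fullPath |
    ForEach-Object { $_ -replace $q1, '""' } |
    Set-Content -LiteralPath $tempPath

  # If there was no input (e.g., an empty source file), the temp file may
  # not have been created; in that case the original is left untouched.
  if (Test-Path -LiteralPath $tempPath) {
    Move-Item -LiteralPath $tempPath -Destination $fullPath -Force
  }
}
```

Usage would mirror the command above: Get-ChildItem *.csv -Recurse | ForEach-Object { Update-CsvQuote $_.FullName }.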


Converting an "ANSI" code point to a Unicode code point

The following shows how an "ANSI" (8-bit) code point such as 0x93 can be converted to its equivalent UTF-16 code point, 0x201c:

# Convert an array of "ANSI" code points (1 byte each) to the UTF-16
# string they represent. 
# Note: In Windows PowerShell, [Text.Encoding]::Default contains
#       the "ANSI" encoding set by the system locale.
$str = [Text.Encoding]::Default.GetString([byte[]] 0x93) # -> '“'

# Get the UTF-16 code points of the characters making up the string.
$codePoints = [int[]] [char[]] $str

# Format the first and only code point as a hex. number.
'0x{0:x}' -f $codePoints[0]  # -> '0x201c'

[1] Writing files with Set-Content, that is; using Out-File / >, by contrast, creates UTF-16LE ("Unicode") files. The cmdlets in Windows PowerShell display a bewildering array of differing encodings: see this answer. Fortunately, PowerShell Core now consistently defaults to (BOM-less) UTF-8.

mklement0 answered Sep 22 '22 13:09
