Powershell find non-ASCII characters in text file

I am trying to find a way to do the following with a PowerShell script:

  1. For each line in text file, check if line contains non-ASCII characters
  2. If line contains non-ASCII characters, output to separate file
  3. If line does not contain non-ASCII characters, skip to next line

By non-ASCII characters, I mean characters that are not on a standard keyboard, e.g. accented characters or characters from other scripts.

Sample Data

 - 张伟
 - குழந்தைகளுக்கான பெயர்கள்
 - 日本人の氏名
 - Full Name
 - Léna Rémi

Output Data

 - 张伟
 - குழந்தைகளுக்கான பெயர்கள்
 - 日本人の氏名
 - Léna Rémi

I found regexes in other threads for removing non-ASCII characters, but I couldn't get them to work.

Please help!

** EDIT ** Thanks everyone for the help! I have managed to do what I wanted with the below script.

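# Matches any character outside the ASCII range 0x00-0x7F
$nonASCII = "[^\x00-\x7F]"
# $source and $output hold the input and output file paths
foreach ($line in [System.IO.File]::ReadLines($source)) {
    if ($line -cmatch $nonASCII) {
        # Append each matching line to the output file
        $line | Out-File $output -Append
    }
}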
Arolix asked May 14 '20 08:05


2 Answers

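Define a character class that covers the ASCII characters from space through DEL (code points 32 through 127 == [\x20-\x7F]), then negate it with ^ to match anything outside it. Note that full ASCII spans code points 0 through 127, so this pattern also flags tabs and other control characters; use [^\x00-\x7F] if you want to match strictly non-ASCII characters.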

Let's test it against my (non-ASCII) name:

PS C:\> 'Mathias R. Jessen' -cmatch '[^\x20-\x7F]'
False
PS C:\> 'Mathias Rørbo Jessen' -cmatch '[^\x20-\x7F]'
True
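
As a minimal sketch of how the lower bound of the range changes the result, the snippet below compares two patterns on a tab-separated ASCII string and on an accented name. The variable names are illustrative only, and the accented character is built from its code point so the example does not depend on how the script file is encoded.

```powershell
# Two candidate patterns (names are hypothetical, for illustration only).
$strictNonAscii  = '[^\x00-\x7F]'   # anything above code point 127
$nonPrintableToo = '[^\x20-\x7F]'   # also matches control characters below 0x20

$eAcute   = [char]0xE9                        # U+00E9, "e" with acute accent
$tabbed   = "Full`tName"                      # plain ASCII, but contains a tab (0x09)
$accented = "L${eAcute}na R${eAcute}mi"

$tabbed   -cmatch $strictNonAscii    # False
$tabbed   -cmatch $nonPrintableToo   # True
$accented -cmatch $strictNonAscii    # True
$accented -cmatch $nonPrintableToo   # True
```

Which pattern is appropriate depends on whether lines containing tabs or other control characters should count as matches.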

To filter a list of strings, simply use the -cmatch operator in filter mode:

$strings = 'குழந்தைகளுக்கான பெயர்கள்', 'Boring John Doe', 'Léna Rémi'

$nonASCIIstrings = @($strings) -cmatch '[^\x20-\x7F]'

Or if you want to filter along a pipeline, use Where-Object:

$strings |Where-Object {$_ -cmatch '[^\x20-\x7F]'}
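
The same pipeline can be pointed at a file. The following is only a sketch under a few assumptions: the file names are hypothetical, the input is UTF-8, and the strict pattern [^\x00-\x7F] is used so that tabs are not treated as non-ASCII. It writes its own small sample file so it can be run as-is.

```powershell
# Hypothetical paths in the temp directory; replace with your own files.
$source = Join-Path ([System.IO.Path]::GetTempPath()) 'names.txt'
$output = Join-Path ([System.IO.Path]::GetTempPath()) 'names-nonascii.txt'

# Write a small UTF-8 sample so the demo is self-contained.
$eAcute = [char]0xE9
$sample = 'Full Name', "L${eAcute}na R${eAcute}mi", 'Boring John Doe'
Set-Content -LiteralPath $source -Value $sample -Encoding UTF8

# Keep only the lines with at least one character outside 0x00-0x7F.
Get-Content -LiteralPath $source -Encoding UTF8 |
    Where-Object { $_ -cmatch '[^\x00-\x7F]' } |
    Set-Content -LiteralPath $output -Encoding UTF8

# Read the result back to inspect it.
$kept = @(Get-Content -LiteralPath $output -Encoding UTF8)
$kept
```

Passing -Encoding explicitly on both the read and the write avoids relying on defaults, which differ between Windows PowerShell and PowerShell 7.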
Mathias R. Jessen answered Oct 21 '22 02:10


Here's a script I use to remove non-ASCII characters from an XML file. Maybe you can use it as a starting point. It removes every character that is neither between space and tilde in the ASCII table nor a tab. To me, ASCII is the range 0-127. Get-Content strips the carriage returns and line feeds when it splits the file into lines, so they need no special handling.

(get-content $args[0]) -replace '[^ -~\t]' | set-content $args[0]
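
To see what that pattern does without touching a file, here is a small sketch applied to a single string. The variable names are hypothetical, and the accented character is built from its code point so the result does not depend on the script file's encoding.

```powershell
$eAcute   = [char]0xE9                         # U+00E9, "e" with acute accent
$original = "L${eAcute}na`tR${eAcute}mi"       # accented name with a tab in the middle

# With no replacement argument, -replace deletes every match:
# here, anything that is not in the range space..tilde and is not a tab.
$cleaned = $original -replace '[^ -~\t]'
$cleaned                                       # Lna<TAB>Rmi
```

The accented characters are dropped rather than transliterated, so this is a lossy cleanup, not a way to find the affected lines.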
js2010 answered Oct 21 '22 04:10