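I am trying to find a way, using a PowerShell script, to do the following: for each line in a text file, check whether the line contains non-ASCII characters; if it does, write the line to a separate file; if it does not, skip to the next line.
By non-ASCII characters, I mean non-keyboard characters, e.g. accented characters, characters from another language, etc.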
Sample Data
- 张伟
- குழந்தைகளுக்கான பெயர்கள்
- 日本人の氏名
- Full Name
- Léna Rémi
Output Data
- 张伟
- குழந்தைகளுக்கான பெயர்கள்
- 日本人の氏名
- Léna Rémi
I found the regex in other threads to remove non-ASCII characters but I couldn't seem to make it work.
Please help!
** EDIT ** Thanks everyone for the help! I have managed to do what I wanted with the below script.
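# Matches any character outside the 7-bit ASCII range (U+0000-U+007F)
$nonASCII = "[^\x00-\x7F]"
# $source and $output hold the input and output file paths
foreach ($line in [System.IO.File]::ReadLines($source)) {
    if ($line -cmatch $nonASCII) {
        $line | Out-File $output -Append
    }
}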
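A pipeline version of the same idea might look like the sketch below. It is not the script above: the temp-file paths and sample lines are made up for illustration, the accented name is built from a code point so the example does not depend on how the script file itself is saved, and -Encoding UTF8 is an assumption about the input file.

```powershell
# Hypothetical temp paths, for illustration only.
$source = Join-Path ([System.IO.Path]::GetTempPath()) 'names-in.txt'
$output = Join-Path ([System.IO.Path]::GetTempPath()) 'names-out.txt'

# Build the sample from a code point so the script's own encoding does not matter.
$eAcute = [char]0x00E9
$sample = @('Full Name', "L${eAcute}na R${eAcute}mi", 'John Doe')
Set-Content -Path $source -Value $sample -Encoding UTF8

# Keep only lines with at least one character outside U+0000-U+007F.
Get-Content -Path $source -Encoding UTF8 |
    Where-Object { $_ -cmatch '[^\x00-\x7F]' } |
    Set-Content -Path $output -Encoding UTF8

# Read the result back for inspection.
$kept = @(Get-Content -Path $output -Encoding UTF8)
```

Writing the output once with Set-Content, rather than appending per line, also avoids leftover lines from a previous run.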
To do that, we simply specify the target character (an asterisk), followed by the replacement text (the @ sign). The only tricky part is that the asterisk is a reserved character in the regular expressions that -replace uses; because of that, we need to “escape” the character before we can perform a search-and-replace operation using that character.
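As a rough illustration of that escaping step, here are three ways to replace a literal asterisk. The sample string is hypothetical.

```powershell
$text = 'rate*hours*bonus'   # hypothetical sample string

# Backslash-escape the asterisk so the regex engine treats it literally.
$viaRegex = $text -replace '\*', '@'

# [regex]::Escape builds the escaped pattern for you.
$viaEscape = $text -replace [regex]::Escape('*'), '@'

# The .NET String.Replace method does no regex parsing at all.
$viaMethod = $text.Replace('*', '@')
```

All three produce rate@hours@bonus; the method call is the simplest when no regex features are needed.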
1. Ctrl-F ( View -> Find )
3. Select search mode as 'Regular expression'
4. Voilà! This will help you track down or replace all non-ASCII characters in a text file.
What you need is to create a simple string to feed to -replace; this is a bit more involved when you are dealing with non-printable characters. You had the right idea; you just have to tell PowerShell to interpolate the code expressions within your string using the $() notation on each expression.
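A minimal sketch of that kind of interpolation is shown here. The goal (stripping control characters U+0001 through U+001F) and all variable names are assumptions chosen for illustration, not taken from the original thread.

```powershell
# Hypothetical goal: strip control characters U+0001 through U+001F.
$first = 0x01
$last  = 0x1F

# $() evaluates each expression and splices the result into the string,
# giving a character class that spans the two literal characters.
$pattern = "[$([char]$first)-$([char]$last)]"

# Sample input containing a bell (0x07) and a tab (0x09).
$dirty = "bell$([char]7)and`ttab"
$clean = $dirty -replace $pattern, ''
```

Here $clean ends up as bellandtab, since both the bell and the tab fall inside the range.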
Define a character set that describes the ASCII characters from space upward (code points 32 through 127 == [\x20-\x7F]), then negate it with ^ to match any non-ASCII character (control characters below 32, such as tab, will match too)!
Let's test it against my (non-ASCII) name:
PS C:\> 'Mathias R. Jessen' -cmatch '[^\x20-\x7F]'
False
PS C:\> 'Mathias Rørbo Jessen' -cmatch '[^\x20-\x7F]'
True
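The pattern in the question's edit starts at \x00, while the one used here starts at \x20. The two differ on ASCII control characters such as tab, which the following sketch checks. The sample strings are made up, and the accented letter is built from a code point so the snippet does not depend on the script file's encoding.

```powershell
$tabbed   = "plain`ttext"            # ASCII only, but contains a tab (0x09)
$accented = "L$([char]0xE9)na"       # contains U+00E9

$fromSpace = '[^\x20-\x7F]'          # range starting at space
$fullAscii = '[^\x00-\x7F]'          # full 7-bit range

$tabFromSpace = $tabbed -cmatch $fromSpace      # True: tab is below 0x20
$tabFullAscii = $tabbed -cmatch $fullAscii      # False: tab is within 0x00-0x7F
$accFromSpace = $accented -cmatch $fromSpace    # True
$accFullAscii = $accented -cmatch $fullAscii    # True
```

If the input may contain tabs, the \x00-\x7F form avoids flagging lines that are otherwise plain ASCII.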
To filter a list of strings, simply use the -cmatch operator in filter mode:
$strings = 'குழந்தைகளுக்கான பெயர்கள்', 'Boring John Doe', 'Léna Rémi'
$nonASCIIstrings = @($strings) -cmatch '[^\x20-\x7F]'
Or if you want to filter along a pipeline, use Where-Object:
$strings |Where-Object {$_ -cmatch '[^\x20-\x7F]'}
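If the ASCII-only strings are wanted as well, the same filter mode works with -cnotmatch. This is a small sketch with made-up sample strings, using the \x00-\x7F range as an assumption about what should count as ASCII.

```powershell
# Sample list; the accented name is built from a code point.
$strings = "L$([char]0xE9)na R$([char]0xE9)mi", 'Boring John Doe', 'Full Name'
$pattern = '[^\x00-\x7F]'

# With an array on the left, both operators return the elements that qualify.
$nonAscii  = @($strings) -cmatch $pattern
$asciiOnly = @($strings) -cnotmatch $pattern
```

$nonAscii holds the one accented name, and $asciiOnly holds the other two strings.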
Here's a script I have to remove non-ASCII characters from an XML file. Maybe you can use it as a starting point. I'm removing characters that are not between space and tilde in the ASCII table, and also not tab. To me, ASCII is in the range 0-127. Get-Content splits the input into lines, so the carriage returns and linefeeds never reach the regex.
(get-content $args[0]) -replace '[^ -~\t]' | set-content $args[0]
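To see what that pattern does without touching a file, here is a sketch on a single made-up string. Note that this removes the offending characters rather than selecting lines.

```powershell
$eAcute = [char]0x00E9
$line = "L${eAcute}na`tR${eAcute}mi ~ok"   # accented letters, a tab, a tilde

# With no replacement operand, -replace deletes every match.
# ' -~' is the range from space to tilde; \t keeps tabs as well.
$stripped = $line -replace '[^ -~\t]'
```

The accented letters are dropped while the tab, space, and tilde survive. In the one-liner above, the parentheses around Get-Content make PowerShell read the whole file before Set-Content opens it for writing, which is why reading and writing the same path works.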