I have a bunch of Arabic, English, Russian files which are encoded in utf-8. Trying to process these files using a Perl script, I get this error:
Malformed UTF-8 character (fatal)
Manually checking the content of these files, I found some strange characters in them. Now I'm looking for a way to automatically remove these characters from the files.
Is there anyway to do it?
To automatically find and delete non-UTF-8 characters, we're going to use the iconv command. It is used in Linux systems to convert text from one character encoding to another.
A UTF-8 code unit is 8 bits. If by char you mean an 8-bit byte, then the invalid UTF-8 code units would be char values that do not appear in UTF-8 encoded text. Follow this answer to receive notifications. answered Oct 2, 2019 at 23:30. Tom Blodget.
UTF-8 is simply one possible encoding for text. UTF-8 is Unicode and every character can be converted to Unicode hence to remove all UTF-8 characters will basically remove all characters.
This command:
iconv -f utf-8 -t utf-8 -c file.txt
will clean up your UTF-8 file, skipping all the invalid characters.
-f is the source format -t the target format -c skips any invalid sequence
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With