How to remove non-UTF-8 characters from a text file

I have a bunch of Arabic, English, and Russian files which are encoded in UTF-8. When I try to process these files with a Perl script, I get this error:

Malformed UTF-8 character (fatal) 

Manually checking the content of these files, I found some strange characters in them. Now I'm looking for a way to automatically remove these characters from the files.

Is there any way to do this?

asked Oct 21 '12 by Hakim


People also ask

How do I remove a non-UTF-8 character from a text file in Linux?

To automatically find and delete non-UTF-8 characters, we're going to use the iconv command. It is used in Linux systems to convert text from one character encoding to another.
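For example, converting between two encodings looks like this (a minimal sketch; the file names and the Windows-1256 source encoding are just placeholders, and the exact encoding names available vary between systems):

# List the encodings this iconv installation knows about
iconv -l

# Convert a file from one encoding to another
iconv -f WINDOWS-1256 -t UTF-8 input.txt > output.txt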

What is a non-UTF-8 character?

A UTF-8 code unit is 8 bits (one byte). If by "character" you mean an 8-bit byte, then the non-UTF-8 characters are byte values, or byte sequences, that cannot appear in valid UTF-8-encoded text.
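One common way to locate such bytes (a sketch, assuming GNU grep and a UTF-8 locale) is to ask grep for the lines that are not made up entirely of valid characters:

# Print lines containing bytes that are not valid UTF-8
# -a: treat the file as text, -x: match whole lines only, -v: invert the match
grep -axv '.*' file.txt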

How do I get rid of UTF-8?

UTF-8 is simply one possible encoding for text. Since UTF-8 can represent every Unicode character, removing all UTF-8 characters would remove all of the text; what you usually want instead is to remove only the byte sequences that are not valid UTF-8.


1 Answer

This command:

iconv -f utf-8 -t utf-8 -c file.txt 

will clean up your UTF-8 file, skipping all the invalid characters.

-f is the source format
-t is the target format
-c skips any invalid sequence
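
Note that iconv writes the cleaned text to standard output, so you normally redirect it to a new file and replace the original afterwards. A minimal sketch (file names are placeholders):

# Clean one file, writing the result to a temporary name first
iconv -f utf-8 -t utf-8 -c file.txt > file_clean.txt && mv file_clean.txt file.txt

# Or clean every .txt file in the current directory
for f in *.txt; do
    iconv -f utf-8 -t utf-8 -c "$f" > "$f.clean" && mv "$f.clean" "$f"
done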
answered Oct 16 '22 by Palantir