
Delete every non-UTF-8 symbol from a string

I have a large number of files and a parser. What I have to do is strip all non-UTF-8 symbols and put the data in MongoDB. Currently I have code like this:

with open(fname, "r") as fp:
    for line in fp:
        line = line.strip()
        line = line.decode('utf-8', 'ignore')
        line = line.encode('utf-8', 'ignore')

Somehow I still get an error:

bson.errors.InvalidStringData: strings in documents must be valid UTF-8:  1/b62010montecassianomcir\xe2\x86\x90ta0\xe2\x86\x90008923304320733/290066010401040101506055soccorin 

I don't get it. Is there some simple way to do it?

UPD: It seems that Python and Mongo don't agree on the definition of a valid UTF-8 string.

Darth Kotik asked Oct 24 '14 05:10



2 Answers

Try the line below in place of the last two lines:

line = line.decode('utf-8', 'ignore').encode('utf-8')
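
As a rough illustration of what this decode/encode round trip does, here is a small Python 3 sketch that applies it to a bytes value. The sample bytes are made up for the example and are not taken from the question's data; the bytes 0xFF and 0xFE can never appear in valid UTF-8, while \xe2\x86\x90 is a valid three-byte sequence.

```python
# Hypothetical sample: valid UTF-8 text with two invalid bytes (0xFF, 0xFE) mixed in.
raw = b"monte\xffcassiano \xe2\x86\x90 ok\xfe"

# Decode while silently dropping undecodable bytes, then encode back to UTF-8.
clean = raw.decode("utf-8", "ignore").encode("utf-8")

print(clean)  # b'montecassiano \xe2\x86\x90 ok'
```

Only the two invalid bytes are removed; the valid multi-byte sequence is kept as it is.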
Irshad Bhat answered Oct 13 '22 09:10


For Python 3, as mentioned in a comment in this thread, you can do:

line = bytes(line, 'utf-8').decode('utf-8', 'ignore') 

The 'ignore' error handler drops any bytes that cannot be decoded instead of raising a UnicodeDecodeError.

If your line is already a bytes object (e.g. b'my string') then you just need to decode it with decode('utf-8', 'ignore').
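
For the file-reading loop in the question, one possible Python 3 sketch is to let open() do the lenient decoding. The function name read_clean_lines is hypothetical, and this only covers the decoding step; it does not claim to address whatever else BSON may reject.

```python
def read_clean_lines(path):
    """Yield stripped lines from path, dropping bytes that are not valid UTF-8."""
    # errors="ignore" applies the same handler as decode('utf-8', 'ignore').
    with open(path, "r", encoding="utf-8", errors="ignore") as fp:
        for line in fp:
            yield line.strip()
```

Each yielded value is a str, so it can be passed on without a separate decode call.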

AlexG answered Oct 13 '22 10:10