I have a large number of files and a parser. What I have to do is strip all non-UTF-8 symbols and put the data in MongoDB. Currently I have code like this.
    with open(fname, "r") as fp:
        for line in fp:
            line = line.strip()
            line = line.decode('utf-8', 'ignore')
            line = line.encode('utf-8', 'ignore')
Somehow I still get an error:
bson.errors.InvalidStringData: strings in documents must be valid UTF-8: 1/b62010montecassianomcir\xe2\x86\x90ta0\xe2\x86\x90008923304320733/290066010401040101506055soccorin
I don't get it. Is there some simple way to do it?
UPD: it seems Python and Mongo disagree about the definition of a valid UTF-8 string.
Go to File > Reopen with Encoding > UTF-8, then copy the entire content of the file into a new file and save it. This may not be the expected solution, but I'm putting it out here in case it helps anyone, since I struggled with this for hours.
0xC0, 0xC1, 0xF5, 0xF6, 0xF7, 0xF8, 0xF9, 0xFA, 0xFB, 0xFC, 0xFD, 0xFE, 0xFF are invalid UTF-8 code units. A UTF-8 code unit is 8 bits, so if by "char" you mean an 8-bit byte, then the invalid UTF-8 code units are exactly the byte values that never appear in UTF-8-encoded text.
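As a minimal Python 3 sketch (not from the original answer), those byte values could be filtered out directly. Note that removing them alone does not guarantee valid UTF-8, since well-formed bytes can still be arranged into malformed sequences:

    # Byte values that can never appear in valid UTF-8:
    # 0xC0 and 0xC1 (overlong-encoding lead bytes) and 0xF5-0xFF.
    NEVER_VALID = {0xC0, 0xC1} | set(range(0xF5, 0x100))

    def drop_never_valid_bytes(data):
        # Iterating a bytes object yields ints in Python 3.
        return bytes(b for b in data if b not in NEVER_VALID)

A round-trip through decode('utf-8', 'ignore') remains the more robust fix, since it also removes structurally invalid sequences built from otherwise-legal bytes.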
Try the line below instead of the last two lines. Hope it helps:
    line = line.decode('utf-8', 'ignore').encode('utf-8')
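For context, here is a minimal sketch of how that line fits into the question's loop, assuming Python 2 (where str has a decode method) as in the original code:

    with open(fname, "r") as fp:
        for line in fp:
            # Decode, silently dropping any bytes that are not valid
            # UTF-8, then re-encode so MongoDB receives clean UTF-8.
            line = line.strip().decode('utf-8', 'ignore').encode('utf-8')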
For Python 3, as mentioned in a comment in this thread, you can do:
line = bytes(line, 'utf-8').decode('utf-8', 'ignore')
The 'ignore' parameter prevents an error from being raised if any characters are unable to be decoded.
If your line is already a bytes object (e.g. b'my string'), then you just need to decode it with decode('utf-8', 'ignore').
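As a minimal Python 3 sketch of the whole task (the fname value and the mydb.docs collection are illustrative assumptions, not from the thread), reading the file in binary mode and decoding with errors='ignore' sidesteps the problem at the source:

    from pymongo import MongoClient

    fname = "data.txt"                      # hypothetical input path
    docs = MongoClient().mydb.docs          # hypothetical db/collection

    with open(fname, "rb") as fp:           # binary mode: raw bytes
        for raw in fp:
            # errors='ignore' drops any bytes that are not valid UTF-8,
            # so the resulting str is always safe to store in BSON.
            line = raw.decode("utf-8", "ignore").strip()
            if line:
                docs.insert_one({"line": line})

Opening the file with open(fname, "r", encoding="utf-8", errors="ignore") achieves the same effect in text mode.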