How to open a file containing bytes that are not valid UTF-8?

I want to open a text file (.dat) in Python, and I get the following error: 'utf-8' codec can't decode byte 0x92 in position 4484: invalid start byte. The file is supposed to be encoded as UTF-8, so maybe there is some character that cannot be read. Is there a way to handle the problem without dealing with each weird character individually? I have a rather huge text file, and it would take me hours to find every character that is not valid UTF-8.

Here is my code

import codecs
f = codecs.open('compounds.dat', encoding='utf-8')
for line in f:
    if "InChI=1S/C11H8O3/c1-6-5-9(13)10-7(11(6)14)3-2-4-8(10)12/h2-5" in line:
        print(line)
f.close()

StudentOIST asked Jan 29 '23 09:01


2 Answers

It shouldn't "take you hours" to find the bad byte. The error tells you exactly where it is; it's at index 4484 in your input with a value of 0x92; if you did:

with open('compounds.dat', 'rb') as f:
    data = f.read()

the invalid byte would be at data[4484], and you can slice as you like to figure out what's around it.
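
As a rough sketch of that inspection, the helper below returns the raw bytes around a given offset. The name show_context and the window width are illustrative choices, not anything from the question, and the sample bytes stand in for the real file. Byte 0x92 is the right single quotation mark (’) in Windows-1252, so it may be worth checking whether the file is really cp1252 rather than UTF-8; that is a guess about this particular file, not a certainty.

```python
def show_context(data, pos, width=20):
    """Return the raw bytes surrounding offset pos (hypothetical helper)."""
    start = max(0, pos - width)
    return data[start:pos + width + 1]


# Stand-in for the file contents; for the real file you would use:
#   with open('compounds.dat', 'rb') as f:
#       data = f.read()
# and then call show_context(data, 4484)
data = b"It\x92s a test"

context = show_context(data, 2, width=5)
print(context)                   # raw bytes, with the bad byte shown as \x92
print(context.decode("cp1252"))  # assumes the text is Windows-1252
```

If the cp1252 decoding reads sensibly, opening the file with encoding='cp1252' instead of 'utf-8' may be a better fix than discarding bytes.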

In any event, if you just want to ignore or replace invalid bytes, that's what the errors parameter is for. Using io.open (because codecs.open is subtly broken in many ways, and io.open is both faster and more correct):

# If this is Py3, you don't even need the import, just use plain open which is
# an alias for io.open
import io

with io.open('compounds.dat', encoding='utf-8', errors='ignore') as f:
    for line in f:
        if u"InChI=1S/C11H8O3/c1-6-5-9(13)10-7(11(6)14)3-2-4-8(10)12/h2-5" in line:
            print(line)

will just ignore the invalid bytes (dropping them as if they never existed). You can also pass errors='replace' to insert a replacement character for each garbage byte, so you're not silently dropping data.
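
To make the difference concrete, here is a minimal sketch that decodes the same bytes with both error handlers. The sample bytes are made up for illustration; the errors argument behaves the same way in bytes.decode as it does in open.

```python
raw = b"don\x92t"  # 0x92 is not a valid UTF-8 start byte

ignored = raw.decode("utf-8", errors="ignore")
replaced = raw.decode("utf-8", errors="replace")

print(ignored)               # the bad byte is dropped entirely
print(replaced)              # U+FFFD marks where the bad byte was
print(replaced.count("\ufffd"))  # number of bytes that failed to decode
```

Counting U+FFFD after decoding with errors='replace' is one way to see how much of the file was affected before deciding whether dropping the bytes is acceptable.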

ShadowRanger answered Jan 31 '23 22:01


If you are working with a huge file, pass the encoding explicitly first, and if the error persists, pass errors="ignore" as well:

with open("filename", "r", encoding="utf-8", errors="ignore") as f:
    data = f.read()  # keep the decoded text; invalid bytes are silently dropped
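
For a very large file, reading everything with f.read() holds the whole text in memory. A possible alternative, sketched below, iterates line by line and yields only the lines that contain the search string. The function name matching_lines is hypothetical, and errors="replace" is used here instead of "ignore" so that damaged spots stay visible.

```python
def matching_lines(path, needle, encoding="utf-8"):
    """Yield lines containing needle, reading the file lazily (hypothetical helper)."""
    # errors="replace" substitutes U+FFFD for each undecodable byte
    with open(path, encoding=encoding, errors="replace") as f:
        for line in f:
            if needle in line:
                yield line.rstrip("\n")
```

With the file from the question, usage would look like: for line in matching_lines("compounds.dat", "InChI=1S/C11H8O3/"): print(line).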

Shilpa Shinde answered Jan 31 '23 22:01