How to open a file with utf-8 non encoded characters?

Question

I want to open a text file (.dat) in Python, but I get the following error: 'utf-8' codec can't decode byte 0x92 in position 4484: invalid start byte. The file is supposed to be encoded in UTF-8, so maybe there is some character that cannot be read. Is there a way to handle the problem without dealing with each weird character individually? I have a rather huge text file, and it would take me hours to find every character that is not valid UTF-8.

Here is my code

import codecs
f = codecs.open('compounds.dat', encoding='utf-8')
for line in f:
    if "InChI=1S/C11H8O3/c1-6-5-9(13)10-7(11(6)14)3-2-4-8(10)12/h2-5" in line:
        print(line)
f.close()

ShadowRanger · Accepted Answer

It shouldn't "take you hours" to find the bad byte. The error tells you exactly where it is; it's at index 4484 in your input with a value of 0x92; if you did:

with open('compounds.dat', 'rb') as f:
    data = f.read()

the invalid byte would be at data[4484], and you can slice as you like to figure out what's around it.
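If there turn out to be several bad bytes, the same idea can be automated. The following is only a sketch: the helper names find_invalid_utf8 and context, and the sample bytes, are made up for illustration and are not part of the question. It decodes repeatedly, records where each UnicodeDecodeError occurs, and resumes just past the offending byte. Slicing copies the remaining data on every failure, so for a very large file with many bad bytes you may want to process it in chunks instead.

```python
def find_invalid_utf8(data):
    """Return the offsets of the bytes at which a UTF-8 decode of `data` fails."""
    bad_offsets = []
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode('utf-8')
            break  # everything from pos onward decodes cleanly
        except UnicodeDecodeError as exc:
            # exc.start is relative to the slice, so shift it back
            bad_offsets.append(pos + exc.start)
            pos += exc.start + 1  # resume just past the offending byte
    return bad_offsets


def context(data, offset, width=10):
    """Return the bytes surrounding `offset` (hypothetical helper)."""
    return data[max(0, offset - width):offset + width + 1]


# Made-up sample: three stray bytes plus one valid multi-byte character.
sample = b"it\x92s ok \xe2\x80\x99 and \x93hi\x94"
offsets = find_invalid_utf8(sample)
print(offsets)
for off in offsets:
    print(off, context(sample, off, width=2))
```

With the real file you would pass the data read in binary mode above instead of sample.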

In any event, if you just want to ignore or replace invalid bytes, that's what the errors parameter is for. Using io.open (because codecs.open is subtly broken in many ways, and io.open is both faster and more correct):

# If this is Py3, you don't even need the import, just use plain open which is
# an alias for io.open
import io

with io.open('compounds.dat', encoding='utf-8', errors='ignore') as f:
    for line in f:
        if u"InChI=1S/C11H8O3/c1-6-5-9(13)10-7(11(6)14)3-2-4-8(10)12/h2-5" in line:
            print(line)

will just ignore the invalid bytes (dropping them as if they never existed). You can also pass errors='replace' to insert a replacement character for each garbage byte, so you're not silently dropping data.
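To see the difference between the two handlers, here is a small sketch on an in-memory byte string (the sample text is invented). The last line rests on an assumption the question does not confirm: in Windows-1252, byte 0x92 is the right single quotation mark, so a file that is mostly ASCII with stray 0x92 bytes may really be cp1252 rather than UTF-8. If that holds for your file, opening it with encoding='cp1252' would keep the character instead of dropping or replacing it.

```python
raw = b"it\x92s"  # made-up sample containing the same 0x92 byte

ignored = raw.decode('utf-8', errors='ignore')    # bad byte silently dropped
replaced = raw.decode('utf-8', errors='replace')  # bad byte becomes U+FFFD
as_cp1252 = raw.decode('cp1252')                  # assumption: file is Windows-1252

print(repr(ignored), repr(replaced), repr(as_cp1252))
```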

Shilpa Shinde · Answer

If you are working with a large file, it is better to specify the encoding explicitly; if the error persists, also pass errors="ignore":

with open("filename", "r", encoding="utf-8", errors="ignore") as f:
    data = f.read()  # keep the decoded text; invalid bytes are dropped

Tags: python, encoding, utf-8

StudentOIST