Read a file in Python having a rogue 0xc0 byte that causes UTF-8 and ASCII decoding to error out

Trying to read a tab-separated file into a pandas DataFrame:

>>> df = pd.read_table(fn, na_filter=False, error_bad_lines=False)

It errors out like so:

b'Skipping line 58: expected 11 fields, saw 12\n'
Traceback (most recent call last):
...(many lines)...
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc0 in position 115: invalid start byte

It seems the byte 0xc0 causes failures under both the utf-8 and ascii encodings.

>>> df = pd.read_table(fn, na_filter=False, error_bad_lines=False, encoding='ascii')
...(many lines)...
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc0 in position 115: ordinal not in range(128)

I ran into the same issue with the csv module's reader too.

If I import the file into OpenOffice Calc, it gets imported properly and the columns are correctly recognized, so the offending 0xc0 byte is presumably ignored there. It is not a vital piece of the data; it's probably just a fluke write error by the system that generated this file. I'd be happy to simply zap the line where this occurs if it comes to that; I just want to read the file into the Python program. The error_bad_lines=False option of pandas ought to have taken care of this problem, but no dice. Also, the file does NOT contain any content in non-English scripts that would make Unicode necessary; it's all standard English letters and numbers. I tried utf-16, utf-32, etc. too, but they only produced more errors of their own.

How do I make Python (a pandas DataFrame in particular) read a file containing one or more rogue 0xc0 bytes?
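Before choosing an encoding, it can help to find out exactly where the rogue bytes are. A minimal sketch (assuming `fn` holds the file path, as in the commands above; `find_non_ascii` is a hypothetical helper name) that scans the raw bytes for anything outside the ASCII range:

```python
# Scan a file's raw bytes and report every non-ASCII byte with its
# line number and column, so you can decide whether to re-encode,
# strip, or skip the affected lines.
def find_non_ascii(path):
    hits = []
    with open(path, "rb") as f:  # binary mode: no decoding, so no decode errors
        for lineno, line in enumerate(f, start=1):
            for col, byte in enumerate(line):
                if byte > 0x7F:  # outside 7-bit ASCII
                    hits.append((lineno, col, hex(byte)))
    return hits
```

Calling `find_non_ascii(fn)` returns a list of (line, column, byte) tuples, which makes it easy to inspect or zap just the offending lines.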

Asked Nov 28 '25 04:11 by Nikhil VJ

1 Answer

Moving this answer here from another place where it got a hostile reception.

Found one encoding that actually accepts (meaning, doesn't error out on) byte 0xc0:

encoding="ISO-8859-1"  

Note: this assumes the rest of the file doesn't rely on multi-byte Unicode characters. ISO-8859-1 decodes every byte, so such characters wouldn't raise an error; they would just come out as mojibake. This may be helpful for folks like me who didn't have any Unicode characters in their file anyway and just wanted Python to load the thing while both the utf-8 and ascii encodings were erroring out.

More on ISO-8859-1 : What is the difference between UTF-8 and ISO-8859-1?

New command that works:

>>> df = pd.read_table(fn, na_filter=False, error_bad_lines=False, encoding='ISO-8859-1')

(In newer pandas versions, error_bad_lines is deprecated; on_bad_lines='skip' is its replacement.)

After reading it in, the dataframe is fine: the columns and data all work like they did in OpenOffice Calc. I still have no idea where the offending 0xc0 byte went, but it doesn't matter as I've got the data I needed.
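For anyone wondering where the 0xc0 byte "went": ISO-8859-1 (latin-1) maps every one of the 256 possible byte values to a character, so decoding can never fail and nothing is dropped; 0xc0 simply decodes to 'À'. A quick sketch demonstrating this:

```python
# ISO-8859-1 / latin-1 assigns a character to all 256 byte values,
# so decoding can never raise -- the rogue byte is kept, not dropped.
raw = b"hello\xc0world"
decoded = raw.decode("ISO-8859-1")
print(decoded)  # helloÀworld -- 0xc0 became the character 'À'

# Every possible byte value round-trips through latin-1:
print(len(bytes(range(256)).decode("ISO-8859-1")))  # 256
```

So the byte is still in the data, just rendered as a Latin-1 character; if it bothers you, you can drop or replace it after loading.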

Answered Nov 30 '25 18:11 by Nikhil VJ

