
Pandas read issue, 0xff in position 0

I've generated a huge (6 GB) text file using a Windows command-line program (samtools.exe):

.\samtools.exe mpileup -O bamfile.bam > txtfile.tsv

The generated file is actually a tab-separated table. When I try to open it with pandas.read_table, I get:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

When I print the first line of the file, it looks like this:
ÿþAL645882 473 N 1 ^!c I 1
Everything looks normal except the first character. If I open the file in 'rb' mode, the first byte is indeed 0xff.

I really want to read this table as a pandas DataFrame, but the file is huge. Is there any way to make Python ignore the 0xff byte, or to simply delete that byte from the file?

Thanks in advance!

snail123815 asked Jul 12 '17 19:07


2 Answers

That looks like a UTF-16 BOM header being misinterpreted:

In [25]: with open("tmp.csv", "wb") as fp:
    ...:     fp.write("a,b\n1,2".encode("utf-16"))
    ...: 

In [26]: open("tmp.csv", "rb").read().decode("latin-1")
Out[26]: 'ÿþa\x00,\x00b\x00\n\x001\x00,\x002\x00'

In [27]: print(open("tmp.csv", "rb").read().decode("latin-1"))
ÿþa,b
1,2

So you could try interpreting it as UTF-16:

In [29]: pd.read_csv("tmp.csv")
[...]
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

In [30]: pd.read_csv("tmp.csv", encoding='utf-16')
Out[30]: 
   a  b
0  1  2
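The check behind this suggestion can be sketched with the standard library alone: read the first few bytes and compare them against the known BOM signatures. The helper name `sniff_bom_encoding` and the `_BOMS` table are made up for this illustration, and the sketch assumes the file actually starts with a BOM; a BOM-less file simply falls back to the default.

```python
import codecs

# BOM signatures paired with a codec that consumes the BOM while decoding.
# UTF-32 is listed first because its little-endian BOM begins with the
# same two bytes (FF FE) as the UTF-16 little-endian BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_bom_encoding(path, default="utf-8"):
    """Return a codec name based on the file's leading BOM, or `default`."""
    with open(path, "rb") as fp:
        head = fp.read(4)  # the longest BOM is 4 bytes
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return default
```

The returned name could then be passed on, for example as `pd.read_csv(path, sep="\t", encoding=sniff_bom_encoding(path))`.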

(If the first two bytes really were the only problem, there are other hacks you could use, such as opening a file pointer and reading two bytes before handing it over. But I suspect that, as in the example above, the file contains null bytes that aren't immediately obvious, so it's best to use the right encoding instead.)
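If a UTF-8 copy of the file is wanted (for other tools, say), stripping two bytes would not be enough for exactly the reason given above: every character is stored as two bytes. A minimal sketch of a streaming re-encode follows, assuming the source really is UTF-16 with a BOM. The function name `transcode_to_utf8` and its parameters are hypothetical; it reads in fixed-size chunks so a multi-gigabyte file never has to fit in memory.

```python
def transcode_to_utf8(src, dst, src_encoding="utf-16", chunk_chars=1 << 20):
    """Stream-copy `src` to `dst`, re-encoding the text as UTF-8 (no BOM)."""
    # newline="" turns off newline translation, so line endings are
    # copied through exactly as they appear in the source file.
    with open(src, "r", encoding=src_encoding, newline="") as fin, \
            open(dst, "w", encoding="utf-8", newline="") as fout:
        while True:
            chunk = fin.read(chunk_chars)  # counted in characters, not bytes
            if not chunk:
                break
            fout.write(chunk)
```

The "utf-16" codec consumes the BOM while decoding, so the output starts directly with the data. For a file this size, reading it directly with `encoding='utf-16'` and the `chunksize` argument of `pd.read_csv` is another option that avoids writing a second copy.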

DSM answered Oct 17 '22 23:10


This could work on Windows 7 with Spyder 3.6. Reading the file with the default encoding failed with:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x85 in position 607: invalid start byte

Passing encoding='iso-8859-1' resolved it. Result:

data = pd.read_csv("C:/Users/Manjeesh/all_state_cancer.csv", encoding='iso-8859-1')

data
Out[207]: 
     s.no           user.location  \
0       1               Ahmedabad   
1       2   Madhya Pradesh, India   
2       3           Shahdol (MP)    
3       4           Shahdol (MP)    
4       5               Ahmedabad   
5       6        Bengaluru, India   
6       7   Madhya Pradesh, India   
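One caveat about this approach, shown as a small sketch: ISO-8859-1 assigns a character to every possible byte, so decoding with it never raises, but that does not guarantee the text comes out as intended. The bytes below are invented for the illustration and are not taken from the CSV above. They compare ISO-8859-1 with Windows-1252, which is only a guess at what a Windows-produced file might use.

```python
raw = b"caf\xe9 \x85"  # hypothetical sample bytes

# ISO-8859-1 maps each byte straight to the code point with the same
# number, so 0x85 becomes the control character U+0085.
as_latin1 = raw.decode("iso-8859-1")

# Windows-1252 maps 0x85 to U+2026 (horizontal ellipsis) instead.
as_cp1252 = raw.decode("cp1252")

# ascii() escapes non-ASCII characters, so the difference is visible
# regardless of the console encoding.
print(ascii(as_latin1))
print(ascii(as_cp1252))
```

For the UTF-16 file in the question, ISO-8859-1 would likewise decode without an error but leave the BOM characters and the null bytes in the data.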
Manjeshan answered Oct 18 '22 01:10