 

pandas read_csv import gives mixed type for a column

Tags:

python

pandas

I have a CSV file that contains 130,000 rows. After reading in the file using pandas' read_csv function, one of the columns ("CallGuid") has mixed object types.

I did:

df = pd.read_csv("data.csv")

Then I have this:

In [10]: df["CallGuid"][32767]
Out[10]: 4129237051L    

In [11]: df["CallGuid"][32768]
Out[11]: u'4129259051'

All rows <= 32767 are of type long, and all rows > 32767 are unicode.

Why is this?

asked Aug 27 '14 by lessthanl0l

2 Answers

OK, I just experienced the same problem, with the same symptom: df[column][n] changed type after n > 32767.

I did indeed have a problem in my data, but not at line 32767 at all.

Finding and fixing those few problematic lines solved my problem. I managed to localize them with the following (admittedly dirty) routine:

import pandas as pd

reader = pd.read_csv('data.csv', chunksize=10000)
for i, chunk in enumerate(reader):
    print("{} {}".format(i, chunk["Custom Dimension 02"].dtype))

Running this, I obtained:

0 int64
1 int64
2 int64
3 int64
4 int64
5 int64
6 object
7 int64
8 object
9 int64
10 int64

This told me that there was at least one problematic line between rows 60000 and 69999, and another between rows 80000 and 89999.

To localize them more precisely, you can use a smaller chunksize and print only the numbers of the chunks that do not have the correct data type.
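A minimal sketch of that refinement, using a self-contained in-memory CSV standing in for data.csv (the column name is taken from the answer above; the sample data and the malformed value are invented for illustration):

```python
import io
import pandas as pd

# Stand-in for data.csv: mostly clean integers, with one malformed
# quoted value ("2,5") planted at row 2500.
csv_text = "Custom Dimension 02\n" + "\n".join(
    ["1"] * 2500 + ['"2,5"'] + ["3"] * 1500
)

# Scan with a smaller chunksize and report only the chunks whose
# dtype is not the expected int64.
bad_chunks = []
reader = pd.read_csv(io.StringIO(csv_text), chunksize=1000)
for i, chunk in enumerate(reader):
    if chunk["Custom Dimension 02"].dtype != "int64":
        bad_chunks.append(i)
        print("problem between rows {} and {}".format(
            i * 1000, i * 1000 + len(chunk) - 1))
```

Because each chunk is parsed independently, only the chunk containing the malformed value comes back as object, narrowing the search to a 1000-row window; you can then shrink chunksize further or inspect that slice directly.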

answered Sep 18 '22 by WNG


As others have pointed out, your data could be malformed, like having quotes or something...

Just try doing:

import pandas as pd
import numpy as np

df = pd.read_csv("data.csv", dtype={"CallGuid": np.int64})

It's also more memory efficient, since pandas doesn't have to guess the data types.
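If forcing the dtype raises a ValueError because of malformed values, one way to locate them (a sketch, not part of the answer itself) is to read the column as strings and coerce it to numeric, so the bad rows show up as NaN. The sample data below is invented for illustration:

```python
import io
import pandas as pd

# Stand-in for data.csv: one malformed CallGuid among valid integers.
csv_text = "CallGuid\n4129237051\n4129259051\nbad-guid\n4129260000\n"

# Read the column as strings, then coerce: invalid values become NaN.
df = pd.read_csv(io.StringIO(csv_text), dtype={"CallGuid": str})
coerced = pd.to_numeric(df["CallGuid"], errors="coerce")
bad_rows = df[coerced.isna()]
print(bad_rows)
```

Once the offending rows are fixed or dropped, the dtype= approach above should load the column cleanly as int64.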

answered Sep 22 '22 by paulo.filip3