I have a CSV file that contains 130,000 rows. After reading the file with pandas' read_csv function, one of the columns ("CallGuid") has mixed object types.
I did:
import pandas as pd
df = pd.read_csv("data.csv")
Then I have this:
In [10]: df["CallGuid"][32767]
Out[10]: 4129237051L
In [11]: df["CallGuid"][32768]
Out[11]: u'4129259051'
All rows at index <= 32767 are of type long, and all rows at index > 32767 are unicode.
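For what it's worth, here is a quick check over the whole column (just a sketch, using the same df as above) that shows both types are present:
# Count how many values of each Python type ended up in the column
print(df["CallGuid"].map(type).value_counts())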
Why is this?
OK, I just experienced the same problem, with the same symptom: df[column][n] changed type after n > 32767.
I did indeed have a problem in my data, but not at all at line 32767.
Finding and modifying these few problematic lines solved my problem. I managed to locate them by using the following (extremely dirty) routine:
import pandas as pd

df = pd.read_csv('data.csv', chunksize=10000)
i = 0
for chunk in df:
    # Print the chunk number and the dtype pandas inferred for the column
    print("{} {}".format(i, chunk["Custom Dimension 02"].dtype))
    i += 1
I ran this and obtained:
0 int64
1 int64
2 int64
3 int64
4 int64
5 int64
6 object
7 int64
8 object
9 int64
10 int64
This told me that there was (at least) one problematic line between rows 60000 and 69999 and one between rows 80000 and 89999.
To locate them more precisely, you can use a smaller chunksize and print only the row ranges of the chunks that do not have the correct data type, as in the sketch below.
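For example, a minimal sketch of that narrowing step might look like this (the column name and chunk size are placeholders taken from the run above):
import pandas as pd

chunksize = 1000
for i, chunk in enumerate(pd.read_csv('data.csv', chunksize=chunksize)):
    # Report only the chunks whose column was not parsed as numeric
    if chunk["Custom Dimension 02"].dtype == object:
        print("problem somewhere in rows {} to {}".format(i * chunksize, (i + 1) * chunksize - 1))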
As others have pointed out, your data could be malformed, like having quotes or something...
Just try doing:
import pandas as pd
import numpy as np
df = pd.read_csv("data.csv", dtype={"CallGuid": np.int64})
It's also more memory efficient, since pandas doesn't have to guess the data types.
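If that raises an error because of malformed rows, one option (a rough sketch, reusing the file and column names from the question) is to read the column as strings and coerce it afterwards, which also shows you exactly which rows are bad:
import pandas as pd

df = pd.read_csv("data.csv", dtype={"CallGuid": str})
# Coerce to numbers; malformed values become NaN and can then be inspected
as_num = pd.to_numeric(df["CallGuid"], errors="coerce")
print(df[as_num.isna()])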