I'm using pandas 0.19.2.
Here's an example:
testdf = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [1.0, 2.0, 3.0, 4.0]})
testdf.dtypes
Output:
A int64
B float64
dtype: object
Everything looks fine so far, but here is what bothers me (note that the first call below goes through pd.Series.iloc and the second through pd.DataFrame.iloc):
print(type(testdf.A.iloc[0]))
print(type(testdf.iloc[0].A))
Output:
<class 'numpy.int64'>
<class 'numpy.float64'>
I found this while trying to understand why a pd.DataFrame.join() operation returned almost no matches between two int64 columns when there should have been many. My guess is that this type inconsistency is connected with that behaviour, but I'm not sure. My short investigation revealed the issue above, and now I'm a bit confused.
If someone knows how to solve this, I'll be very grateful for any hints!
UPD: Thanks to @EdChum for the comments. Here is an example of the join/merge behaviour with my data:
testdf.join(testdf, on='A', rsuffix='3')
A B A3 B3
0 1 1.0 2.0 2.0
1 2 2.0 3.0 3.0
2 3 3.0 4.0 4.0
3 4 4.0 NaN NaN
Meanwhile, what I expected to behave the same way,
pd.merge(left=testdf, right=testdf, on='A')
returns
A B_x B_y
0 1 1.0 1.0
1 2 2.0 2.0
2 3 3.0 3.0
3 4 4.0 4.0
UPD2: Summarizing @EdChum's comments on the join and merge behaviour: the problem is that A.join(B, on='C') matches the column A['C'] against the index of B, because join always joins on the other frame's index by default. In my case I simply used merge to get the desired result.
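For the record, a minimal sketch of how the join output above can be reproduced with merge by spelling out the column-vs-index matching explicitly (same testdf as in the example):

```python
import pandas as pd

testdf = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [1.0, 2.0, 3.0, 4.0]})

# join(on='A') matches the left *column* A against the right frame's *index*;
# merge can express the same thing explicitly:
joined = testdf.merge(testdf, left_on='A', right_index=True,
                      how='left', suffixes=('', '3'))
print(joined)
```

This reproduces the A/B/A3/B3 table above, including the NaN row, because A == 4 has no matching label in the right index.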
This is as expected. pandas tracks dtypes per column. When you call testdf.iloc[0] you are asking pandas for a row, so it has to convert the entire row into a single Series. Since that row contains a float, the whole row-Series must be upcast to float64.
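A quick way to see this upcast directly, using the same frame as above:

```python
import pandas as pd

testdf = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [1.0, 2.0, 3.0, 4.0]})

row = testdf.iloc[0]       # the whole row squeezed into one Series
print(row.dtype)           # float64 -- the common dtype of int64 and float64
print(testdf['A'].dtype)   # int64 -- the column itself is untouched
```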
However, loc and iloc seem to make this conversion even when a single __getitem__ call selects just one scalar.
Here are some interesting test cases. First, a testdf with a single int column:
testdf = pd.DataFrame({'A': [1, 2, 3, 4]})
print(type(testdf.iloc[0].A))
print(type(testdf.A.iloc[0]))
<class 'numpy.int64'>
<class 'numpy.int64'>
Now change it to the OP's test case:
testdf = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [1.0, 2.0, 3.0, 4.0]})
print(type(testdf.iloc[0].A))
print(type(testdf.A.iloc[0]))
<class 'numpy.float64'>
<class 'numpy.int64'>
print(type(testdf.loc[0, 'A']))
print(type(testdf.iloc[0, 0]))
print(type(testdf.at[0, 'A']))
print(type(testdf.iat[0, 0]))
print(type(testdf.get_value(0, 'A')))
<class 'numpy.float64'>
<class 'numpy.float64'>
<class 'numpy.int64'>
<class 'numpy.int64'>
<class 'numpy.int64'>
So it appears that when pandas uses loc or iloc it performs some conversion across rows that I still don't fully understand. I'm sure it has to do with the fact that loc and iloc differ from at, iat, and get_value: loc and iloc let you access the dataframe with index arrays and boolean arrays, while at, iat, and get_value only ever access a single cell at a time.
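As a practical consequence, selecting the column before the row, or using a scalar accessor, never builds the mixed-dtype row in the first place. A small sketch with the same frame (get_value is omitted here since it was deprecated in later pandas versions):

```python
import pandas as pd
import numpy as np

testdf = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [1.0, 2.0, 3.0, 4.0]})

v1 = testdf['A'].iloc[0]   # column first: no mixed-dtype row is ever built
v2 = testdf.at[0, 'A']     # scalar accessor: reads one cell from the int64 column
v3 = testdf.iat[0, 0]      # positional scalar accessor, same idea

print(type(v1), type(v2), type(v3))  # all numpy.int64
```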
Despite that:
testdf.loc[0, 'A'] = 10
print(type(testdf.at[0, 'A']))
<class 'numpy.int64'>
When we assign to that location via loc, pandas keeps the column dtype consistent: column A stays int64.