Pandas DataFrame iloc spoils the data type

I'm using pandas 0.19.2.

Here's an example:

import pandas as pd

testdf = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [1.0, 2.0, 3.0, 4.0]})
testdf.dtypes

Output:

A      int64
B    float64
dtype: object

Everything looks fine so far, but here is what bothers me (note that the first call goes through pd.Series.iloc and the second through pd.DataFrame.iloc):

print(type(testdf.A.iloc[0]))
print(type(testdf.iloc[0].A))

Output:

<class 'numpy.int64'>
<class 'numpy.float64'>

I found this while trying to understand why a pd.DataFrame.join() operation returned almost no intersections between two int64 columns when there should have been many. My guess is that a type inconsistency is to blame, which might be connected with this behaviour, but I'm not sure. My short investigation revealed the behaviour above, and now I'm a bit confused.

If someone knows how to solve it, I'd be very grateful for any hints!

UPD

Thanks to @EdChum for the comments. Here is an example with my generated data showing the join/merge behaviour:

testdf.join(testdf, on='A', rsuffix='3')

    A   B   A3  B3 
0   1   1.0 2.0 2.0
1   2   2.0 3.0 3.0
2   3   3.0 4.0 4.0
3   4   4.0 NaN NaN

And pd.merge(left=testdf, right=testdf, on='A'), which I expected to be roughly equivalent, returns

    A   B_x B_y
0   1   1.0 1.0
1   2   2.0 2.0
2   3   3.0 3.0
3   4   4.0 4.0

UPD2: Restating @EdChum's comment on join and merge behaviour. The problem is that A.join(B, on='C') matches column A['C'] against the index of B, since by default join uses the index of the other frame. In my case I just used merge to get the desired result.
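A minimal sketch of the difference described in UPD2, using the testdf from the question. The set_index variant at the end is an assumed alternative added for illustration, not something proposed in the thread.

```python
import pandas as pd

testdf = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [1.0, 2.0, 3.0, 4.0]})

# join(on='A') looks up each value of the left column 'A' in the INDEX of the
# right frame (0..3), so A=1 picks up right row 1 and A=4 finds no match.
joined = testdf.join(testdf, on='A', rsuffix='3')

# merge(on='A') compares column 'A' with column 'A', so every row matches.
merged = pd.merge(left=testdf, right=testdf, on='A')

# Assumed alternative: move 'A' into the right frame's index so that join
# compares the same values that merge does.
joined_on_values = testdf.join(testdf.set_index('A'), on='A', rsuffix='3')

print(joined)
print(merged)
print(joined_on_values)
```

The unmatched row is also why A3 and B3 show up as floats in the join output: a column that has to hold NaN cannot stay int64.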

ghastly_kitten asked Jan 15 '17 15:01


1 Answer

This is as expected. pandas tracks dtypes per column. When you call testdf.iloc[0] you are asking pandas for a row, and it has to convert the entire row into a Series with a single dtype. That row contains a float, so the row as a Series must be float64, and the int from column A is upcast along the way.
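A small sketch of that upcast, using the same testdf. np.result_type is used here only to show which common dtype NumPy picks for int64 and float64; it is an illustration and not necessarily the exact code path pandas takes internally.

```python
import numpy as np
import pandas as pd

testdf = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [1.0, 2.0, 3.0, 4.0]})

# Selecting a row builds one Series, and a Series has exactly one dtype.
row = testdf.iloc[0]
print(row.dtype)

# The common dtype that can hold both an int64 and a float64 value.
print(np.result_type(np.int64, np.float64))

# Reading 'A' from the row gives a float; reading it from the column does not.
print(type(row['A']))
print(type(testdf['A'].iloc[0]))
```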

However, it seems that loc and iloc also make this conversion when you pass both the row and the column in a single __getitem__ call, as the loc[0, 'A'] and iloc[0, 0] cases further down show.

Here are some interesting test cases for a testdf with one int column:

testdf = pd.DataFrame({'A': [1, 2, 3, 4]})

print(type(testdf.iloc[0].A))
print(type(testdf.A.iloc[0]))

<class 'numpy.int64'>
<class 'numpy.int64'>

Now change it to the OP's test case:

testdf = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [1.0, 2.0, 3.0, 4.0]})

print(type(testdf.iloc[0].A))
print(type(testdf.A.iloc[0]))

<class 'numpy.float64'>
<class 'numpy.int64'>

print(type(testdf.loc[0, 'A']))
print(type(testdf.iloc[0, 0]))
print(type(testdf.at[0, 'A']))
print(type(testdf.iat[0, 0]))
print(type(testdf.get_value(0, 'A')))

<class 'numpy.float64'>
<class 'numpy.float64'>
<class 'numpy.int64'>
<class 'numpy.int64'>
<class 'numpy.int64'>

So it appears that loc and iloc perform some conversion across the row, which I still don't fully understand. I suspect it comes from loc and iloc being different in nature from at, iat, and get_value: loc and iloc let you index the dataframe with index arrays and boolean arrays, while at, iat, and get_value only ever access a single cell.
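As a hedged sketch, these are the single-cell reads that keep the column's dtype. get_value is left out because it was deprecated in later pandas releases, and loc[0, 'A'] and iloc[0, 0] are not checked here because their result may differ from the 0.19.2 output shown above on other pandas versions.

```python
import numpy as np
import pandas as pd

testdf = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [1.0, 2.0, 3.0, 4.0]})

# Scalar accessors go straight to one cell, so no row Series is built.
by_label = testdf.at[0, 'A']
by_position = testdf.iat[0, 0]

# Selecting the column first also avoids mixing dtypes across the row.
column_first = testdf['A'].iloc[0]

for value in (by_label, by_position, column_first):
    print(type(value))
```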


Despite that, assignment through loc behaves consistently:

testdf.loc[0, 'A'] = 10

print(type(testdf.at[0, 'A']))

When we assign to that location via loc, pandas ensures the dtype stays consistent.
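A minimal check of that claim, assuming an integer value is assigned into the integer column (assigning a non-integer value is a different case and is not covered here):

```python
import numpy as np
import pandas as pd

testdf = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [1.0, 2.0, 3.0, 4.0]})

# Assign an integer into the int64 column through loc.
testdf.loc[0, 'A'] = 10

# The column dtype is unchanged, and the cell reads back as an integer.
print(testdf.dtypes)
print(type(testdf.at[0, 'A']))
```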

piRSquared answered Sep 21 '22 11:09