I'm using pandas 0.19.2.
Here's an example:
testdf = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [1.0, 2.0, 3.0, 4.0]})
testdf.dtypes
Output:
A int64
B float64
dtype: object
Everything looks fine so far, but here is what bothers me (note that the first call below goes through pd.Series.iloc and the second through pd.DataFrame.iloc):
print(type(testdf.A.iloc[0]))
print(type(testdf.iloc[0].A))
Output:
<class 'numpy.int64'>
<class 'numpy.float64'>
I found this while trying to understand why a pd.DataFrame.join() operation returned almost no matches between two int64 columns when there should have been many. My guess is that this type inconsistency is connected with that behaviour, but I'm not sure. My short investigation revealed the issue above, and now I'm a bit confused.
If someone knows how to solve this, I'll be very grateful for any hints!
UPD: Thanks to @EdChum for the comments. Here is an example of the join/merge behaviour with my data:
testdf.join(testdf, on='A', rsuffix='3')
A B A3 B3
0 1 1.0 2.0 2.0
1 2 2.0 3.0 3.0
2 3 3.0 4.0 4.0
3 4 4.0 NaN NaN
Meanwhile, what I expected to behave the same way,
pd.merge(left=testdf, right=testdf, on='A')
returns
A B_x B_y
0 1 1.0 1.0
1 2 2.0 2.0
2 3 3.0 3.0
3 4 4.0 4.0
UPD2: Summarizing @EdChum's comments on the join and merge behaviour: the problem is that A.join(B, on='C') matches the column A['C'] against the index of B, because join always joins on the other frame's index by default. In my case I simply used merge to get the desired result.
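For the record, a minimal sketch of how the join output above can be reproduced with merge by spelling out the column-vs-index matching explicitly (same testdf as in the example):

```python
import pandas as pd

testdf = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [1.0, 2.0, 3.0, 4.0]})

# join(on='A') matches the left *column* A against the right frame's *index*;
# merge can express the same thing explicitly:
joined = testdf.merge(testdf, left_on='A', right_index=True,
                      how='left', suffixes=('', '3'))
print(joined)
```

This reproduces the A/B/A3/B3 table above, including the NaN row, because A == 4 has no matching label in the right index.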
This is as expected. pandas tracks dtypes per column. When you call testdf.iloc[0] you are asking pandas for a row, so it has to convert the entire row into a single Series. Since that row contains a float, the whole row-Series must be upcast to float64.
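A quick way to see this upcast directly, using the same frame as above:

```python
import pandas as pd

testdf = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [1.0, 2.0, 3.0, 4.0]})

row = testdf.iloc[0]       # the whole row squeezed into one Series
print(row.dtype)           # float64 -- the common dtype of int64 and float64
print(testdf['A'].dtype)   # int64 -- the column itself is untouched
```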
However, loc and iloc seem to make this conversion even when a single __getitem__ call selects just one scalar.
Here are some interesting test cases. First, a testdf with a single int column:
testdf = pd.DataFrame({'A': [1, 2, 3, 4]})
print(type(testdf.iloc[0].A))
print(type(testdf.A.iloc[0]))
<class 'numpy.int64'>
<class 'numpy.int64'>
Now change it to the OP's test case:
testdf = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [1.0, 2.0, 3.0, 4.0]})
print(type(testdf.iloc[0].A))
print(type(testdf.A.iloc[0]))
<class 'numpy.float64'>
<class 'numpy.int64'>
print(type(testdf.loc[0, 'A']))
print(type(testdf.iloc[0, 0]))
print(type(testdf.at[0, 'A']))
print(type(testdf.iat[0, 0]))
print(type(testdf.get_value(0, 'A')))
<class 'numpy.float64'>
<class 'numpy.float64'>
<class 'numpy.int64'>
<class 'numpy.int64'>
<class 'numpy.int64'>
So it appears that when pandas uses loc or iloc it performs some conversion across rows that I still don't fully understand. I'm sure it has to do with the fact that loc and iloc differ from at, iat, and get_value: loc and iloc let you access the dataframe with index arrays and boolean arrays, while at, iat, and get_value only ever access a single cell at a time.
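As a practical consequence, selecting the column before the row, or using a scalar accessor, never builds the mixed-dtype row in the first place. A small sketch with the same frame (get_value is omitted here since it was deprecated in later pandas versions):

```python
import pandas as pd
import numpy as np

testdf = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [1.0, 2.0, 3.0, 4.0]})

v1 = testdf['A'].iloc[0]   # column first: no mixed-dtype row is ever built
v2 = testdf.at[0, 'A']     # scalar accessor: reads one cell from the int64 column
v3 = testdf.iat[0, 0]      # positional scalar accessor, same idea

print(type(v1), type(v2), type(v3))  # all numpy.int64
```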
Despite that:
testdf.loc[0, 'A'] = 10
print(type(testdf.at[0, 'A']))
<class 'numpy.int64'>
When we assign to that location via loc, pandas keeps the column dtype consistent: column A stays int64.