I have a DataFrame:
name | age | |
---|---|---|
0 | Paul | 25 |
1 | John | 27 |
2 | Bill | 23 |
I know that if I enter:
df[['name']] = df[['age']]
I'll get the following:
name | age | |
---|---|---|
0 | 25 | 25 |
1 | 27 | 27 |
2 | 23 | 23 |
But I expect the same result from the command:
df.loc[:, ['name']] = df.loc[:, ['age']]
But instead, I get this:
name | age | |
---|---|---|
0 | NaN | 25 |
1 | NaN | 27 |
2 | NaN | 23 |
For some reason, if I omit those square brackets []
around column names, I'll get exactly what I expected. That is the command:
df.loc[:, 'name'] = df.loc[:, 'age']
gives the right result:
name | age | |
---|---|---|
0 | 25 | 25 |
1 | 27 | 27 |
2 | 23 | 23 |
Why does two pairs of brackets with .loc
result in NaN? Is it some sort of a bug or is it intended behaviour? I can't figure out the reason for such a behaviour.
loc[] method is a method that takes only index labels and returns row or dataframe if the index label exists in the caller data frame. Syntax: pandas.DataFrame.loc[]
Indexing DataFrames You can either use a single bracket or a double bracket. The single bracket will output a Pandas Series, while a double bracket will output a Pandas DataFrame. Square brackets can also be used to access observations (rows) from a DataFrame.
The main difference between pandas loc[] vs iloc[] is loc gets DataFrame rows & columns by labels/names and iloc[] gets by integer Index/position. For loc[], if the label is not present it gives a key error. For iloc[], if the position is not present it gives an index error.
Pandas Data Alignment
(emphasis mine):pandas aligns all AXES when setting Series and DataFrame from .loc, and .iloc.
You can find this excerpt labelled with Warning under the Basics
header.
They have given an example to explain.
In [9]: df[['A', 'B']]
Out[9]:
A B
2000-01-01 -0.282863 0.469112
2000-01-02 -0.173215 1.212112
2000-01-03 -2.104569 -0.861849
2000-01-04 -0.706771 0.721555
2000-01-05 0.567020 -0.424972
2000-01-06 0.113648 -0.673690
2000-01-07 0.577046 0.404705
2000-01-08 -1.157892 -0.370647
In [10]: df.loc[:, ['B', 'A']] = df[['A', 'B']]
In [11]: df[['A', 'B']]
Out[11]:
A B
2000-01-01 -0.282863 0.469112
2000-01-02 -0.173215 1.212112
2000-01-03 -2.104569 -0.861849
2000-01-04 -0.706771 0.721555
2000-01-05 0.567020 -0.424972
2000-01-06 0.113648 -0.673690
2000-01-07 0.577046 0.404705
2000-01-08 -1.157892 -0.370647
From Docs(emphasis mine):
This will not modify df because the column alignment is before value assignment.
Accessing the array can be useful when you need to do some operation without the index (to disable automatic alignment, for example).
The alignment comes into play when LHS and RHS are dataframes. To avoid alignment try using.
df.loc[:, ['B', 'A']] = df[['A', 'B']].to_numpy()
You have two cases at hand,
.loc
assignment with pd.DataFrame
..loc
assignment with pd.Series
in EDIT..loc
assignment in pd.DataFrame
In pd.DataFrame
has 2 axes index
and columns
. So, when you do
df.loc[:, ['name']] = df.loc[:, ['age']]
LHS has column A
which doesn't align with RHS column B
hence resulting in all NaN
after assignment.
From Docs DataAlignment
(emphasis mine)
Data alignment between DataFrame objects automatically align on both the columns and the index (row labels). Again, the resulting object will have the union of the column and row labels.
You can find this behaviour in most of the pandas' operations if not all. Example, addition, subtraction, multiplication etc. Non matching indices and columns are filled with NaN
.
Example from Data Alignment and Arthimetics
df = pd.DataFrame(np.random.randn(10, 4), columns=["A", "B", "C", "D"]) df2 = pd.DataFrame(np.random.randn(7, 3), columns=["A", "B", "C"]) df + df2 A B C D 0 0.045691 -0.014138 1.380871 NaN 1 -0.955398 -1.501007 0.037181 NaN 2 -0.662690 1.534833 -0.859691 NaN 3 -2.452949 1.237274 -0.133712 NaN 4 1.414490 1.951676 -2.320422 NaN 5 -0.494922 -1.649727 -1.084601 NaN 6 -1.047551 -0.748572 -0.805479 NaN 7 NaN NaN NaN NaN 8 NaN NaN NaN NaN 9 NaN NaN NaN NaN
To answer your comment
But why do column indexes need to match? I can see why one want row indexes to match, but why column indexes?
Let's take a look at the above example, if columns are not aligned how would you add two DataFrames? It makes sense to align them on columns and indices.
.loc
assignment in pd.Series
pd.Series
has only one axis i.e index
. That is the reason why it worked when you did
df.loc[:, 'name'] = df.loc[:, 'age']
As pd.Series
has only one axis, pandas tried to align index
and it worked. Of course, if index
doesn't align it results in NaN
values.
Series Alignment
(emphasis mine):The result of an operation between unaligned Series will have the union of the indexes involved. If a label is not found in one Series or the other, the result will be marked as missing
NaN
.
That's because for the loc
assignment all index axes are aligned, including the columns: Since age
and name
do not match, there is no data to assign, hence the NaNs.
You can make it work by renaming the columns:
df.loc[:, ["name"]] = df.loc[:, ["age"]].rename(columns={"age": "name"})
or by accessing the numpy array:
df.loc[:, ["name"]] = df.loc[:, ["age"]].values
When you use double brackets [[]] you are assigning a DataFrame. What you want is assign a (column) Series, and for that you use only one bracket [].
Here is some code:
import pandas as pd
df = pd.DataFrame({'name':['Paul','John','Bill'], 'age':[25,27,23]})
print('Inital Dataframe:\n',df)
df[['name']] = df[['age']]
print("\ndf[['name']] = df[['age']]\n",df)
print("df.loc[:, ['age']]:", type(df.loc[:, ['age']]))
print("df.loc[:, ['name']]:", type(df.loc[:, ['name']]))
df.loc[:, ['name']] = df.loc[:, ['age']]
print("\ndf.loc[:, ['name']] = df.loc[:, ['age']]\n",df)
print('=======================')
df = pd.DataFrame({'name':['Paul','John','Bill'], 'age':[25,27,23]})
print('Inital Dataframe:\n',df)
print("type(df.loc[:, 'age']):", type(df.loc[:, 'age']))
print("type(df.loc[:, 'name']):", type(df.loc[:, 'name']))
df.loc[:, 'name'] = df.loc[:, 'age']
print("\ndf.loc[:, 'name'] = df.loc[:, 'age']\n",df)
And the output:
Inital Dataframe:
name age
0 Paul 25
1 John 27
2 Bill 23
df[['name']] = df[['age']]
name age
0 25 25
1 27 27
2 23 23
df.loc[:, ['age']]: <class 'pandas.core.frame.DataFrame'>
df.loc[:, ['name']]: <class 'pandas.core.frame.DataFrame'>
df.loc[:, ['name']] = df.loc[:, ['age']]
name age
0 NaN 25.0
1 NaN 27.0
2 NaN 23.0
=======================
Inital Dataframe:
name age
0 Paul 25
1 John 27
2 Bill 23
type(df.loc[:, 'age']): <class 'pandas.core.series.Series'>
type(df.loc[:, 'name']): <class 'pandas.core.series.Series'>
df.loc[:, 'name'] = df.loc[:, 'age']
name age
0 25 25
1 27 27
2 23 23
However, here is another strange behaviour: Assigning the double brackets to difference variables, say df1
and df2
, and then df1 = df2
works!
Here is some more code:
df = pd.DataFrame({'name':['Paul','John','Bill'], 'age':[25,27,23]})
print('Inital Dataframe:\n',df)
df1 = df.loc[:, ['name']]
df2 = df.loc[:, ['age']]
print("\ndf1 = df.loc[:, ['name']]\n",df1)
print("\ndf2 = df.loc[:, ['age']]\n",df2)
df1=df2
print("\ndf1=df2\ndf1:\n",df1)
And the output:
Inital Dataframe:
name age
0 Paul 25
1 John 27
2 Bill 23
df1 = df.loc[:, ['name']]
name
0 Paul
1 John
2 Bill
df2 = df.loc[:, ['age']]
age
0 25
1 27
2 23
df1=df2
df1:
age
0 25
1 27
2 23
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With