So I was just answering a question and I came across something interesting:
The dataframe looks like this:
string1 string2
0 abc def
1 ghi jkl
2 mno pqr
3 stu vwx
So when I do the following, assigning a new column works:
df['string3'] = df.string2
print(df)
string1 string2 string3
0 abc def def
1 ghi jkl jkl
2 mno pqr pqr
3 stu vwx vwx
But when I use pandas.Series.sample, the new column does not get assigned, at least not in the sampled order:
df['string4'] = df.string2.sample(len(df.string2))
print(df)
string1 string2 string3 string4
0 abc def def def
1 ghi jkl jkl jkl
2 mno pqr pqr pqr
3 stu vwx vwx vwx
So I tested some things:
Test 1: Using sample without assigning gives the correct output:
df.string2.sample(len(df.string2))
2 pqr
1 jkl
0 def
3 vwx
Name: string2, dtype: object
Test 2: Overwriting the existing column does not work either:
df['string2'] = df.string2.sample(len(df.string2))
print(df)
string1 string2
0 abc def
1 ghi jkl
2 mno pqr
3 stu vwx
This works, but why?
df['string2'] = df.string2.sample(len(df.string2)).values
print(df)
string1 string2
0 abc jkl
1 ghi def
2 mno vwx
3 stu pqr
Why do I need to explicitly use .values or .tolist() to get the assignment right?
pandas is index-sensitive: when you assign a Series to a DataFrame column, pandas aligns the values on the index first. sample only changes the order of the rows; every value keeps its original index label (calling sort_index on the sampled Series gives back the original order). After alignment each value therefore lands back in its original row, and the df looks unchanged. If you assign a numpy array (or a list) instead, there is no index to consider, so the values are written back to the original df by position, which yields the shuffled output.
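The difference between aligned and positional assignment can be sketched as follows. This is a minimal illustration, not the only way to do it: the column names aligned and positional are made up for the example, sample(frac=1, random_state=0) stands in for sample(len(df)) so the shuffle is reproducible, and to_numpy() plays the role of .values.

```python
import pandas as pd

df = pd.DataFrame({
    "string1": ["abc", "ghi", "mno", "stu"],
    "string2": ["def", "jkl", "pqr", "vwx"],
})

# Shuffle the rows; each value keeps its original index label.
shuffled = df["string2"].sample(frac=1, random_state=0)

# Assigning the Series aligns on the index, so the shuffle is undone.
df["aligned"] = shuffled

# Assigning the bare array ignores the index and writes by position.
df["positional"] = shuffled.to_numpy()

print(df)
```

The aligned column always matches string2 row for row, while positional holds the values in whatever order sample returned them.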
An edge-case example: when none of the index labels of the assigned Series match the df, every row becomes NaN:
df['string3']=pd.Series(['aaa','aaa','aaa','aaa'],index=[100,111,112,113])
df
Out[462]:
string1 string2 string3
0 abc vwx NaN
1 ghi jkl NaN
2 mno dfe NaN
3 stu pqr NaN
Because assignment is index-sensitive, you can always do conditional assignment with .loc like this:
df.loc[df.condition, 'value'] = df.value * 100
# the rows that are not selected are left unchanged
This is the same as what you would do with np.where:
df['value'] = np.where(df.condition, df.value * 100, df.value)
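To check that the two forms really give the same result, here is a small sketch. The frame and its condition and value columns are invented for the illustration; they are not from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical data: a boolean mask column and a numeric column.
df = pd.DataFrame({"condition": [True, False, True], "value": [1, 2, 3]})

# .loc: the right-hand Series covers all rows, but only the rows
# selected by the mask are written, because pandas aligns on the index.
via_loc = df.copy()
via_loc.loc[via_loc["condition"], "value"] = via_loc["value"] * 100

# np.where: choose element-wise between the scaled and original values.
via_where = df.copy()
via_where["value"] = np.where(
    via_where["condition"], via_where["value"] * 100, via_where["value"]
)

print(via_loc["value"].tolist())    # [100, 2, 300]
print(via_where["value"].tolist())  # [100, 2, 300]
```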
Another use case: when I use groupby.apply with a non-aggregating function and try to assign the result back, why does it fail?
df['String4']=df.groupby('string1').apply(lambda x :x['string2']+'aa')
TypeError: incompatible index of inserted column with frame index
Let us look at what groupby.apply returns:
df.groupby('string1').apply(lambda x : x['string2']+'aa')
Out[466]:
string1
abc 0 vwxaa
ghi 1 jklaa
mno 2 dfeaa
stu 3 pqraa
Name: string2, dtype: object
Notice that it adds one more level to the index, so the result has a MultiIndex, while the original df only has a single-level index. The two cannot be aligned, which causes the error message.
How to fix it? Reset the index, dropping the group level so that only the original index (the second level of the groupby result) remains, then assign it back:
df['String4']=df.groupby('string1').apply(lambda x : x['string2']+'aa').reset_index(level=0,drop=True)
df
Out[469]:
string1 string2 string3 String4
0 abc vwx NaN vwxaa
1 ghi jkl NaN jklaa
2 mno dfe NaN dfeaa
3 stu pqr NaN pqraa
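The index fix can be seen in isolation with a hand-built Series. This is only a sketch: result is constructed manually to mimic the two-level output shown above, since the exact index that groupby.apply returns can differ between pandas versions.

```python
import pandas as pd

df = pd.DataFrame({
    "string1": ["abc", "ghi", "mno", "stu"],
    "string2": ["def", "jkl", "pqr", "vwx"],
})

# Stand-in for the groupby.apply result: level 0 is the group key,
# level 1 is the original row label.
result = pd.Series(
    ["defaa", "jklaa", "pqraa", "vwxaa"],
    index=pd.MultiIndex.from_arrays(
        [df["string1"], df.index], names=["string1", None]
    ),
    name="string2",
)

# Drop the group level so only the original row labels remain.
flat = result.reset_index(level=0, drop=True)

# The index now matches df, so the assignment aligns row by row.
df["String4"] = flat
print(df)
```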
As Erfan mentioned in the comments: how can we avoid accidentally assigning unwanted values to a pandas.DataFrame?
There are two different ways of assigning.
1st, with an array, list or tuple: these CANNOT align, which means the assignment fails when the length of the assigned object differs from the length of the df.
2nd, with a pandas object: this ALWAYS aligns, and no error is raised even when the lengths differ.
However, when the assigned object has a duplicated index, an error is raised:
df['string3']=pd.Series(['aaa','aaa','aaa','aaa'],index=[100,100,100,100])
ValueError: cannot reindex from a duplicate axis
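The three cases can be tried side by side. This is a rough sketch: the column names from_list, from_series and from_dupes are made up for the illustration, and the exact wording of the error messages varies between pandas versions, so only the exception type is checked.

```python
import pandas as pd

df = pd.DataFrame({"string1": ["abc", "ghi", "mno", "stu"]})

# 1) A list cannot align: a length mismatch raises ValueError.
try:
    df["from_list"] = ["x", "y", "z"]
    list_error = None
except ValueError as exc:
    list_error = exc

# 2) A Series aligns on the index: missing labels become NaN, no error.
df["from_series"] = pd.Series(["x", "y"], index=[0, 1])

# 3) A Series with duplicated index labels cannot be reindexed.
try:
    df["from_dupes"] = pd.Series(["x", "y"], index=[100, 100])
    dupes_error = None
except ValueError as exc:
    dupes_error = exc

print(type(list_error).__name__, type(dupes_error).__name__)
print(df)
```

If silent alignment is the concern, one option is to compare the indexes before assigning, for example with new_values.index.equals(df.index), where new_values is a placeholder name for the Series being assigned.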