
Why does my new column not get assigned after using the .sample method?

So I was just answering a question and I came across something interesting:

The dataframe looks like this:

  string1 string2
0     abc     def
1     ghi     jkl
2     mno     pqr
3     stu     vwx

So when I do the following, the assigning of new columns works:

df['string3'] = df.string2

print(df)

  string1 string2 string3
0     abc     def     def
1     ghi     jkl     jkl
2     mno     pqr     pqr
3     stu     vwx     vwx

But when I use pandas.Series.sample, the new column does not get assigned, or at least not in the sampled order:

df['string4'] = df.string2.sample(len(df.string2))
print(df)
  string1 string2 string3 string4
0     abc     def     def     def
1     ghi     jkl     jkl     jkl
2     mno     pqr     pqr     pqr
3     stu     vwx     vwx     vwx

So I tested some things:

Test 1: Using sample without assigning gives the correct output:

df.string2.sample(len(df.string2))

2    pqr
1    jkl
0    def
3    vwx
Name: string2, dtype: object

Test 2: Overwriting an existing column does not work either:

df['string2'] = df.string2.sample(len(df.string2))
print(df)
  string1 string2
0     abc     def
1     ghi     jkl
2     mno     pqr
3     stu     vwx

This works, but why?

df['string2'] = df.string2.sample(len(df.string2)).values
print(df)
  string1 string2
0     abc     jkl
1     ghi     def
2     mno     vwx
3     stu     pqr

Why do I need to explicitly use .values or .tolist() to get the assignment to work?

Erfan asked Mar 04 '23 06:03


1 Answer

pandas is index-sensitive: when you assign a Series to a column, pandas aligns it to the DataFrame on index labels, not on position. sample only reorders the rows, and every value keeps its original index label. So when you assign the sampled Series, each value is matched back to the row that carries its label and the df does not change; it is as if you had called sort_index on the sample first. A NumPy array has no index, so nothing is aligned and the values are written back to the df in their shuffled order, which yields the output you see with .values.
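The difference between label-based and positional assignment can be sketched as follows. This is a minimal illustration that rebuilds the question's frame; the column names `aligned` and `positional` and the `random_state` value are assumptions added only to make the example reproducible and readable.

```python
import pandas as pd

df = pd.DataFrame({
    "string1": ["abc", "ghi", "mno", "stu"],
    "string2": ["def", "jkl", "pqr", "vwx"],
})

# Shuffle the column; random_state only makes the shuffle reproducible.
shuffled = df["string2"].sample(frac=1, random_state=0)

# Assigning the Series aligns on index labels, so the shuffle is undone.
df["aligned"] = shuffled

# Assigning the underlying array is positional, so the shuffled order is kept.
df["positional"] = shuffled.to_numpy()

print(shuffled)
print(df)
```

Here `aligned` always equals `string2`, while `positional` holds the same values in whatever order `sample` produced.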

An edge case: when none of the index labels match, every value becomes NaN:

df['string3']=pd.Series(['aaa','aaa','aaa','aaa'],index=[100,111,112,113])
df
Out[462]: 
  string1 string2 string3
0     abc     vwx     NaN
1     ghi     jkl     NaN
2     mno     dfe     NaN
3     stu     pqr     NaN

This index sensitivity is also what makes conditional assignment with .loc work.

You can always do

df.loc[df.condition,'value']=df.value*100 
# rows not selected by the condition are left unchanged

This is equivalent to what you would do with np.where:

df['value']=np.where(df.condition,df.value*100 ,df.value)
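To check that the two forms agree, here is a small sketch on made-up data. The frame and its `condition` and `value` columns are assumptions chosen to match the names used in the snippets above.

```python
import numpy as np
import pandas as pd

# Hypothetical data: a boolean mask column and a numeric column.
df = pd.DataFrame({"condition": [True, False, True], "value": [1, 2, 3]})

# .loc: the right-hand Series is aligned to the selected rows only.
via_loc = df.copy()
via_loc.loc[via_loc["condition"], "value"] = via_loc["value"] * 100

# np.where: build the whole column positionally, keeping the old value elsewhere.
via_where = df.copy()
via_where["value"] = np.where(
    via_where["condition"], via_where["value"] * 100, via_where["value"]
)

print(via_loc)
print(via_where)
```

Both frames end up with the same `value` column; the .loc form relies on alignment, the np.where form on position.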

Another use case: when I use groupby.apply with a non-aggregating function and try to assign the result back, why does it fail?

df['String4']=df.groupby('string1').apply(lambda x :x['string2']+'aa')

TypeError: incompatible index of inserted column with frame index

Let us look at what groupby.apply returns:

df.groupby('string1').apply(lambda x : x['string2']+'aa')
Out[466]: 
string1   
abc      0    vwxaa
ghi      1    jklaa
mno      2    dfeaa
stu      3    pqraa
Name: string2, dtype

Notice that it adds one more level to the index, so the result has a MultiIndex, while the original df has a single-level index. That mismatch is what causes the error message.


How to fix it?

Reset the index so that only the original index remains (it is the second level of the groupby result), then assign it back:

df['String4']=df.groupby('string1').apply(lambda x : x['string2']+'aa').reset_index(level=0,drop=True)
df
Out[469]: 
  string1 string2 string3 String4
0     abc     vwx     NaN   vwxaa
1     ghi     jkl     NaN   jklaa
2     mno     dfe     NaN   dfeaa
3     stu     pqr     NaN   pqraa
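The effect of dropping the group level can be shown in isolation. In the sketch below the two-level Series is built by hand as a stand-in for the groupby.apply result, so that the example does not depend on how a particular pandas version shapes that result; the name `grouped_like` is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "string1": ["abc", "ghi", "mno", "stu"],
    "string2": ["def", "jkl", "pqr", "vwx"],
})

# Hand-built stand-in for the groupby.apply output:
# level 0 is the group key, level 1 is the original row label.
grouped_like = pd.Series(
    (df["string2"] + "aa").to_numpy(),
    index=pd.MultiIndex.from_arrays(
        [df["string1"], df.index], names=["string1", None]
    ),
    name="string2",
)

# Drop the group-key level so only the original row labels remain.
flat = grouped_like.reset_index(level=0, drop=True)

# The flat index now matches df.index, so alignment succeeds.
df["String4"] = flat
print(df)
```

If the goal is only a per-group elementwise result, `groupby(...)[column].transform(...)` may be a simpler route, since it returns a result indexed like the original frame.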

As Erfan mentioned in the comments: how can we avoid accidentally assigning unwanted values to a pandas.DataFrame?

There are two different kinds of assignment.

1st, with an array, list, or tuple: these CANNOT ALIGN, which means the assignment fails when the object and the df have different lengths.

2nd, with a pandas object: this ALWAYS aligns, and no error is raised even when the lengths differ.

However, when the assigned object has a duplicated index, it raises an error:

df['string3']=pd.Series(['aaa','aaa','aaa','aaa'],index=[100,100,100,100])
ValueError: cannot reindex from a duplicate axis
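The two behaviours, and one possible guard against silent alignment, can be sketched like this. The helper `assign_strict` is hypothetical and not part of pandas; it simply refuses any Series whose index differs from the frame's.

```python
import pandas as pd

df = pd.DataFrame({"string1": ["abc", "ghi", "mno", "stu"]})

# 1) A plain list is positional: a length mismatch raises ValueError.
list_error = None
try:
    df["from_list"] = ["x", "y", "z"]
except ValueError as exc:
    list_error = exc

# 2) A Series aligns: a shorter one is accepted, missing labels become NaN.
df["from_series"] = pd.Series(["x", "y"], index=[0, 1])


def assign_strict(frame, column, series):
    """Hypothetical guard: assign only if the indexes are identical."""
    if not series.index.equals(frame.index):
        raise ValueError(f"index of {column!r} does not match the frame index")
    frame[column] = series


# The guard turns the silent NaN case into an explicit error.
strict_error = None
try:
    assign_strict(df, "strict", pd.Series(["x", "y"], index=[0, 1]))
except ValueError as exc:
    strict_error = exc

print(list_error)
print(df)
print(strict_error)
```

Whether such a guard is worth having depends on the code base; alignment is often exactly what you want, so this is only useful where a mismatched index would indicate a bug.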
BENY answered Mar 07 '23 00:03