I have a data frame in which I want to replace the values in one column with values from another data frame.
import pandas as pd

df = pd.DataFrame({'id1': [1001,1002,1001,1003,1004,1005,1002,1006],
                   'value1': ["a","b","c","d","e","f","g","h"],
                   'value3': ["yes","no","yes","no","no","no","yes","no"]})
dfReplace = pd.DataFrame({'id2': [1001,1002],
                          'value2': ["rep1","rep2"]})
My current solution uses a groupby on the common key together with a loop. Is there a more elegant (faster) way to do this with .map(), .apply(), etc.? I initially wanted to use DataFrame.update(), but that doesn't seem to be the right approach.
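# Group by the column name (not a one-element list) so each key is a scalar id
groups = dfReplace.groupby('id2')
for key, group in groups:
    # Overwrite value1 on every row of df whose id1 matches this key
    df.loc[df['id1'] == key, 'value1'] = group['value2'].values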
Output
df
    id1   value1 value3
0   1001  rep1   yes
1   1002  rep2   no
2   1001  rep1   yes
3   1003  d      no
4   1004  e      no
5   1005  f      no
6   1002  rep2   yes
7   1006  h      no
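The .map() idea mentioned in the question could look roughly like the sketch below. It assumes the ids in dfReplace['id2'] are unique (a lookup Series with duplicate index labels would not work with map), and the name lookup is introduced here only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'id1': [1001, 1002, 1001, 1003, 1004, 1005, 1002, 1006],
                   'value1': ["a", "b", "c", "d", "e", "f", "g", "h"],
                   'value3': ["yes", "no", "yes", "no", "no", "no", "yes", "no"]})
dfReplace = pd.DataFrame({'id2': [1001, 1002],
                          'value2': ["rep1", "rep2"]})

# Build an id -> replacement lookup (hypothetical helper name; assumes unique id2)
lookup = dfReplace.set_index('id2')['value2']

# map() returns NaN for ids without a replacement; fillna() restores the original value
df['value1'] = df['id1'].map(lookup).fillna(df['value1'])
print(df)
```

This keeps the original row order and index of df and avoids the explicit loop.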
You can replace values in all or selected columns of a pandas DataFrame based on a condition by using the DataFrame.loc[] property. loc[] accesses a group of rows and columns by label(s) or a boolean array, and it can also be used to modify the selected values.
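As a minimal illustration of conditional assignment with loc[], here is a sketch on a small made-up frame (the data and the name mask are assumptions for the example, not part of the question):

```python
import pandas as pd

# Small illustrative frame (hypothetical data)
df = pd.DataFrame({'id1': [1001, 1002, 1003],
                   'value1': ["a", "b", "c"]})

# Boolean mask selecting the rows to change
mask = df['id1'] == 1001

# Assign a new value to 'value1' only on the selected rows
df.loc[mask, 'value1'] = "rep1"
print(df)
```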
You can update a PySpark DataFrame column using withColumn(), select(), and sql(). Because DataFrames are distributed, immutable collections, you can't change column values in place; when you change a value using withColumn() or any other approach, PySpark returns a new DataFrame with the updated values.
Try merge():
merge = df.merge(dfReplace, left_on='id1', right_on='id2', how='left')
print(merge)
# .ix was removed in pandas 1.0; .loc does the same label/boolean selection here
merge.loc[merge.id1 == merge.id2, 'value1'] = merge.value2
print(merge)
del merge['id2']
del merge['value2']
print(merge)
Output:
    id1 value1 value3   id2 value2
0  1001      a    yes  1001   rep1
1  1002      b     no  1002   rep2
2  1001      c    yes  1001   rep1
3  1003      d     no   NaN    NaN
4  1004      e     no   NaN    NaN
5  1005      f     no   NaN    NaN
6  1002      g    yes  1002   rep2
7  1006      h     no   NaN    NaN
    id1 value1 value3   id2 value2
0  1001   rep1    yes  1001   rep1
1  1002   rep2     no  1002   rep2
2  1001   rep1    yes  1001   rep1
3  1003      d     no   NaN    NaN
4  1004      e     no   NaN    NaN
5  1005      f     no   NaN    NaN
6  1002   rep2    yes  1002   rep2
7  1006      h     no   NaN    NaN
    id1 value1 value3
0  1001   rep1    yes
1  1002   rep2     no
2  1001   rep1    yes
3  1003      d     no
4  1004      e     no
5  1005      f     no
6  1002   rep2    yes
7  1006      h     no
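A more compact variant of the same merge idea is sketched below. It fills value1 from value2 where a match exists and drops the helper columns in one step; the names merged and result are chosen for this example only.

```python
import pandas as pd

df = pd.DataFrame({'id1': [1001, 1002, 1001, 1003, 1004, 1005, 1002, 1006],
                   'value1': ["a", "b", "c", "d", "e", "f", "g", "h"],
                   'value3': ["yes", "no", "yes", "no", "no", "no", "yes", "no"]})
dfReplace = pd.DataFrame({'id2': [1001, 1002],
                          'value2': ["rep1", "rep2"]})

# Left merge keeps every row of df; unmatched rows get NaN in id2/value2
merged = df.merge(dfReplace, left_on='id1', right_on='id2', how='left')

# Prefer the replacement where present, otherwise keep the original value1
merged['value1'] = merged['value2'].fillna(merged['value1'])

# Drop the helper columns that came from dfReplace
result = merged.drop(columns=['id2', 'value2'])
print(result)
```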
This is a little cleaner if you already have the indexes set to the id columns, but if not, you can still do it in one line:
>>> (dfReplace.set_index('id2').rename( columns = {'value2':'value1'} )
                               .combine_first(df.set_index('id1')))
     value1 value3
1001   rep1    yes
1001   rep1    yes
1002   rep2     no
1002   rep2    yes
1003      d     no
1004      e     no
1005      f     no
1006      h     no
If you split this into three lines and do the renaming and re-indexing separately, you can see that combine_first() by itself is actually very simple:
>>> df = df.set_index('id1')
>>> dfReplace = dfReplace.set_index('id2').rename( columns={'value2':'value1'} )
>>> dfReplace.combine_first(df)