I have a DataFrame in which I want to replace the values in one column with values from another DataFrame.
import pandas as pd

df = pd.DataFrame({'id1': [1001, 1002, 1001, 1003, 1004, 1005, 1002, 1006],
                   'value1': ["a", "b", "c", "d", "e", "f", "g", "h"],
                   'value3': ["yes", "no", "yes", "no", "no", "no", "yes", "no"]})
dfReplace = pd.DataFrame({'id2': [1001, 1002],
                          'value2': ["rep1", "rep2"]})
The DataFrames share a common key, and my current solution uses a groupby with a loop. Is there a more elegant (and faster) way to do this with .map(), .apply(), etc.? I initially wanted to use pd.update(), but that doesn't seem to be the right tool.
groups = dfReplace.groupby('id2')
for key, group in groups:
    df.loc[df['id1'] == key, 'value1'] = group['value2'].values
Output
df
id1 value1 value3
0 1001 rep1 yes
1 1002 rep2 no
2 1001 rep1 yes
3 1003 d no
4 1004 e no
5 1005 f no
6 1002 rep2 yes
7 1006 h no
You can conditionally replace the values of all or selected columns of a pandas DataFrame using the DataFrame.loc[] property. loc[] accesses a group of rows and columns by label(s) or a boolean array, and it can be used both to read and to assign values.
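For example, with the frames above, a single boolean-mask assignment looks like this:

df.loc[df['id1'] == 1001, 'value1'] = 'rep1'  # set value1 to 'rep1' wherever id1 is 1001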
Try merge():
merge = df.merge(dfReplace, left_on='id1', right_on='id2', how='left')
print(merge)
merge.loc[merge.id1 == merge.id2, 'value1'] = merge.value2
print(merge)
del merge['id2']
del merge['value2']
print(merge)
Output:
id1 value1 value3 id2 value2
0 1001 a yes 1001 rep1
1 1002 b no 1002 rep2
2 1001 c yes 1001 rep1
3 1003 d no NaN NaN
4 1004 e no NaN NaN
5 1005 f no NaN NaN
6 1002 g yes 1002 rep2
7 1006 h no NaN NaN
id1 value1 value3 id2 value2
0 1001 rep1 yes 1001 rep1
1 1002 rep2 no 1002 rep2
2 1001 rep1 yes 1001 rep1
3 1003 d no NaN NaN
4 1004 e no NaN NaN
5 1005 f no NaN NaN
6 1002 rep2 yes 1002 rep2
7 1006 h no NaN NaN
id1 value1 value3
0 1001 rep1 yes
1 1002 rep2 no
2 1001 rep1 yes
3 1003 d no
4 1004 e no
5 1005 f no
6 1002 rep2 yes
7 1006 h no
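For the single-column case the question describes, a .map()-based variant avoids the temporary merge columns entirely. A minimal sketch (the lookup/fillna combination is a suggestion, not part of the answer above):

lookup = dfReplace.set_index('id2')['value2']               # Series mapping id -> replacement
df['value1'] = df['id1'].map(lookup).fillna(df['value1'])   # keep the original value where there is no match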
This is a little cleaner if you already have the indexes set to the ids, but even if not, you can still do it in one line:
>>> (dfReplace.set_index('id2').rename(columns={'value2': 'value1'})
...     .combine_first(df.set_index('id1')))
value1 value3
1001 rep1 yes
1001 rep1 yes
1002 rep2 no
1002 rep2 yes
1003 d no
1004 e no
1005 f no
1006 h no
If you separate this into three lines and do the renaming and re-indexing separately, you can see that the combine_first() by itself is actually very simple:
>>> df = df.set_index('id1')
>>> dfReplace = dfReplace.set_index('id2').rename(columns={'value2': 'value1'})
>>> dfReplace.combine_first(df)
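Note that combine_first() leaves the ids in the index, and because the two frames use different index names ('id1' vs. 'id2'), the combined index loses its name. If you want the original flat shape back, one way is shown below (the name-dropping behavior is an assumption worth verifying on your pandas version):

>>> result = dfReplace.combine_first(df)
>>> result.index.name = 'id1'   # restore the name lost in the alignment
>>> result = result.reset_index()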