Note: for simplicity's sake, I'm using a toy example, because copy/pasting dataframes is difficult on Stack Overflow (please let me know if there's an easy way to do this).
Is there a way to merge the values from one dataframe onto another without getting the _X, _Y columns? I'd like the values on one column to replace all zero values of another column.
df1:

    Name   Nonprofit    Business    Education
    X      1            1           0
    Y      0            1           0   <- Y and Z have zero values for Nonprofit and Educ
    Z      0            0           0
    Y      0            1           0

df2:

    Name   Nonprofit    Education
    Y      1            1    <- this df has the correct values.
    Z      1            1

    pd.merge(df1, df2, on='Name', how='outer')

    Name   Nonprofit_X  Business  Education_X  Nonprofit_Y  Education_Y
    Y      1            1         1            1            1
    Y      1            1         1            1            1
    X      1            1         0            nan          nan
    Z      1            1         1            1            1

In a previous post, I tried combine_first and dropna(), but these don't do the job.
I want to replace zeros in df1 with the values in df2. Furthermore, I want all rows with the same Names to be changed according to df2.
    Name   Nonprofit    Business    Education
    Y      1            1           1
    Y      1            1           1
    X      1            1           0
    Z      1            0           1

(To clarify: the value in the 'Business' column where Name = Z should be 0.)
My existing solution does the following: I subset based on the names that exist in df2, and then replace those values with the correct value. However, I'd like a less hacky way to do this.
    pubunis_df = df2
    sdf = df1

    regex = str_to_regex(', '.join(pubunis_df.ORGS))

    pubunis = searchnamesre(sdf, 'ORGS', regex)

    # .ix has been removed from pandas; .loc does the same label-based assignment
    sdf.loc[pubunis.index, ['Education', 'Public']] = 1

    searchnamesre(sdf, 'ORGS', regex)
You can replace values in all or selected columns of a pandas DataFrame based on a condition by using the DataFrame.loc[] property. loc[] accesses a group of rows and columns by label(s) or a boolean array, and it can be used both to read and to modify the values of a DataFrame.
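As a small illustration of conditional assignment with loc[], here is a minimal sketch. The frame and the names in the mask are made up for the example and only loosely mirror the toy data in the question.

```python
import pandas as pd

# Hypothetical toy frame, loosely mirroring the question's df1
df = pd.DataFrame({
    "Name": ["X", "Y", "Z", "Y"],
    "Nonprofit": [1, 0, 0, 0],
    "Education": [0, 0, 0, 0],
})

# Boolean mask: True for the rows whose Name is Y or Z
mask = df["Name"].isin(["Y", "Z"])

# Assign a scalar to the selected rows and columns only
df.loc[mask, ["Nonprofit", "Education"]] = 1

print(df)
```

Because the right-hand side is a scalar, there is nothing to align, so duplicate names and row order do not matter here.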
To replace multiple values in a column, use the replace() method. It can map several old values to several new values in a single DataFrame column. Called on a whole DataFrame instead, it searches every column and replaces each specified value wherever it occurs.
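A minimal sketch of replace() on a single column follows. The "Sector" column and its abbreviations are invented for illustration and do not appear in the question.

```python
import pandas as pd

# Hypothetical data: the 'Sector' column and its codes are illustrative only
df = pd.DataFrame({"Name": ["X", "Y", "Z"], "Sector": ["np", "biz", "edu"]})

# The dict maps old values to new values; values not listed are left alone
df["Sector"] = df["Sector"].replace({"np": "Nonprofit", "edu": "Education"})

print(df)
```

Note that replace() matches on the values themselves, so it cannot on its own express "take the value from another frame for the same Name".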
KSD's answer will raise an error:

    df1 = pd.DataFrame([["X",1,1,0], ["Y",0,1,0], ["Z",0,0,0], ["Y",0,0,0]],columns=["Name","Nonprofit","Business", "Education"])
    df2 = pd.DataFrame([["Y",1,1], ["Z",1,1]],columns=["Name","Nonprofit", "Education"])

    df1.loc[df1.Name.isin(df2.Name), ['Nonprofit', 'Education']] = df2.loc[df2.Name.isin(df1.Name),['Nonprofit', 'Education']].values

    df1.loc[df1.Name.isin(df2.Name), ['Nonprofit', 'Education']] = df2[['Nonprofit', 'Education']].values

    Out[851]:
    ValueError: shape mismatch: value array of shape (2,) could not be broadcast to indexing result of shape (3,)
and EdChum's answer will give us the wrong result:
    df1.loc[df1.Name.isin(df2.Name), ['Nonprofit', 'Education']] = df2[['Nonprofit', 'Education']]
    df1
    Out[852]:
      Name  Nonprofit  Business  Education
    0    X        1.0         1        0.0
    1    Y        1.0         1        1.0
    2    Z        NaN         0        NaN
    3    Y        NaN         1        NaN
Well, it will work safely only if values in column 'Name' are unique and are sorted in both data frames.
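The NaN values come from label alignment: when the right-hand side of a loc assignment is a DataFrame, pandas matches rows by index label, not by the Name column. The sketch below imitates that alignment step with reindex so the mismatch is visible. The variable names target_index, rhs and aligned are just for illustration.

```python
import pandas as pd

df1 = pd.DataFrame(
    [["X", 1, 1, 0], ["Y", 0, 1, 0], ["Z", 0, 0, 0], ["Y", 0, 0, 0]],
    columns=["Name", "Nonprofit", "Business", "Education"],
)
df2 = pd.DataFrame([["Y", 1, 1], ["Z", 1, 1]], columns=["Name", "Nonprofit", "Education"])

# Row labels selected on the left-hand side (rows named Y or Z in df1)
target_index = df1.index[df1["Name"].isin(df2["Name"])]

# The right-hand side keeps df2's own row labels
rhs = df2[["Nonprofit", "Education"]]

# Imitate the alignment: only labels present in both indexes receive a value
aligned = rhs.reindex(index=target_index)

print(list(target_index))  # labels selected in df1
print(list(rhs.index))     # labels available in df2
print(aligned)
```

Only one label is shared between the two indexes, so the other selected rows have nothing to take and end up as NaN, and the one row that does get a value receives it from whichever df2 row happens to carry the same label.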
Here is my answer:
    df1 = df1.merge(df2, on='Name', how="left")
    # 'Business' exists only in df1, so it gets no _x/_y suffix.
    # Prefer df2's value (_y); fall back to df1's value (_x) where df2 has no match.
    df1['Nonprofit_y'] = df1['Nonprofit_y'].fillna(df1['Nonprofit_x'])
    df1['Education_y'] = df1['Education_y'].fillna(df1['Education_x'])
    df1.drop(["Education_x", "Nonprofit_x"], inplace=True, axis=1)
    df1.rename(columns={'Education_y': 'Education', 'Nonprofit_y': 'Nonprofit'}, inplace=True)
Or use update:

    df1 = df1.set_index('Name')
    df2 = df2.set_index('Name')
    df1.update(df2)
    df1.reset_index(inplace=True)

More on update: the columns used as the index in the two data frames do not need to have the same name before calling update. You could try 'Name1' and 'Name2'. It also works when df2 contains extra rows that have no match in df1; those rows simply don't update df1. In other words, df2 doesn't need to be a superset of df1.
Example:
    df1 = pd.DataFrame([["X",1,1,0], ["Y",0,1,0], ["Z",0,0,0], ["Y",0,1,0]],columns=["Name1","Nonprofit","Business", "Education"])
    df2 = pd.DataFrame([["Y",1,1], ["Z",1,1], ['U',1,3]],columns=["Name2","Nonprofit", "Education"])

    df1 = df1.set_index('Name1')
    df2 = df2.set_index('Name2')

    df1.update(df2)
result:
           Nonprofit  Business  Education
    Name1
    X            1.0         1        0.0
    Y            1.0         1        1.0
    Z            1.0         0        1.0
    Y            1.0         1        1.0
Use the boolean mask from isin to filter the df and assign the desired row values from the rhs df:

    In [27]:
    df.loc[df.Name.isin(df1.Name), ['Nonprofit', 'Education']] = df1[['Nonprofit', 'Education']]
    df

    Out[27]:
      Name  Nonprofit  Business  Education
    0    X          1         1          0
    1    Y          1         1          1
    2    Z          1         0          1
    3    Y          1         1          1

    [4 rows x 4 columns]
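If only the zeros should be overwritten, and the main frame may contain repeated names, one option is to look values up by Name with Series.map instead of relying on row position. This is a minimal sketch rather than any of the posted answers: it uses the question's df1/df2 naming, the names lookup, candidate and replace_here are illustrative, and it assumes Name is unique in df2.

```python
import pandas as pd

df1 = pd.DataFrame(
    [["X", 1, 1, 0], ["Y", 0, 1, 0], ["Z", 0, 0, 0], ["Y", 0, 1, 0]],
    columns=["Name", "Nonprofit", "Business", "Education"],
)
df2 = pd.DataFrame([["Y", 1, 1], ["Z", 1, 1]], columns=["Name", "Nonprofit", "Education"])

# Index df2 by Name so each column can serve as a lookup table
# (assumption: Name is unique in df2)
lookup = df2.set_index("Name")

for col in lookup.columns:
    # Value df2 holds for each row's Name; NaN where the Name is not in df2
    candidate = df1["Name"].map(lookup[col])
    # Overwrite only where df1 holds a zero and df2 has a value for that Name
    replace_here = df1[col].eq(0) & candidate.notna()
    df1.loc[replace_here, col] = candidate[replace_here].astype(df1[col].dtype)

print(df1)
```

Columns that exist only in df1, such as Business, are never touched, and rows whose Name is missing from df2 keep their original values.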