I want to merge the rows of a DataFrame that share a common column value, and combine the remaining columns: string values joined with commas, and int values collected into an array/list.
A B C D
1 one 100 value
4 four 400 value
5 five 500 value
2 two 200 value
Expected result:
A B C D
[1,4,5,2] one,four,five,two [100,400,500,200] value
I can use groupby on column D, but how can I apply np.array to columns A and C and ','.join to column B, all at once?
Dynamic solution: string columns are joined and numeric columns are converted to lists with GroupBy.agg:
import numpy as np
# numeric columns become lists; all other columns are joined with commas
f = lambda x: x.tolist() if np.issubdtype(x.dtype, np.number) else ','.join(x)
# similar check that tests for strings instead - https://stackoverflow.com/a/37727662
#f = lambda x: ','.join(x) if np.issubdtype(x.dtype, np.flexible) else x.tolist()
df1 = df.groupby('D').agg(f).reset_index().reindex(columns=df.columns)
print(df1)
A B C D
0 [1, 4, 5, 2] one,four,five,two [100, 400, 500, 200] value
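For a self-contained version of the dynamic approach, here is a minimal sketch that rebuilds the sample frame from the question. It swaps np.issubdtype for pandas.api.types.is_numeric_dtype, which is my substitution rather than what the answer above uses; the intent is that it also accepts pandas extension dtypes. The helper name merge_column is hypothetical.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Sample data from the question.
df = pd.DataFrame({
    "A": [1, 4, 5, 2],
    "B": ["one", "four", "five", "two"],
    "C": [100, 400, 500, 200],
    "D": ["value"] * 4,
})

def merge_column(col):
    """Hypothetical helper: numeric columns become lists, others are comma-joined."""
    if is_numeric_dtype(col.dtype):
        return col.tolist()
    return ",".join(col.astype(str))

# Aggregate every non-key column per group, then restore the original column order.
df1 = df.groupby("D").agg(merge_column).reset_index().reindex(columns=df.columns)
print(df1)
```

Using a named function instead of a lambda makes the dtype rule easier to extend, for example if dates should be handled differently.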
Another solution is to specify the function for each column separately:
df2 = (df.groupby('D')
         .agg({'A': lambda x: x.tolist(), 'B': ','.join, 'C': lambda x: x.tolist()})
         .reset_index()
         .reindex(columns=df.columns))
print(df2)
A B C D
0 [1, 4, 5, 2] one,four,five,two [100, 400, 500, 200] value
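The per-column mapping can also be written with pandas named aggregation. This is a sketch of an equivalent spelling, not part of the answer above; it assumes the same sample data as the question and uses the built-in list in place of lambda x: x.tolist().

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "A": [1, 4, 5, 2],
    "B": ["one", "four", "five", "two"],
    "C": [100, 400, 500, 200],
    "D": ["value"] * 4,
})

# Each keyword names an output column as (source column, aggregation function).
df2 = df.groupby("D", as_index=False).agg(
    A=("A", list),
    B=("B", ",".join),
    C=("C", list),
)

# The grouping key comes first in the result, so restore the original column order.
df2 = df2[list(df.columns)]
print(df2)
```

Named aggregation keeps the output column names explicit, which helps when the aggregated column should be renamed at the same time.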
Alternatively, use GroupBy.apply and build each merged row as a Series:
df = (df.groupby('D')
        .apply(lambda x: pd.Series([list(x.A), ','.join(x.B), list(x.C)]))
        .reset_index().rename({0: 'A', 1: 'B', 2: 'C'}, axis=1))
df = df[['A', 'B', 'C', 'D']]
Output:
A B C D
0 [1, 4, 5, 2] one,four,five,two [100, 400, 500, 200] value