How to merge rows in dataframe with different columns?

I want to merge the rows of a DataFrame that share a common value in one column, joining the remaining string columns with commas and collecting the numeric columns into arrays/lists.

A   B     C    D
1  one   100  value
4  four  400  value
5  five  500  value
2  two   200  value
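For reproducibility, the frame above can be built as follows (a sketch; dtypes assumed from the question, int for A and C, str for B and D):

import pandas as pd

df = pd.DataFrame({'A': [1, 4, 5, 2],
                   'B': ['one', 'four', 'five', 'two'],
                   'C': [100, 400, 500, 200],
                   'D': ['value'] * 4})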

Expected result:

   A                B                 C            D
[1,4,5,2]  one,four,five,two  [100,400,500,200]  value

I can use groupby on column D, but how can I apply np.array to columns A and C and ','.join to column B all at once?

asked Jun 25 '19 by k92

People also ask

How do I merge data frames with different column names?

When the join key has a different name in each DataFrame, pass the names to the left_on and right_on parameters of pandas' merge function instead of the single on parameter.
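A minimal sketch of that (the frames and the key names 'emp_id' and 'id' are invented for illustration):

import pandas as pd

left = pd.DataFrame({'emp_id': [1, 2], 'name': ['anna', 'ben']})
right = pd.DataFrame({'id': [1, 2], 'age': [30, 40]})

# join on differently named key columns
merged = pd.merge(left, right, left_on='emp_id', right_on='id')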

How do I merge two rows in DataFrame?

The concat() function in pandas appends either columns or rows from one DataFrame to another. It does the heavy lifting of concatenating along an axis while performing optional set logic (union or intersection) on the indexes, if any, of the other axes.
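A sketch of both directions (frames invented for illustration; join='inner' demonstrates the set logic on the other axis):

import pandas as pd

a = pd.DataFrame({'x': [1, 2]}, index=[0, 1])
b = pd.DataFrame({'y': [3, 4]}, index=[1, 2])

rows = pd.concat([a, b])                        # stack rows (axis=0)
cols = pd.concat([a, b], axis=1, join='inner')  # align on index, keep the intersection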

How do I concatenate rows in pandas?

Use pandas.concat() to concatenate two or more pandas DataFrames across rows or columns. When you concat() two DataFrames on rows, it creates a new DataFrame containing all rows of both; essentially it appends one DataFrame to the other.
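For row-wise concatenation specifically, ignore_index=True discards the original indexes and builds a fresh RangeIndex (invented frames again):

import pandas as pd

top = pd.DataFrame({'A': [1, 2], 'B': ['one', 'two']})
bottom = pd.DataFrame({'A': [3], 'B': ['three']})

# all rows of both frames, reindexed 0..n-1
combined = pd.concat([top, bottom], ignore_index=True)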

How do I merge two DataFrames with different columns in spark?

Here the first dataframe (dataframe1) has the columns ['ID', 'NAME', 'Address'], and the second dataframe (dataframe2) has the columns ['ID', 'Age']. To add the Age column to the first dataframe, and NAME and Address to the second, we can use the lit() function, available in pyspark.sql.functions.
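A minimal PySpark sketch of that idea (the SparkSession setup and sample data are assumed for illustration; unionByName then matches the columns up by name):

from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder.getOrCreate()
dataframe1 = spark.createDataFrame([(1, 'a', 'x')], ['ID', 'NAME', 'Address'])
dataframe2 = spark.createDataFrame([(2, 30)], ['ID', 'Age'])

# add the missing columns as null literals so both schemas match
df1 = dataframe1.withColumn('Age', lit(None))
df2 = dataframe2.withColumn('NAME', lit(None)).withColumn('Address', lit(None))

result = df1.unionByName(df2)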


2 Answers

Dynamic solution - string columns are joined and numeric columns are converted to lists with GroupBy.agg:

import numpy as np

# numeric columns become lists, everything else is comma-joined
f = lambda x: x.tolist() if np.issubdtype(x.dtype, np.number) else ','.join(x)
# alternative that tests for strings instead - https://stackoverflow.com/a/37727662
#f = lambda x: ','.join(x) if np.issubdtype(x.dtype, np.flexible) else x.tolist()
df1 = df.groupby('D').agg(f).reset_index().reindex(columns=df.columns)
print(df1)
              A                  B                     C      D
0  [1, 4, 5, 2]  one,four,five,two  [100, 400, 500, 200]  value

Another solution is to specify the function for each column separately:

df2 = (df.groupby('D')
         .agg({'A': lambda x: x.tolist(), 'B': ','.join, 'C': lambda x: x.tolist()})
         .reset_index()
         .reindex(columns=df.columns))

print(df2)

              A                  B                     C      D
0  [1, 4, 5, 2]  one,four,five,two  [100, 400, 500, 200]  value
answered Nov 01 '22 by jezrael


# build one row per group: lists for A and C, a comma-joined string for B
df = (df.groupby('D')
        .apply(lambda x: pd.Series([list(x.A), ','.join(x.B), list(x.C)]))
        .reset_index()
        .rename({0: 'A', 1: 'B', 2: 'C'}, axis=1))

df = df[['A', 'B', 'C', 'D']]

Output

              A                  B                     C      D
0  [1, 4, 5, 2]  one,four,five,two  [100, 400, 500, 200]  value
answered Nov 01 '22 by iamklaus