Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to merge and groupby between seperate dataframes

Tags:

python

pandas

I have two dataframes that I want to merge/groupby. They are below:

df_1


        words      start   stop
0            Oh,    6.72   7.21
1          okay,    7.26   8.01
2             go  12.82   12.90
3         ahead.   12.91  12.94
4             NaN  15.29  15.62
5             NaN  15.63  15.99
6             NaN  16.09  16.36
7             NaN  16.37  16.96
8             NaN  17.88  18.36
9             NaN  18.37  19.36

df_2

data     start        stop
10         1.0        3.5
14         4.0       8.5
11         9.0       13.5
12        14.0       20.5

I want to merge df_1.words onto df_2, but group all values in df_1.words where df_1.start is in between df_2.start and df_2.stop. It should look like this:

df_2

data     start        stop   words
10         1.0        3.5     NaN
14         4.0       8.5      Oh, okay,
11         9.0       13.5     go ahead.
12        14.0       20.5     NaN, NaN, NaN, NaN, NaN, NaN
like image 833
connor449 Avatar asked Dec 09 '19 21:12

connor449


People also ask

How do I merge Groupby in pandas?

Step 1: split the data into groups by creating a groupby object from the original DataFrame; Step 2: apply a function, in this case, an aggregation function that computes a summary statistic (you can also transform or filter your data in this step); Step 3: combine the results into a new DataFrame.

Which are the 3 main ways of combining DataFrames together?

Combine data from multiple files into a single DataFrame using merge and concat. Combine two DataFrames using a unique ID found in both DataFrames. Employ to_csv to export a DataFrame in CSV format. Join DataFrames using common fields (join keys).

How do I combine multiple DataFrames?

Pandas merge() function is used to merge multiple Dataframes. We can use either pandas. merge() or DataFrame. merge() to merge multiple Dataframes.

How do I merge columns from different DataFrames?

It is possible to join the different columns is using concat() method. DataFrame: It is dataframe name. axis: 0 refers to the row axis and1 refers the column axis. join: Type of join.


2 Answers

If the two dataframes are not too long, we can do a cross-join:

(df2.assign(dummy=1)
    .merge(df.assign(dummy=1), on='dummy',
           how='left', suffixes=['','_r']
          )
    .query('start<=start_r<=stop')
    .groupby(['data','start','stop'],as_index=False)
    .agg({'words':list})
)

Output:

   data  start  stop                           words
0    11    9.0  13.5                    [go, ahead.]
1    12   14.0  20.5  [nan, nan, nan, nan, nan, nan]
2    14    4.0   8.5                    [Oh,, okay,]
like image 134
Quang Hoang Avatar answered Oct 16 '22 11:10

Quang Hoang


If the bin edges do not overlap as in your example, use pd.cut, with an IntervalIndex to group the first DataFrame. This allows you to be closed on both edges. Then select from with the 'stop' column from df_2 to get the aggregated result.

import pandas as pd

idx = pd.Index([pd.Interval(*x, closed='both') for x in zip(df_2.start, df_2.stop)])

s = df_1.groupby(pd.cut(df_1.start, idx)).words.agg(list)

# Closed on both, can use `'stop'` to align
df_2['words'] = s[df_2.stop].to_list()

print(df_2)
   data  start  stop                           words
0    10    1.0   3.5                              []
1    14    4.0   8.5                    [Oh,, okay,]
2    11    9.0  13.5                    [go, ahead.]
3    12   14.0  20.5  [nan, nan, nan, nan, nan, nan]
like image 32
ALollz Avatar answered Oct 16 '22 12:10

ALollz