I have two dataframes that I want to merge/groupby. They are below:
df_1
words start stop
0 Oh, 6.72 7.21
1 okay, 7.26 8.01
2 go 12.82 12.90
3 ahead. 12.91 12.94
4 NaN 15.29 15.62
5 NaN 15.63 15.99
6 NaN 16.09 16.36
7 NaN 16.37 16.96
8 NaN 17.88 18.36
9 NaN 18.37 19.36
df_2
data start stop
10 1.0 3.5
14 4.0 8.5
11 9.0 13.5
12 14.0 20.5
I want to merge df_1.words onto df_2, but group all values in df_1.words where df_1.start is in between df_2.start and df_2.stop. It should look like this:
df_2
data start stop words
10 1.0 3.5 NaN
14 4.0 8.5 Oh, okay,
11 9.0 13.5 go ahead.
12 14.0 20.5 NaN, NaN, NaN, NaN, NaN, NaN
Step 1: split the data into groups by creating a groupby object from the original DataFrame; Step 2: apply a function, in this case, an aggregation function that computes a summary statistic (you can also transform or filter your data in this step); Step 3: combine the results into a new DataFrame.
Combine data from multiple files into a single DataFrame using merge and concat. Combine two DataFrames using a unique ID found in both DataFrames. Employ to_csv to export a DataFrame in CSV format. Join DataFrames using common fields (join keys).
Pandas merge() function is used to merge multiple Dataframes. We can use either pandas. merge() or DataFrame. merge() to merge multiple Dataframes.
It is possible to join the different columns is using concat() method. DataFrame: It is dataframe name. axis: 0 refers to the row axis and1 refers the column axis. join: Type of join.
If the two dataframes are not too long, we can do a cross-join:
(df2.assign(dummy=1)
.merge(df.assign(dummy=1), on='dummy',
how='left', suffixes=['','_r']
)
.query('start<=start_r<=stop')
.groupby(['data','start','stop'],as_index=False)
.agg({'words':list})
)
Output:
data start stop words
0 11 9.0 13.5 [go, ahead.]
1 12 14.0 20.5 [nan, nan, nan, nan, nan, nan]
2 14 4.0 8.5 [Oh,, okay,]
If the bin edges do not overlap as in your example, use pd.cut
, with an IntervalIndex
to group the first DataFrame. This allows you to be closed on both edges. Then select from with the 'stop' column from df_2
to get the aggregated result.
import pandas as pd
idx = pd.Index([pd.Interval(*x, closed='both') for x in zip(df_2.start, df_2.stop)])
s = df_1.groupby(pd.cut(df_1.start, idx)).words.agg(list)
# Closed on both, can use `'stop'` to align
df_2['words'] = s[df_2.stop].to_list()
print(df_2)
data start stop words
0 10 1.0 3.5 []
1 14 4.0 8.5 [Oh,, okay,]
2 11 9.0 13.5 [go, ahead.]
3 12 14.0 20.5 [nan, nan, nan, nan, nan, nan]
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With