Merge rows of a pandas DataFrame based on a condition

Q: Is iterrows faster than apply?

No. The results show that apply massively outperforms iterrows, which is slow mainly because it constructs a new Series object for every row. itertuples is also quicker than iterrows, so if an explicit loop is required, try itertuples instead.
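A rough way to check this on your own machine is sketched below. The frame, the column names `a` and `b`, and the four helper functions are made up for illustration, and the timings will vary with pandas version and data size, so none are quoted here.

```python
import timeit

import pandas as pd

# Hypothetical test frame: two integer columns, 1000 rows.
frame = pd.DataFrame({"a": range(1000), "b": range(1000)})


def with_iterrows(df):
    # iterrows yields (index, Series) pairs, one Series per row.
    return [row["a"] + row["b"] for _, row in df.iterrows()]


def with_itertuples(df):
    # itertuples yields lightweight namedtuples instead of Series.
    return [row.a + row.b for row in df.itertuples(index=False)]


def with_apply(df):
    # apply with axis=1 calls the function once per row.
    return df.apply(lambda row: row["a"] + row["b"], axis=1).tolist()


def vectorized(df):
    # Column arithmetic, with no Python-level loop over rows.
    return (df["a"] + df["b"]).tolist()


for fn in (with_iterrows, with_itertuples, with_apply, vectorized):
    seconds = timeit.timeit(lambda fn=fn: fn(frame), number=3)
    print(f"{fn.__name__:16s} {seconds:.4f} s for 3 runs")
```

All four functions return the same values, so only the running time differs. Where a vectorized expression exists, as it does here, it is worth timing alongside the row-wise options.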

Q: How do I merge rows in a DataFrame in Python?

We can use the pandas concat function to append either rows or columns from one DataFrame to another. When we concatenate DataFrames, we need to specify the axis: axis=0 (the default) tells pandas to stack the second DataFrame under the first one, while axis=1 places it alongside the first as additional columns.
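As a small illustration of the two axes, here is a sketch with two made-up frames (`top` and `bottom` are hypothetical names, not part of the question's data):

```python
import pandas as pd

# Two hypothetical subsets with the same columns.
top = pd.DataFrame({"Start": [1, 10], "End": [2, 22]})
bottom = pd.DataFrame({"Start": [30], "End": [42]})

# axis=0 stacks rows; ignore_index=True renumbers the result 0..n-1.
stacked = pd.concat([top, bottom], axis=0, ignore_index=True)

# axis=1 aligns on the index and places the frames side by side.
side_by_side = pd.concat([top, top.add_suffix("_copy")], axis=1)

print(stacked)
print(side_by_side)
```

Without `ignore_index=True` the stacked frame keeps the original index labels, so the label 0 would appear twice.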

Q: How do I merge two rows in DataFrame?

The concat() function can be used to concatenate two DataFrames by appending the rows of one to the other. The merge() function is the equivalent of the SQL JOIN clause: it combines rows from two DataFrames on matching key values. 'left', 'right', 'inner' and 'outer' joins are all possible through the how parameter.
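To make the JOIN analogy concrete, here is a minimal sketch. The frames `events` and `labels` and the key column `event_id` are invented for the example.

```python
import pandas as pd

# Hypothetical frames sharing the key column "event_id".
events = pd.DataFrame({"event_id": [1, 2, 3], "Start": [1, 10, 30]})
labels = pd.DataFrame({"event_id": [2, 3, 4], "label": ["b", "c", "d"]})

# Inner join: keep only keys present in both frames.
inner = events.merge(labels, on="event_id", how="inner")

# Left join: keep every row of `events`; unmatched labels become NaN.
left = events.merge(labels, on="event_id", how="left")

print(inner)
print(left)
```

Note that merge() matches rows across two DataFrames by key. It does not collapse consecutive rows within a single DataFrame, which is what the question below asks for.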

Q: How do you use between conditions in Pandas?

The Series.between() function returns a boolean Series equivalent to left <= series <= right. The result is True wherever the corresponding Series element lies between the boundary values left and right, with both boundaries included by default. NA values are treated as False.
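A short sketch of between() on made-up values follows. The string form `inclusive="neither"` assumes pandas 1.3 or later; older versions take a boolean instead.

```python
import pandas as pd

# Hypothetical values, including one missing entry.
starts = pd.Series([1, 10, 30, 100, None])

# Both boundaries are included by default.
mask = starts.between(10, 100)

# Exclude both boundaries (string option assumes pandas >= 1.3).
strict = starts.between(10, 100, inclusive="neither")

print(mask.tolist())    # [False, True, True, True, False]
print(strict.tolist())  # [False, False, True, False, False]

# The mask can be used directly to filter the Series.
print(starts[mask])
```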

Tags: python, pandas, dataframe

I have a DataFrame df containing a set of events (rows).

import pandas as pd

df = pd.DataFrame(
    data=[[1, 2, 7, 10],
          [10, 22, 1, 30],
          [30, 42, 2, 10],
          [100, 142, 22, 1],
          [143, 152, 2, 10],
          [160, 162, 12, 11]],
    columns=['Start', 'End', 'Value1', 'Value2'])

 df
Out[15]: 
   Start  End  Value1  Value2
0      1    2       7      10
1     10   22       1      30
2     30   42       2      10
3    100  142      22       1
4    143  152       2      10
5    160  162      12      11

If two (or more) consecutive events are at most 10 apart, that is, the gap between the End of one event and the Start of the next is <= 10, I would like to merge them: use the Start of the first event, the End of the last, and the sums of Value1 and Value2.

In the example above df becomes:

 df
Out[15]: 
   Start  End  Value1  Value2
0      1   42      10      50
1    100  162      36      22


asked Oct 13 '17 14:10

gabboshow

1 Answers

That's totally possible:

df.groupby(((df.Start - df.End.shift(1)) > 10).cumsum()).agg({'Start': 'min', 'End': 'max', 'Value1': 'sum', 'Value2': 'sum'})

Explanation:

start_end_differences = df.Start - df.End.shift(1)  # shift(1) moves End down one row, so each Start is compared with the previous row's End
threshold_selector = start_end_differences > 10  # boolean Series: True where the gap to the previous event is more than 10 (the first row compares against NaN and is False)
groups = threshold_selector.cumsum()  # running count of the Trues: an integer group label that starts at 0 and increases at every large gap
df.groupby(groups).agg({'Start': 'min'})  # aggregate per group; only Start is shown here, the other columns work the same way
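To see what each intermediate step holds for the question's data, the sketch below collects them in one table. The names `gap`, `is_new_group`, and `steps` are chosen for the illustration.

```python
import pandas as pd

df = pd.DataFrame(
    data=[[1, 2, 7, 10],
          [10, 22, 1, 30],
          [30, 42, 2, 10],
          [100, 142, 22, 1],
          [143, 152, 2, 10],
          [160, 162, 12, 11]],
    columns=['Start', 'End', 'Value1', 'Value2'])

# Gap between each Start and the previous row's End (NaN for the first row).
gap = df['Start'] - df['End'].shift(1)

# True marks the first row of a new group.
is_new_group = gap > 10

# The cumulative sum turns those markers into group labels.
groups = is_new_group.cumsum()

steps = pd.DataFrame({'gap': gap, 'is_new_group': is_new_group, 'group': groups})
print(steps)
```

The gaps are NaN, 8, 8, 58, 1 and 8, so only the fourth row starts a new group and the labels come out as 0, 0, 0, 1, 1, 1.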

Here is a generalized solution that remains agnostic of the other columns:

cols = df.columns.difference(['Start', 'End'])
grps = df.Start.sub(df.End.shift()).gt(10).cumsum()
gpby = df.groupby(grps)
gpby.agg(dict(Start='min', End='max')).join(gpby[cols].sum())

   Start  End  Value1  Value2
0      1   42      10      50
1    100  162      36      22
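If the threshold needs to vary, the same idea can be wrapped in a function. This is a sketch rather than part of the original answers: the name `merge_close_events` and the parameter `max_gap` are hypothetical, and it assumes the rows are already sorted by Start and that every column other than Start and End should be summed.

```python
import pandas as pd


def merge_close_events(events, max_gap=10):
    """Merge consecutive rows whose Start is within max_gap of the previous End.

    Assumes `events` is sorted by Start and that all columns other than
    Start and End are numeric and should be summed.
    """
    value_cols = list(events.columns.difference(['Start', 'End']))
    # A new group begins wherever the gap to the previous event exceeds max_gap.
    groups = events['Start'].sub(events['End'].shift()).gt(max_gap).cumsum()
    grouped = events.groupby(groups)
    merged = grouped.agg(Start=('Start', 'min'), End=('End', 'max'))
    merged = merged.join(grouped[value_cols].sum())
    return merged.reset_index(drop=True)


df = pd.DataFrame(
    data=[[1, 2, 7, 10],
          [10, 22, 1, 30],
          [30, 42, 2, 10],
          [100, 142, 22, 1],
          [143, 152, 2, 10],
          [160, 162, 12, 11]],
    columns=['Start', 'End', 'Value1', 'Value2'])

merged = merge_close_events(df, max_gap=10)
print(merged)
```

With `max_gap=10` this reproduces the two merged rows shown above. A smaller threshold such as `max_gap=0` leaves all six events separate, because every gap in the sample data is positive.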

answered Sep 20 '22 17:09

Jan Zeiseweis
