
Groupby two columns ignoring order of pairs

Suppose we have a dataframe that looks like this:

    start   stop   duration
0   A       B      1
1   B       A      2
2   C       D      2
3   D       C      0

What's the best way to construct a list of: i) start/stop pairs; ii) count of start/stop pairs; iii) avg duration of start/stop pairs? In this case, order should not matter: (A,B)=(B,A).

Desired output: [[start,stop,count,avg duration]]

In this example: [[A,B,2,1.5],[C,D,2,1]]

Caerus asked Dec 07 '18 03:12

1 Answer

Sort the first two columns row-wise so that each pair is stored in a consistent order (you can do this in place, or on a copy; I've done the former), then groupby and agg:

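import numpy as np

# Sort each row's (start, stop) pair so that (B, A) becomes (A, B).
df[['start', 'stop']] = np.sort(df[['start', 'stop']], axis=1)

# Group by the normalized pair, then count the rows and average the duration.
(df.groupby(['start','stop'])
   .duration
   .agg(['count', 'mean'])
   .reset_index()
   .values
   .tolist())
# [['A', 'B', 2, 1.5], ['C', 'D', 2, 1.0]]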
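
For anyone who would rather not overwrite the original columns, here is a minimal self-contained sketch of the copy-based variant mentioned above. It rebuilds the sample dataframe from the question; the helper name pair_stats is only an illustrative choice, not something from pandas or the original post.

```python
import numpy as np
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "start": ["A", "B", "C", "D"],
    "stop": ["B", "A", "D", "C"],
    "duration": [1, 2, 2, 0],
})


def pair_stats(frame):
    """Return [[start, stop, count, mean duration]], treating (A, B) and (B, A) as one pair.

    Hypothetical helper for illustration; works on a copy so `frame` is not modified.
    """
    out = frame.copy()
    # Sort each row's endpoints so both orderings of a pair share the same key.
    out[["start", "stop"]] = np.sort(out[["start", "stop"]].to_numpy(), axis=1)
    return (
        out.groupby(["start", "stop"])["duration"]
        .agg(["count", "mean"])
        .reset_index()
        .to_numpy()
        .tolist()
    )


result = pair_stats(df)
print(result)
```

Because the sort happens on a copy, df still holds the pairs in their original order afterwards, which may matter if direction is needed elsewhere in the analysis.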
cs95 answered Oct 09 '22 09:10