Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

stop groupby from making 2 combination same pair in python dataframe

Tags:

python

pandas

I am working on IPL dataset from Kaggle (https://www.kaggle.com/manasgarg/ipl). I want to sum up the runs made by two people as pair and I have prepared my data. When I am trying a GROUPBY on the dataframe columns (batsman and non_striker) it is making 2 combination of the same pair. like (a,b) and (b,a) - rather I wish it should consider it as same. As I can't drop rows any further.

import pandas as pd

df = pd.read_csv("C:\\Users\\Yash\\AppData\\Local\\Programs\\Python\\Python36-32\\Machine Learning\\IPL\\deliveries.csv")
df = df[(df["is_super_over"] != 1)]
df["pri_key"] = df["match_id"].astype(str) + "-" + df["inning"].astype(str)
openners = df[(df["over"] == 1) & (df["ball"] == 1)]
openners = openners[["pri_key", "batsman", "non_striker"]]
openners = openners.rename(columns = {"batsman":"batter1", "non_striker":"batter2"})
df = pd.merge(df, openners, on="pri_key")
df = df[["batsman", "non_striker", "batter1", "batter2", "batsman_runs"]]
df = df[((df["batsman"] == df["batter1"]) | (df["batsman"] == df["batter2"])) 
    & ((df["non_striker"] == df["batter1"]) | (df["non_striker"] == df["batter2"]))]

df1 = df.groupby(["batsman" , "non_striker"], group_keys = False)["batsman_runs"].agg("sum")
df1.nlargest(10)

Result:
batsman      non_striker
DA Warner    S Dhawan       1294
S Dhawan     DA Warner       823
RV Uthappa   G Gambhir       781
DR Smith     BB McCullum     684
CH Gayle     V Kohli         674
MEK Hussey   M Vijay         666
M Vijay      MEK Hussey      629
G Gambhir    RV Uthappa      611
BB McCullum  DR Smith        593
CH Gayle     TM Dilshan      537

and, I want to keep 1 pair as unique

for those who don't understand cricket I have a dataframe

batsman    non_striker    runs
a              b            2
a              b            3
b              a            1
c              d            6
d              c            1
d              c            4
b              a            3
e              f            1
f              e            2
f              e            6

df1 = df.groupby(["batsman" , "non_striker"], group_keys = False)["batsman_runs"].agg("sum")
    df1.nlargest(30)

output:
batsman    non_striker    runs
  a            b            5
  b            a            4
  c            d            6
  d            c            5
  e            f            1
  f            e            8

expected output:
batsman    non_striker    runs
  a            b            9
  c            d            11
  e            f            9

what should I do? Please advise....

like image 396
Yash Mishra Avatar asked Nov 26 '18 19:11

Yash Mishra


People also ask

How do you split a Groupby in pandas?

Step 1: split the data into groups by creating a groupby object from the original DataFrame; Step 2: apply a function, in this case, an aggregation function that computes a summary statistic (you can also transform or filter your data in this step); Step 3: combine the results into a new DataFrame.

What does Group_by do in pandas?

What is the GroupBy function? Pandas' GroupBy is a powerful and versatile function in Python. It allows you to split your data into separate groups to perform computations for better analysis.

Does pandas Groupby return series?

When the series are of different lengths, it returns a multi-indexed series. This returns a a Series object. However, if every series has the same length, then it pivots this into a DataFrame .

How do I find a unique pair in a data frame?

To find the unique pair combinations of an R data frame column values, we can use combn function along with unique function.


2 Answers

You can sort the batsman and non_striker and then group the data

df[['batsman', 'non_striker']] = df[['batsman', 'non_striker']].apply(sorted, axis=1) 
df.groupby(['batsman', 'non_striker']).batsman_runs.sum().nlargest(10)

Edit: You can also use numpy for sorting the columns, which will be faster than using pandas sorted

df[['batsman', 'non_striker']] = np.sort(df[['batsman', 'non_striker']],1)
df.groupby(['batsman', 'non_striker'], sort = False).batsman_runs.sum().nlargest(10).sort_index()

Either way, you will get,

batsman         non_striker
CH Gayle        V Kohli        2650
DA Warner       S Dhawan       2242
AB de Villiers  V Kohli        2135
G Gambhir       RV Uthappa     1795
M Vijay         MEK Hussey     1302
BB McCullum     DR Smith       1277
KA Pollard      RG Sharma      1220
MEK Hussey      SK Raina       1129
AT Rayudu       RG Sharma      1121
AM Rahane       SR Watson      1118
like image 169
Vaishali Avatar answered Oct 26 '22 18:10

Vaishali


Craete a new DataFrame using np.sort. Then groupby and sum.

import pandas as pd
import numpy as np

df1 = pd.DataFrame(np.sort(df[['batsman', 'non_striker']].values,1), 
                   index=df.index,
                   columns=['player_1', 'player_2']).assign(runs = df.runs)

df1.groupby(['player_1', 'player_2']).runs.sum()

Output:

player_1  player_2
a         b            9
c         d           11
e         f            9
Name: runs, dtype: int64
like image 1
ALollz Avatar answered Oct 26 '22 18:10

ALollz