Stop groupby from creating two combinations of the same pair in a pandas DataFrame

Tags: python, pandas

I am working on the IPL dataset from Kaggle (https://www.kaggle.com/manasgarg/ipl). I want to sum the runs scored by two batsmen as a pair, and I have prepared my data accordingly. When I group the DataFrame by the batsman and non_striker columns, groupby produces two combinations for the same pair, such as (a, b) and (b, a); I want it to treat these as the same pair. I can't drop any more rows.

import pandas as pd

df = pd.read_csv("C:\\Users\\Yash\\AppData\\Local\\Programs\\Python\\Python36-32\\Machine Learning\\IPL\\deliveries.csv")
df = df[(df["is_super_over"] != 1)]
df["pri_key"] = df["match_id"].astype(str) + "-" + df["inning"].astype(str)
openners = df[(df["over"] == 1) & (df["ball"] == 1)]
openners = openners[["pri_key", "batsman", "non_striker"]]
openners = openners.rename(columns = {"batsman":"batter1", "non_striker":"batter2"})
df = pd.merge(df, openners, on="pri_key")
df = df[["batsman", "non_striker", "batter1", "batter2", "batsman_runs"]]
df = df[((df["batsman"] == df["batter1"]) | (df["batsman"] == df["batter2"])) 
    & ((df["non_striker"] == df["batter1"]) | (df["non_striker"] == df["batter2"]))]

df1 = df.groupby(["batsman" , "non_striker"], group_keys = False)["batsman_runs"].agg("sum")
df1.nlargest(10)

Result:
batsman      non_striker
DA Warner    S Dhawan       1294
S Dhawan     DA Warner       823
RV Uthappa   G Gambhir       781
DR Smith     BB McCullum     684
CH Gayle     V Kohli         674
MEK Hussey   M Vijay         666
M Vijay      MEK Hussey      629
G Gambhir    RV Uthappa      611
BB McCullum  DR Smith        593
CH Gayle     TM Dilshan      537

I want each pair to appear only once.

For those who don't follow cricket, here is a simplified DataFrame:

batsman    non_striker    runs
a              b            2
a              b            3
b              a            1
c              d            6
d              c            1
d              c            4
b              a            3
e              f            1
f              e            2
f              e            6

df1 = df.groupby(["batsman", "non_striker"], group_keys=False)["runs"].agg("sum")
df1.nlargest(30)

output:
batsman    non_striker    runs
  a            b            5
  b            a            4
  c            d            6
  d            c            5
  e            f            1
  f            e            8

expected output:
batsman    non_striker    runs
  a            b            9
  c            d            11
  e            f            9

What should I do? Please advise.
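
For anyone who wants to reproduce the problem, the simplified table above can be built as a small DataFrame. This is only a sketch of the toy data, using the column name runs from the table rather than batsman_runs from the real dataset; it shows that grouping on the ordered pair keeps (a, b) and (b, a) apart.

```python
import pandas as pd

# Toy data copied row by row from the simplified table above.
df = pd.DataFrame({
    "batsman":     list("aabcddbeff"),
    "non_striker": list("bbadccafee"),
    "runs":        [2, 3, 1, 6, 1, 4, 3, 1, 2, 6],
})

# groupby treats (batsman, non_striker) as an ordered key,
# so (a, b) and (b, a) end up in separate groups.
split_totals = df.groupby(["batsman", "non_striker"])["runs"].sum()
print(split_totals)
```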

asked Nov 26 '18 19:11

Yash Mishra

2 Answers

You can sort the batsman and non_striker values within each row and then group the data:

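# result_type='broadcast' keeps the original two-column shape;
# without it, recent pandas versions return a Series of lists here.
df[['batsman', 'non_striker']] = df[['batsman', 'non_striker']].apply(sorted, axis=1, result_type='broadcast')
df.groupby(['batsman', 'non_striker']).batsman_runs.sum().nlargest(10)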

Edit: You can also use NumPy to sort the two columns, which will be faster than applying Python's built-in sorted row by row:

import numpy as np

df[['batsman', 'non_striker']] = np.sort(df[['batsman', 'non_striker']], axis=1)
df.groupby(['batsman', 'non_striker'], sort=False).batsman_runs.sum().nlargest(10)

Either way, you will get:

batsman         non_striker
CH Gayle        V Kohli        2650
DA Warner       S Dhawan       2242
AB de Villiers  V Kohli        2135
G Gambhir       RV Uthappa     1795
M Vijay         MEK Hussey     1302
BB McCullum     DR Smith       1277
KA Pollard      RG Sharma      1220
MEK Hussey      SK Raina       1129
AT Rayudu       RG Sharma      1121
AM Rahane       SR Watson      1118
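
As a rough check that the two approaches agree, here is a minimal sketch on the question's toy data rather than the real IPL file. The helper names totals_with_sorted and totals_with_np_sort are made up for this illustration, and the column is called runs as in the toy table. The timeit calls at the end are one way to compare speed; the numbers depend on your machine and data size.

```python
import timeit

import numpy as np
import pandas as pd

# Toy data from the question (column "runs" stands in for batsman_runs).
df = pd.DataFrame({
    "batsman":     list("aabcddbeff"),
    "non_striker": list("bbadccafee"),
    "runs":        [2, 3, 1, 6, 1, 4, 3, 1, 2, 6],
})
cols = ["batsman", "non_striker"]


def totals_with_sorted(frame):
    """Hypothetical helper: row-wise built-in sorted(), then groupby."""
    out = frame.copy()
    out[cols] = out[cols].apply(sorted, axis=1, result_type="broadcast")
    return out.groupby(cols)["runs"].sum()


def totals_with_np_sort(frame):
    """Hypothetical helper: np.sort along each row, then groupby."""
    out = frame.copy()
    out[cols] = np.sort(out[cols].to_numpy(), axis=1)
    return out.groupby(cols)["runs"].sum()


a = totals_with_sorted(df)
b = totals_with_np_sort(df)
print(a)
print(b)

# Optional timing comparison; results vary by machine and frame size.
print(timeit.timeit(lambda: totals_with_sorted(df), number=20))
print(timeit.timeit(lambda: totals_with_np_sort(df), number=20))
```

Both helpers work on a copy, so the original batsman and non_striker columns are left as they were.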

answered Oct 26 '22 18:10

Vaishali

Create a new DataFrame using np.sort, then groupby and sum.

import pandas as pd
import numpy as np

df1 = pd.DataFrame(np.sort(df[['batsman', 'non_striker']].values,1), 
                   index=df.index,
                   columns=['player_1', 'player_2']).assign(runs = df.runs)

df1.groupby(['player_1', 'player_2']).runs.sum()

Output:

player_1  player_2
a         b            9
c         d           11
e         f            9
Name: runs, dtype: int64
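
If you would rather not build a second DataFrame or overwrite any columns, one possible variation is to pass the sorted arrays straight to groupby as grouping keys. This is a sketch on the question's toy data; the variable name pairs is chosen here for illustration, and the level names player_1 and player_2 follow the ones used above.

```python
import numpy as np
import pandas as pd

# Toy data from the question.
df = pd.DataFrame({
    "batsman":     list("aabcddbeff"),
    "non_striker": list("bbadccafee"),
    "runs":        [2, 3, 1, 6, 1, 4, 3, 1, 2, 6],
})

# Sort the two names within each row so (a, b) and (b, a) share one key.
pairs = np.sort(df[["batsman", "non_striker"]].to_numpy(), axis=1)

# groupby accepts arrays with one entry per row as grouping keys,
# so df itself is not modified.
totals = (
    df.groupby([pairs[:, 0], pairs[:, 1]])["runs"]
      .sum()
      .rename_axis(["player_1", "player_2"])
)
print(totals)
```

This keeps the information about who was actually on strike in df, which is lost if the batsman and non_striker columns are overwritten in place.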

answered Oct 26 '22 18:10

ALollz
