I am working on the IPL dataset from Kaggle (https://www.kaggle.com/manasgarg/ipl). I want to sum the runs scored by two batsmen as a pair, and I have prepared my data. When I group by the dataframe columns batsman and non_striker, each pair shows up twice, as (a, b) and as (b, a). I want both orderings to be treated as the same pair. I can't drop any more rows.
import pandas as pd
df = pd.read_csv("C:\\Users\\Yash\\AppData\\Local\\Programs\\Python\\Python36-32\\Machine Learning\\IPL\\deliveries.csv")
df = df[(df["is_super_over"] != 1)]
df["pri_key"] = df["match_id"].astype(str) + "-" + df["inning"].astype(str)
openners = df[(df["over"] == 1) & (df["ball"] == 1)]
openners = openners[["pri_key", "batsman", "non_striker"]]
openners = openners.rename(columns = {"batsman":"batter1", "non_striker":"batter2"})
df = pd.merge(df, openners, on="pri_key")
df = df[["batsman", "non_striker", "batter1", "batter2", "batsman_runs"]]
df = df[((df["batsman"] == df["batter1"]) | (df["batsman"] == df["batter2"])) 
    & ((df["non_striker"] == df["batter1"]) | (df["non_striker"] == df["batter2"]))]
df1 = df.groupby(["batsman" , "non_striker"], group_keys = False)["batsman_runs"].agg("sum")
df1.nlargest(10)
Result:
batsman      non_striker
DA Warner    S Dhawan       1294
S Dhawan     DA Warner       823
RV Uthappa   G Gambhir       781
DR Smith     BB McCullum     684
CH Gayle     V Kohli         674
MEK Hussey   M Vijay         666
M Vijay      MEK Hussey      629
G Gambhir    RV Uthappa      611
BB McCullum  DR Smith        593
CH Gayle     TM Dilshan      537
I want each pair to appear only once.
For those who don't follow cricket, here is a simplified dataframe:
batsman    non_striker    runs
a              b            2
a              b            3
b              a            1
c              d            6
d              c            1
d              c            4
b              a            3
e              f            1
f              e            2
f              e            6
df1 = df.groupby(["batsman", "non_striker"], group_keys=False)["runs"].agg("sum")
df1.nlargest(30)
output:
batsman    non_striker    runs
  a            b            5
  b            a            4
  c            d            6
  d            c            5
  e            f            1
  f            e            8
expected output:
batsman    non_striker    runs
  a            b            9
  c            d            11
  e            f            9
What should I do? Please advise.
Background on GroupBy: pandas' GroupBy splits your data into separate groups so that you can run a computation on each group.
It works in three steps. Step 1: split the data into groups by creating a groupby object from the original DataFrame. Step 2: apply a function, in this case an aggregation function that computes a summary statistic (you can also transform or filter your data in this step). Step 3: combine the results into a new DataFrame or Series.
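You can sort batsman and non_striker within each row, so that both orderings of a pair map to the same key, and then group the data. In recent pandas versions, apply needs result_type='broadcast' here so that the sorted lists are written back into the two columns.
df[['batsman', 'non_striker']] = df[['batsman', 'non_striker']].apply(sorted, axis=1, result_type='broadcast')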
df.groupby(['batsman', 'non_striker']).batsman_runs.sum().nlargest(10)
Edit: You can also sort the columns with NumPy, which will be faster than applying Python's sorted row by row
import numpy as np
df[['batsman', 'non_striker']] = np.sort(df[['batsman', 'non_striker']].values, axis=1)
df.groupby(['batsman', 'non_striker'], sort=False).batsman_runs.sum().nlargest(10)
Either way, you will get,
batsman         non_striker
CH Gayle        V Kohli        2650
DA Warner       S Dhawan       2242
AB de Villiers  V Kohli        2135
G Gambhir       RV Uthappa     1795
M Vijay         MEK Hussey     1302
BB McCullum     DR Smith       1277
KA Pollard      RG Sharma      1220
MEK Hussey      SK Raina       1129
AT Rayudu       RG Sharma      1121
AM Rahane       SR Watson      1118
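Create a new DataFrame using np.sort, then group by and sum. This uses the simplified dataframe from the question, where the runs column is named runs (batsman_runs in the IPL data).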
import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.sort(df[['batsman', 'non_striker']].values,1), 
                   index=df.index,
                   columns=['player_1', 'player_2']).assign(runs = df.runs)
df1.groupby(['player_1', 'player_2']).runs.sum()
player_1  player_2
a         b            9
c         d           11
e         f            9
Name: runs, dtype: int64
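The snippets above assume df already exists. As a minimal, self-contained sketch, the following builds the simplified dataframe from the question and applies the same sort-then-group idea. The names pairs, totals, player_1 and player_2 are only illustrative, and the original batsman and non_striker columns are left untouched.

```python
import numpy as np
import pandas as pd

# Toy data copied from the question.
df = pd.DataFrame({
    "batsman":     list("aabcddbeff"),
    "non_striker": list("bbadccafee"),
    "runs":        [2, 3, 1, 6, 1, 4, 3, 1, 2, 6],
})

# Sort the two names within each row so (a, b) and (b, a) become the same key.
pairs = np.sort(df[["batsman", "non_striker"]].values, axis=1)

# Group on the sorted names; assign() returns a copy, so df itself is unchanged.
totals = (
    df.assign(player_1=pairs[:, 0], player_2=pairs[:, 1])
      .groupby(["player_1", "player_2"])["runs"]
      .sum()
)
print(totals)
```

On the IPL data the same pattern should apply with batsman_runs in place of runs, followed by nlargest(10) if only the top pairs are needed.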