I am working on the IPL dataset from Kaggle (https://www.kaggle.com/manasgarg/ipl). I want to sum the runs scored by two batsmen as a pair, and I have prepared my data. When I run a groupby on the DataFrame columns (batsman and non_striker), it produces two combinations of the same pair, like (a, b) and (b, a). I would like it to treat them as the same pair. I can't drop any more rows.
import pandas as pd
df = pd.read_csv("C:\\Users\\Yash\\AppData\\Local\\Programs\\Python\\Python36-32\\Machine Learning\\IPL\\deliveries.csv")
df = df[(df["is_super_over"] != 1)]
df["pri_key"] = df["match_id"].astype(str) + "-" + df["inning"].astype(str)
openners = df[(df["over"] == 1) & (df["ball"] == 1)]
openners = openners[["pri_key", "batsman", "non_striker"]]
openners = openners.rename(columns = {"batsman":"batter1", "non_striker":"batter2"})
df = pd.merge(df, openners, on="pri_key")
df = df[["batsman", "non_striker", "batter1", "batter2", "batsman_runs"]]
df = df[((df["batsman"] == df["batter1"]) | (df["batsman"] == df["batter2"]))
& ((df["non_striker"] == df["batter1"]) | (df["non_striker"] == df["batter2"]))]
df1 = df.groupby(["batsman" , "non_striker"], group_keys = False)["batsman_runs"].agg("sum")
df1.nlargest(10)
Result:
batsman non_striker
DA Warner S Dhawan 1294
S Dhawan DA Warner 823
RV Uthappa G Gambhir 781
DR Smith BB McCullum 684
CH Gayle V Kohli 674
MEK Hussey M Vijay 666
M Vijay MEK Hussey 629
G Gambhir RV Uthappa 611
BB McCullum DR Smith 593
CH Gayle TM Dilshan 537
I want each pair to appear only once, with the runs of both orderings added together.
For those who don't follow cricket, here is a simplified DataFrame:
batsman non_striker runs
a b 2
a b 3
b a 1
c d 6
d c 1
d c 4
b a 3
e f 1
f e 2
f e 6
df1 = df.groupby(["batsman", "non_striker"], group_keys=False)["runs"].agg("sum")
df1.nlargest(30)
output:
batsman non_striker runs
a b 5
b a 4
c d 6
d c 5
e f 1
f e 8
expected output:
batsman non_striker runs
a b 9
c d 11
e f 9
What should I do? Please advise.
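You can sort the batsman and non_striker values within each row, so that (a, b) and (b, a) become the same pair, and then group the data. On recent pandas versions, pass result_type='broadcast' so that apply returns a DataFrame with the original columns rather than a Series of lists.
df[['batsman', 'non_striker']] = df[['batsman', 'non_striker']].apply(sorted, axis=1, result_type='broadcast')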
df.groupby(['batsman', 'non_striker']).batsman_runs.sum().nlargest(10)
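To make the idea concrete, here is a minimal sketch on the toy data from the question. It assumes the run column is called runs, as in the toy table, and it builds the sorted pairs with a plain list comprehension instead of apply; the names pairs and totals are only illustrative.

```python
import pandas as pd

# Toy data from the question: each row is (striker, non-striker, runs).
df = pd.DataFrame({
    "batsman":     ["a", "a", "b", "c", "d", "d", "b", "e", "f", "f"],
    "non_striker": ["b", "b", "a", "d", "c", "c", "a", "f", "e", "e"],
    "runs":        [2, 3, 1, 6, 1, 4, 3, 1, 2, 6],
})

# Sort the two names within each row so that (b, a) becomes (a, b).
pairs = pd.DataFrame(
    [sorted(p) for p in zip(df["batsman"], df["non_striker"])],
    index=df.index,
    columns=["batsman", "non_striker"],
)

# Group the runs by the normalised pair and add them up.
totals = df["runs"].groupby([pairs["batsman"], pairs["non_striker"]]).sum()
print(totals)
```

Because the names are sorted before grouping, the order in which the two batsmen appear in a row no longer affects which group the row falls into.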
Edit: You can also use NumPy to sort each row, which will be faster than applying Python's sorted row by row.
import numpy as np
df[['batsman', 'non_striker']] = np.sort(df[['batsman', 'non_striker']], axis=1)
df.groupby(['batsman', 'non_striker'], sort=False).batsman_runs.sum().nlargest(10)
Either way, you will get,
batsman non_striker
CH Gayle V Kohli 2650
DA Warner S Dhawan 2242
AB de Villiers V Kohli 2135
G Gambhir RV Uthappa 1795
M Vijay MEK Hussey 1302
BB McCullum DR Smith 1277
KA Pollard RG Sharma 1220
MEK Hussey SK Raina 1129
AT Rayudu RG Sharma 1121
AM Rahane SR Watson 1118
Create a new DataFrame using np.sort, then groupby and sum.
import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.sort(df[['batsman', 'non_striker']].values, axis=1),
                   index=df.index,
                   columns=['player_1', 'player_2']).assign(runs=df.runs)
df1.groupby(['player_1', 'player_2']).runs.sum()
player_1 player_2
a b 9
c d 11
e f 9
Name: runs, dtype: int64
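As an alternative that is not taken from the answers above, the pair can also be normalised with a comparison and np.where, which avoids sorting altogether when there are exactly two columns. This is only a sketch on the toy data; the names swap, player_1, player_2 and result are illustrative.

```python
import numpy as np
import pandas as pd

# Toy data from the question.
df = pd.DataFrame({
    "batsman":     ["a", "a", "b", "c", "d", "d", "b", "e", "f", "f"],
    "non_striker": ["b", "b", "a", "d", "c", "c", "a", "f", "e", "e"],
    "runs":        [2, 3, 1, 6, 1, 4, 3, 1, 2, 6],
})

# True where the two names are in reverse alphabetical order.
swap = df["batsman"] > df["non_striker"]

# Always put the alphabetically smaller name first.
player_1 = np.where(swap, df["non_striker"], df["batsman"])
player_2 = np.where(swap, df["batsman"], df["non_striker"])

# as_index=False keeps the pair as ordinary columns, matching the
# three-column layout of the expected output in the question.
result = (
    df.assign(player_1=player_1, player_2=player_2)
      .groupby(["player_1", "player_2"], as_index=False)["runs"]
      .sum()
)
print(result)
```

The result is a regular DataFrame with one row per pair, which may be more convenient than a MultiIndex Series if the totals are merged back into other tables.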