I am trying to change my old school codes for the elegant/fast Pandas codes, such as the one of this previous question:
Rolling operation slow performance to create a new column
I have 4 different codes that I would like to improve in performance with an elegant and fast code using Pandas.
The original dataframe (df) is similar to this (although much larger):
idx = [np.array(['Jan-18', 'Jan-18', 'Feb-18', 'Mar-18', 'Mar-18', 'Mar-18','Apr-18', 'Apr-18', 'May-18', 'Jun-18', 'Jun-18', 'Jun-18','Jul-18', 'Aug-18', 'Aug-18', 'Sep-18', 'Sep-18', 'Oct-18','Oct-18', 'Oct-18', 'Nov-18', 'Dec-18', 'Dec-18',]),np.array(['A', 'B', 'B', 'A', 'B', 'C', 'A', 'B', 'B', 'A', 'B', 'C','A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'A', 'B', 'C'])]
data = [{'x': 1}, {'x': 5}, {'x': 3}, {'x': 2}, {'x': 7}, {'x': 3},{'x': 1}, {'x': 6}, {'x': 3}, {'x': 5}, {'x': 2}, {'x': 3},{'x': 1}, {'x': 9}, {'x': 3}, {'x': 2}, {'x': 7}, {'x': 3}, {'x': 6}, {'x': 8}, {'x': 2}, {'x': 7}, {'x': 9}]
df = pd.DataFrame(data, index=idx, columns=['x'])
df.index.names=['date','type']
df=df.reset_index()
df['date'] = pd.to_datetime(df['date'],format = '%b-%y')
df=df.set_index(['date','type'])
And it looks like this:
x
date type
2018-01-01 A 1
B 5
2018-02-01 B 3
2018-03-01 A 2
B 7
C 3
2018-04-01 A 1
B 6
2018-05-01 B 3
2018-06-01 A 5
B 2
C 3
2018-07-01 A 1
2018-08-01 B 9
C 3
2018-09-01 A 2
B 7
2018-10-01 C 3
A 6
B 8
2018-11-01 A 2
2018-12-01 B 7
C 9
My goal is to improve this slow for-loop code. Below my code:
df=df.reset_index()
df['y']=0
for j in df['date'].unique():
list_1=list(df['type'][df['date']==j].index)
df['y'][list_1]=np.mean(df['x'][df['date']==j])
The result looks like this:
date type x y
0 2018-01-01 A 1 3.000000
1 2018-01-01 B 5 3.000000
2 2018-02-01 B 3 3.000000
3 2018-03-01 A 2 4.000000
4 2018-03-01 B 7 4.000000
5 2018-03-01 C 3 4.000000
6 2018-04-01 A 1 3.500000
7 2018-04-01 B 6 3.500000
8 2018-05-01 B 3 3.000000
9 2018-06-01 A 5 3.333333
10 2018-06-01 B 2 3.333333
11 2018-06-01 C 3 3.333333
12 2018-07-01 A 1 1.000000
13 2018-08-01 B 9 6.000000
14 2018-08-01 C 3 6.000000
15 2018-09-01 A 2 4.500000
16 2018-09-01 B 7 4.500000
17 2018-10-01 C 3 5.666667
18 2018-10-01 A 6 5.666667
19 2018-10-01 B 8 5.666667
20 2018-11-01 A 2 2.000000
21 2018-12-01 B 7 8.000000
22 2018-12-01 C 9 8.000000
** I tried the following Pandas code, but it doesn't work (I still need to see more examples to figure out how it works):
df['y'] = df.groupby('date')['x'].mean().reset_index(level=2, drop=True).swaplevel(0,1)
My goal is to measure the number of types for every date.
My slow code is:
df=df.reset_index()
df['y']=0
for j in df['date'].unique():
list_1=list(df['type'][df['date']==j].index)
df['y'][list_1]=len(df['type'][df['date']==j])
The result looks like this:
date type x y
0 2018-01-01 A 1 2
1 2018-01-01 B 5 2
2 2018-02-01 B 3 1
3 2018-03-01 A 2 3
4 2018-03-01 B 7 3
5 2018-03-01 C 3 3
6 2018-04-01 A 1 2
7 2018-04-01 B 6 2
8 2018-05-01 B 3 1
9 2018-06-01 A 5 3
10 2018-06-01 B 2 3
11 2018-06-01 C 3 3
12 2018-07-01 A 1 1
13 2018-08-01 B 9 2
14 2018-08-01 C 3 2
15 2018-09-01 A 2 2
16 2018-09-01 B 7 2
17 2018-10-01 C 3 3
18 2018-10-01 A 6 3
19 2018-10-01 B 8 3
20 2018-11-01 A 2 1
21 2018-12-01 B 7 2
22 2018-12-01 C 9 2
My goal is to measure the number of type A for every date.
My slow code is the following:
df=df.reset_index()
df['z']=0
df['y']=0
for index,row in df.iterrows():
if row['type']=='A':
df['z'][index]=1
else:
df['z'][index]=0
for j in df['date'].unique():
list_1=list(df['type'][df['date']==j].index)
df['y'][list_1]=sum(df['z'][df['date']==j])
del df['z']
And it looks like this:
date type x y
0 2018-01-01 A 1 1
1 2018-01-01 B 5 1
2 2018-02-01 B 3 0
3 2018-03-01 A 2 1
4 2018-03-01 B 7 1
5 2018-03-01 C 3 1
6 2018-04-01 A 1 1
7 2018-04-01 B 6 1
8 2018-05-01 B 3 0
9 2018-06-01 A 5 1
10 2018-06-01 B 2 1
11 2018-06-01 C 3 1
12 2018-07-01 A 1 1
13 2018-08-01 B 9 0
14 2018-08-01 C 3 0
15 2018-09-01 A 2 1
16 2018-09-01 B 7 1
17 2018-10-01 C 3 1
18 2018-10-01 A 6 1
19 2018-10-01 B 8 1
20 2018-11-01 A 2 1
21 2018-12-01 B 7 0
22 2018-12-01 C 9 0
Please use the following dataframe (df1):
idx = [np.array(['Jan-18', 'Jan-18', 'Feb-18', 'Mar-18', 'Mar-18', 'Mar-18','Apr-18', 'Apr-18', 'May-18', 'Jun-18', 'Jun-18', 'Jun-18','Jul-18', 'Aug-18', 'Aug-18', 'Sep-18', 'Sep-18', 'Oct-18','Oct-18', 'Oct-18', 'Nov-18', 'Dec-18', 'Dec-18',]),np.array(['A', 'B', 'B', 'A', 'B', 'C', 'A', 'B', 'B', 'A', 'B', 'C','A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'A', 'B', 'C'])]
data = [{'x': 10, 'z': 1}, {'x': 50, 'z': 0}, {'x': 30, 'z': 0}, {'x': 20, 'z': 0}, {'x': 70, 'z': 1}, {'x': 30, 'z': 1},{'x': 10, 'z': 1}, {'x': 60, 'z': 0}, {'x': 30, 'z': 0}, {'x': 50, 'z': 1}, {'x': 20, 'z': 0}, {'x': 30, 'z': 1},{'x': 10, 'z': 0}, {'x': 90, 'z': 1}, {'x': 30, 'z': 1}, {'x': 20, 'z': 1}, {'x': 70, 'z': 0}, {'x': 30, 'z': 0}, {'x': 60, 'z': 1}, {'x': 80, 'z': 1}, {'x': 20, 'z': 0}, {'x': 70, 'z': 0}, {'x': 90, 'z': 1}]
df1 = pd.DataFrame(data, index=idx, columns=['x','z'])
df1.index.names=['date','type']
df1=df1.reset_index()
df1['date'] = pd.to_datetime(df1['date'],format = '%b-%y')
df1=df1.set_index(['date','type'])
This dataframe (df1) looks like this:
x z
date type
2018-01-01 A 10 1
B 50 0
2018-02-01 B 30 0
2018-03-01 A 20 0
B 70 1
C 30 1
2018-04-01 A 10 1
B 60 0
2018-05-01 B 30 0
2018-06-01 A 50 1
B 20 0
C 30 1
2018-07-01 A 10 0
2018-08-01 B 90 1
C 30 1
2018-09-01 A 20 1
B 70 0
2018-10-01 C 30 0
A 60 1
B 80 1
2018-11-01 A 20 0
2018-12-01 B 70 0
C 90 1
My slow code is:
df1=df1.reset_index()
df1['h']=0
df1['k']=0
df1['y']=0
for index,row in df1.iterrows():
if row['type']=='A':
df1['h'][index]=1
else:
df1['h'][index]=0
for index,row in df1.iterrows():
if row['z']==1 and row['h']==1:
df1['k'][index]=1
else:
df1['k'][index]=0
for j in df1['date'].unique():
list_1=list(df1['type'][df1['date']==j].index)
df1['y'][list_1]=sum(df1['k'][df1['date']==j])
del df1['h']
del df1['k']
And it looks like this:
date type x z y
0 2018-01-01 A 10 1 1
1 2018-01-01 B 50 0 1
2 2018-02-01 B 30 0 0
3 2018-03-01 A 20 0 0
4 2018-03-01 B 70 1 0
5 2018-03-01 C 30 1 0
6 2018-04-01 A 10 1 1
7 2018-04-01 B 60 0 1
8 2018-05-01 B 30 0 0
9 2018-06-01 A 50 1 1
10 2018-06-01 B 20 0 1
11 2018-06-01 C 30 1 1
12 2018-07-01 A 10 0 0
13 2018-08-01 B 90 1 0
14 2018-08-01 C 30 1 0
15 2018-09-01 A 20 1 1
16 2018-09-01 B 70 0 1
17 2018-10-01 C 30 0 1
18 2018-10-01 A 60 1 1
19 2018-10-01 B 80 1 1
20 2018-11-01 A 20 0 0
21 2018-12-01 B 70 0 0
22 2018-12-01 C 90 1 0
Although the question seems long I know that the answers with Pandas can be in a few lines of code. If you can let me know how much faster are your codes than mine it would be very useful.
You are looking for groupby
+ transform
. Here the .transform
is key as it broadcasts the result back to all rows in the original DataFrame that belong to that group.
For the first we can transform on x
, simply.
For the unique types, it's faster to bring it to a column than it is to deal with the index, so assign
a column and calculate the # of unique values within group.
For the final two conditions, you can create a Boolean column to see if that row satisfies the condition, which you sum within groups.
#1 Get the mean of `x` by date
df['x_avg'] = df.groupby('date').x.transform('mean')
#2 Get the # of unique types.
df['N'] = (df.assign(TYPE = df.index.get_level_values('type'))
.groupby('date').TYPE.transform('nunique'))
#3 Get the number of Type == A within a group
df['num_A'] = (df.assign(eqA = (df.index.get_level_values('type') == 'A'))
.groupby('date').eqA.transform(sum).astype(int))
#4 Really just a slight extension of 3
df1['cond_4'] = (df1.assign(to_sum = ((df1.index.get_level_values('type') == 'A')
& (df1.z == 1)).astype(int))
.groupby('date').to_sum.transform(sum))
df
: x num_A x_avg N
date type
2018-01-01 A 1 1 3.000000 2
B 5 1 3.000000 2
2018-02-01 B 3 0 3.000000 1
2018-03-01 A 2 1 4.000000 3
B 7 1 4.000000 3
C 3 1 4.000000 3
2018-04-01 A 1 1 3.500000 2
B 6 1 3.500000 2
2018-05-01 B 3 0 3.000000 1
2018-06-01 A 5 1 3.333333 3
B 2 1 3.333333 3
C 3 1 3.333333 3
2018-07-01 A 1 1 1.000000 1
2018-08-01 B 9 0 6.000000 2
C 3 0 6.000000 2
2018-09-01 A 2 1 4.500000 2
B 7 1 4.500000 2
2018-10-01 C 3 1 5.666667 3
A 6 1 5.666667 3
B 8 1 5.666667 3
2018-11-01 A 2 1 2.000000 1
2018-12-01 B 7 0 8.000000 2
C 9 0 8.000000 2
For first case, you can try below:
df['y'] = df.groupby('date')['x'].transform(np.mean)
For last case:
df1 = df1.reset_index()
df1['h'] = np.where(df1.reset_index()['type'] == 'A', 1, 0)
df1['k'] = np.where((df1.h == 1) & (df1.z == 1), 1, 0)
df1.y = df1.groupby('date')['k'].transform(np.sum)
del df1['h']
del df1['k']
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With