
Replace iterations with elegant Pandas code

I am trying to replace my old-school code with elegant/fast Pandas code, such as the one from this previous question:

Rolling operation slow performance to create a new column

I have 4 different pieces of code whose performance I would like to improve with elegant, fast Pandas code.

1) Mean by date for all types:

The original dataframe (df) is similar to this (although much larger):

import numpy as np
import pandas as pd

idx = [np.array(['Jan-18', 'Jan-18', 'Feb-18', 'Mar-18', 'Mar-18', 'Mar-18','Apr-18', 'Apr-18', 'May-18', 'Jun-18', 'Jun-18', 'Jun-18','Jul-18', 'Aug-18', 'Aug-18', 'Sep-18', 'Sep-18', 'Oct-18','Oct-18', 'Oct-18', 'Nov-18', 'Dec-18', 'Dec-18',]),np.array(['A', 'B', 'B', 'A', 'B', 'C', 'A', 'B', 'B', 'A', 'B', 'C','A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'A', 'B', 'C'])]
data = [{'x': 1}, {'x': 5}, {'x': 3}, {'x': 2}, {'x': 7}, {'x': 3},{'x': 1}, {'x': 6}, {'x': 3}, {'x': 5}, {'x': 2}, {'x': 3},{'x': 1}, {'x': 9}, {'x': 3}, {'x': 2}, {'x': 7}, {'x': 3}, {'x': 6}, {'x': 8}, {'x': 2}, {'x': 7}, {'x': 9}]
df = pd.DataFrame(data, index=idx, columns=['x'])
df.index.names = ['date', 'type']
df = df.reset_index()
df['date'] = pd.to_datetime(df['date'], format='%b-%y')
df = df.set_index(['date', 'type'])

And it looks like this:

                 x
date       type
2018-01-01 A     1
           B     5
2018-02-01 B     3
2018-03-01 A     2
           B     7
           C     3
2018-04-01 A     1
           B     6
2018-05-01 B     3
2018-06-01 A     5
           B     2
           C     3
2018-07-01 A     1
2018-08-01 B     9
           C     3
2018-09-01 A     2
           B     7
2018-10-01 C     3
           A     6
           B     8
2018-11-01 A     2
2018-12-01 B     7
           C     9

My goal is to improve this slow for-loop code. Below is my code:

df=df.reset_index()
df['y']=0
for j in df['date'].unique():
    list_1=list(df['type'][df['date']==j].index)
    df['y'][list_1]=np.mean(df['x'][df['date']==j])

The result looks like this:

         date type  x         y
0  2018-01-01    A  1  3.000000
1  2018-01-01    B  5  3.000000
2  2018-02-01    B  3  3.000000
3  2018-03-01    A  2  4.000000
4  2018-03-01    B  7  4.000000
5  2018-03-01    C  3  4.000000
6  2018-04-01    A  1  3.500000
7  2018-04-01    B  6  3.500000
8  2018-05-01    B  3  3.000000
9  2018-06-01    A  5  3.333333
10 2018-06-01    B  2  3.333333
11 2018-06-01    C  3  3.333333
12 2018-07-01    A  1  1.000000
13 2018-08-01    B  9  6.000000
14 2018-08-01    C  3  6.000000
15 2018-09-01    A  2  4.500000
16 2018-09-01    B  7  4.500000
17 2018-10-01    C  3  5.666667
18 2018-10-01    A  6  5.666667
19 2018-10-01    B  8  5.666667
20 2018-11-01    A  2  2.000000
21 2018-12-01    B  7  8.000000
22 2018-12-01    C  9  8.000000

I tried the following Pandas code, but it doesn't work (I still need to see more examples to figure out how it works):

df['y'] = df.groupby('date')['x'].mean().reset_index(level=2, drop=True).swaplevel(0,1)

2) Observations by date for all types (uses the same dataframe (df)):

My goal is to count the number of types for every date.

My slow code is:

df=df.reset_index()
df['y']=0
for j in df['date'].unique():
    list_1=list(df['type'][df['date']==j].index)
    df['y'][list_1]=len(df['type'][df['date']==j])

The result looks like this:

         date type  x  y
0  2018-01-01    A  1  2
1  2018-01-01    B  5  2
2  2018-02-01    B  3  1
3  2018-03-01    A  2  3
4  2018-03-01    B  7  3
5  2018-03-01    C  3  3
6  2018-04-01    A  1  2
7  2018-04-01    B  6  2
8  2018-05-01    B  3  1
9  2018-06-01    A  5  3
10 2018-06-01    B  2  3
11 2018-06-01    C  3  3
12 2018-07-01    A  1  1
13 2018-08-01    B  9  2
14 2018-08-01    C  3  2
15 2018-09-01    A  2  2
16 2018-09-01    B  7  2
17 2018-10-01    C  3  3
18 2018-10-01    A  6  3
19 2018-10-01    B  8  3
20 2018-11-01    A  2  1
21 2018-12-01    B  7  2
22 2018-12-01    C  9  2

3) Observations by date for 'A' type (uses the same dataframe (df)):

My goal is to count the number of type 'A' observations for every date.

My slow code is the following:

df=df.reset_index()
df['z']=0
df['y']=0

for index,row in df.iterrows():
    if row['type']=='A':
        df['z'][index]=1
    else:
        df['z'][index]=0

for j in df['date'].unique():
    list_1=list(df['type'][df['date']==j].index)
    df['y'][list_1]=sum(df['z'][df['date']==j])

del df['z']

And it looks like this:

         date type  x  y
0  2018-01-01    A  1  1
1  2018-01-01    B  5  1
2  2018-02-01    B  3  0
3  2018-03-01    A  2  1
4  2018-03-01    B  7  1
5  2018-03-01    C  3  1
6  2018-04-01    A  1  1
7  2018-04-01    B  6  1
8  2018-05-01    B  3  0
9  2018-06-01    A  5  1
10 2018-06-01    B  2  1
11 2018-06-01    C  3  1
12 2018-07-01    A  1  1
13 2018-08-01    B  9  0
14 2018-08-01    C  3  0
15 2018-09-01    A  2  1
16 2018-09-01    B  7  1
17 2018-10-01    C  3  1
18 2018-10-01    A  6  1
19 2018-10-01    B  8  1
20 2018-11-01    A  2  1
21 2018-12-01    B  7  0
22 2018-12-01    C  9  0

4) Observations of type 'A' for which z = 1, counted by date:

Please use the following dataframe (df1):

idx = [np.array(['Jan-18', 'Jan-18', 'Feb-18', 'Mar-18', 'Mar-18', 'Mar-18','Apr-18', 'Apr-18', 'May-18', 'Jun-18', 'Jun-18', 'Jun-18','Jul-18', 'Aug-18', 'Aug-18', 'Sep-18', 'Sep-18', 'Oct-18','Oct-18', 'Oct-18', 'Nov-18', 'Dec-18', 'Dec-18',]),np.array(['A', 'B', 'B', 'A', 'B', 'C', 'A', 'B', 'B', 'A', 'B', 'C','A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'A', 'B', 'C'])]
data = [{'x': 10, 'z': 1}, {'x': 50, 'z': 0}, {'x': 30, 'z': 0}, {'x': 20, 'z': 0}, {'x': 70, 'z': 1}, {'x': 30, 'z': 1},{'x': 10, 'z': 1}, {'x': 60, 'z': 0}, {'x': 30, 'z': 0}, {'x': 50, 'z': 1}, {'x': 20, 'z': 0}, {'x': 30, 'z': 1},{'x': 10, 'z': 0}, {'x': 90, 'z': 1}, {'x': 30, 'z': 1}, {'x': 20, 'z': 1}, {'x': 70, 'z': 0}, {'x': 30, 'z': 0}, {'x': 60, 'z': 1}, {'x': 80, 'z': 1}, {'x': 20, 'z': 0}, {'x': 70, 'z': 0}, {'x': 90, 'z': 1}]
df1 = pd.DataFrame(data, index=idx, columns=['x', 'z'])
df1.index.names = ['date', 'type']
df1 = df1.reset_index()
df1['date'] = pd.to_datetime(df1['date'], format='%b-%y')
df1 = df1.set_index(['date', 'type'])

This dataframe (df1) looks like this:

                  x  z
date       type
2018-01-01 A     10  1
           B     50  0
2018-02-01 B     30  0
2018-03-01 A     20  0
           B     70  1
           C     30  1
2018-04-01 A     10  1
           B     60  0
2018-05-01 B     30  0
2018-06-01 A     50  1
           B     20  0
           C     30  1
2018-07-01 A     10  0
2018-08-01 B     90  1
           C     30  1
2018-09-01 A     20  1
           B     70  0
2018-10-01 C     30  0
           A     60  1
           B     80  1
2018-11-01 A     20  0
2018-12-01 B     70  0
           C     90  1

My slow code is:

df1=df1.reset_index()
df1['h']=0
df1['k']=0
df1['y']=0

for index,row in df1.iterrows():
    if row['type']=='A':
        df1['h'][index]=1
    else:
        df1['h'][index]=0

for index,row in df1.iterrows():
    if row['z']==1 and row['h']==1:
        df1['k'][index]=1
    else:
        df1['k'][index]=0   

for j in df1['date'].unique():
    list_1=list(df1['type'][df1['date']==j].index)
    df1['y'][list_1]=sum(df1['k'][df1['date']==j])

del df1['h']
del df1['k']

And it looks like this:

         date type   x  z  y
0  2018-01-01    A  10  1  1
1  2018-01-01    B  50  0  1
2  2018-02-01    B  30  0  0
3  2018-03-01    A  20  0  0
4  2018-03-01    B  70  1  0
5  2018-03-01    C  30  1  0
6  2018-04-01    A  10  1  1
7  2018-04-01    B  60  0  1
8  2018-05-01    B  30  0  0
9  2018-06-01    A  50  1  1
10 2018-06-01    B  20  0  1
11 2018-06-01    C  30  1  1
12 2018-07-01    A  10  0  0
13 2018-08-01    B  90  1  0
14 2018-08-01    C  30  1  0
15 2018-09-01    A  20  1  1
16 2018-09-01    B  70  0  1
17 2018-10-01    C  30  0  1
18 2018-10-01    A  60  1  1
19 2018-10-01    B  80  1  1
20 2018-11-01    A  20  0  0
21 2018-12-01    B  70  0  0
22 2018-12-01    C  90  1  0

Although the question seems long, I know the Pandas answers can be just a few lines of code. If you can let me know how much faster your code is than mine, it would be very useful.

asked May 10 '19 by Mario Arend


2 Answers

You are looking for groupby + transform. Here .transform is key, as it broadcasts the result back to all rows in the original DataFrame that belong to that group.
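
As a minimal sketch (my addition, not part of the original answer), the difference between a plain aggregation and .transform looks like this on a made-up toy frame:

import pandas as pd

# Toy frame, purely for illustration.
toy = pd.DataFrame({'g': ['a', 'a', 'b'], 'v': [1, 3, 10]})

# Plain aggregation: one value per group (2 rows, indexed by group).
print(toy.groupby('g')['v'].mean())

# transform: the group mean broadcast back to every row (3 rows,
# aligned with toy's index), ready to assign as a new column.
print(toy.groupby('g')['v'].transform('mean'))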

For the first case, we can simply transform 'mean' over x.

For the unique types, it's faster to bring the index level into a column than to work with the index directly, so assign a column and calculate the number of unique values within each group.

For the final two cases, you can create a Boolean column indicating whether each row satisfies the condition, then sum it within each group.

#1 Get the mean of `x` by date
df['x_avg'] = df.groupby('date').x.transform('mean')

#2 Get the # of unique types. 
df['N'] = (df.assign(TYPE = df.index.get_level_values('type'))
             .groupby('date').TYPE.transform('nunique'))

#3 Get the number of Type == A within a group
df['num_A']  = (df.assign(eqA = (df.index.get_level_values('type') == 'A'))
                  .groupby('date').eqA.transform(sum).astype(int))

#4 Really just a slight extension of 3
df1['cond_4']  = (df1.assign(to_sum = ((df1.index.get_level_values('type') == 'A')
                                       &  (df1.z == 1)).astype(int))
                     .groupby('date').to_sum.transform(sum))

Output df:

                 x  num_A     x_avg  N
date       type                       
2018-01-01 A     1      1  3.000000  2
           B     5      1  3.000000  2
2018-02-01 B     3      0  3.000000  1
2018-03-01 A     2      1  4.000000  3
           B     7      1  4.000000  3
           C     3      1  4.000000  3
2018-04-01 A     1      1  3.500000  2
           B     6      1  3.500000  2
2018-05-01 B     3      0  3.000000  1
2018-06-01 A     5      1  3.333333  3
           B     2      1  3.333333  3
           C     3      1  3.333333  3
2018-07-01 A     1      1  1.000000  1
2018-08-01 B     9      0  6.000000  2
           C     3      0  6.000000  2
2018-09-01 A     2      1  4.500000  2
           B     7      1  4.500000  2
2018-10-01 C     3      1  5.666667  3
           A     6      1  5.666667  3
           B     8      1  5.666667  3
2018-11-01 A     2      1  2.000000  1
2018-12-01 B     7      0  8.000000  2
           C     9      0  8.000000  2
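
Regarding the question about speed: below is a rough, self-contained timing sketch (my addition, not from the original answer) comparing the question's per-date loop with the transform version. The synthetic frame and names (big, n_dates, per_date) are made up for illustration; actual numbers depend on your data size and shape.

import time

import numpy as np
import pandas as pd

# Synthetic data with many distinct dates so the difference is visible.
rng = np.random.default_rng(0)
n_dates, per_date = 2_000, 10
big = pd.DataFrame({
    'date': pd.date_range('2000-01-01', periods=n_dates, freq='D').repeat(per_date),
    'x': rng.integers(1, 10, n_dates * per_date),
})

# Loop version, equivalent to the question's approach (using .loc to
# avoid chained-assignment warnings).
t0 = time.perf_counter()
loop = big.copy()
loop['y'] = 0.0
for d in loop['date'].unique():
    mask = loop['date'] == d
    loop.loc[mask, 'y'] = loop.loc[mask, 'x'].mean()
t_loop = time.perf_counter() - t0

# Vectorised groupby/transform version.
t0 = time.perf_counter()
vec = big.copy()
vec['y'] = vec.groupby('date')['x'].transform('mean')
t_vec = time.perf_counter() - t0

assert np.allclose(loop['y'], vec['y'])
print(f"loop: {t_loop:.3f}s, transform: {t_vec:.3f}s")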
answered by ALollz


For the first case, you can try the following:

df['y'] = df.groupby('date')['x'].transform(np.mean)

For the last case:

df1 = df1.reset_index()
df1['h'] = np.where(df1['type'] == 'A', 1, 0)
df1['k'] = np.where((df1.h == 1) & (df1.z == 1), 1, 0)
df1['y'] = df1.groupby('date')['k'].transform(np.sum)
del df1['h']
del df1['k']
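
As a side note, a more compact variant (a sketch of my own, not part of the original answer) builds the Boolean mask inline, so the temporary h and k columns aren't needed:

# Assumes df1 has already had reset_index() applied as above, so
# 'date', 'type' and 'z' are ordinary columns.
mask = (df1['type'] == 'A') & (df1['z'] == 1)
df1['y'] = mask.groupby(df1['date']).transform('sum')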
answered by hacker315