Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Accelerate the loop

Tags:

python

pandas

I have a data frame that comes probably from a pivot table, so it looks quite messy. It is not just an XY 2 dimension data, but a 4 or 5.

What I am trying to do is, if certain conditions are fulfilled, then add a new column and compute the YoY change starting from row 3 (row 0,1,2 are strings).

And in order to avoid the zero division error (I'll get errors if do the math directly) when computing, I have to add loops in loops (if else function to input 0 first). By using time() function, I can tell that the program takes me almost 10 mins every time running it.

I wonder if there is any way smarter to program it. Appreciate your kind advice!

for i in range(1, 187):
   k = i * 1000
   if df.iloc[1, i] == df.iloc[1, i - 1] and df.iloc[1, i] != df.iloc[1, i+ 1]:
        df[k] = 0
        a = df.columns.get_loc[k]
        df.iloc[0, a] = df.iloc[0, i]
        df.iloc[1, a] = df.iloc[1, i]
        df.iloc[2, a] = 'YoY%'
        
        for m in range(3, 13259):
            if df.iloc[m, i - 1] == 0:
                 df.iloc[m, a] = 0
            else:
                 df.iloc[m, a] = (df.iloc[m, i] - df.iloc[m, i - 1]) / df.iloc[m, i - 1]
df = df.sort_values(by=[0, 1, 2], axis=1)

The fragment of the data frame before sorting would look like this:

0      cash    cash    cash     cash     trade   trade    cash    cash    trade
1      54100   54100   54103    54103    52209   52209    54100   54103   52209
2      02-21   03-21   02-21    03-21    02-21   03-21    YoY%    YoY%    YoY%
3      0       0       1500     1500     -28.1   -2476.8  0       0       87.14 
4      0       500     16500    16500    25      35       0       0       0.4
5      0       0       300      600      3.1     420      0       1       134.5
       ..      ..      ..       ..       ..      ..       ..      ..      ..
13258  1973.7  4484.9  18300    18600    0       750      1.3     0.02    0
like image 583
alexandra udinov Avatar asked Apr 30 '21 08:04

alexandra udinov


1 Answers

It would be easier to answer your question if you provided a sample of your data in a format that is easy to reproduce and work with. I will try to help:

Let's assume you have the following data:

records = [
    ['cash','cash','cash','cash','trade','trade'],
    [54100,54100,54103,54103,52209,52209],
    ['02-21','03-21','02-21','03-21','02-21','03-21'],
    [0,0,1500,1500,-28.1,-2476.8],
    [0,500,16500,16500,25,35],
    [0,0,300,600,3.1,420],
    [1973.7,4484.9,18300,18600,0,750],
]
index = [0,1,2,3,4,5,13259]
df = pd.DataFrame.from_records(records, index=index)

Which produces the following DataFrame:

            0       1      2      3      4       5
0        cash    cash   cash   cash  trade   trade
1       54100   54100  54103  54103  52209   52209
2       02-21   03-21  02-21  03-21  02-21   03-21
3           0       0   1500   1500  -28.1 -2476.8
4           0     500  16500  16500     25      35
5           0       0    300    600    3.1     420
13259  1973.7  4484.9  18300  18600      0     750

It seems as if the three first rows of your data are actually table headers. We can turn them into a muliindex to take full advantage of pandas' capabilities:

# Set columns index from first rows
df.columns = [df.iloc[i] for i in range(3)]
# Delete first rows used for column names
df = df.iloc[3:]

This produces the following dataframe (notice the first three rows are now the dataframe's columns):

0        cash                       trade
1       54100          54103        52209
2       02-21   03-21  02-21  03-21 02-21   03-21
3           0       0   1500   1500 -28.1 -2476.8
4           0     500  16500  16500    25      35
5           0       0    300    600   3.1     420
13259  1973.7  4484.9  18300  18600     0     750

Now you can use pandas to group along the columns. First things first: we need a function to apply to each pair of columns when we group them. Lets say we define the following funcion to calculate the increment between two months:

def calculate_incr(df):
    ''' Calculates the increment between a dataframe's first two colums.
    Ignores division by zero errors.
    '''
    mask = df.iloc[:,0] != 0  # prevents divide by zero error
    return df.loc[mask].iloc[:,1] / df.loc[mask].iloc[:,0] - 1

We can finally group our data and calculate the increments. It is a bit tricky, because pandas doesn't seem to like grouping on several column levels, so we are going to combine the column labels first, and the split them again in later steps to combine with the original data:

In []: df.groupby(by=lambda l: '_'.join(map(str, l[:2])), axis=1).agg(calculate_incr)
Out[]:
       cash_54100  cash_54103  trade_52209
3             NaN    0.000000    87.142349
4             NaN    0.000000     0.400000
5             NaN    1.000000   134.483871
13259    1.272331    0.016393          NaN

Numbers seem to be right, but you probably want to integrate this result into your original table and replace the NaNs for the division by zero errors by 0s, so let's claculate the increments again, but this time prettyfied, splitting the columns into a multiindex:

grouped = df.groupby(by=lambda l: '_'.join(map(str, l[:2])), axis=1)
increments = {k: calculate_incr(grp) for k, grp in grouped}
increments = pd.DataFrame(increments).fillna(0)
increments.columns = increments.columns.map(lambda c: tuple([*c.split('_'), 'YoY%']))

Aha! This looks nicer now:

In []: increments
Out[]:
           cash                 trade
          54100     54103       52209
           YoY%      YoY%        YoY%
3      0.000000  0.000000   87.142349
4      0.000000  0.000000    0.400000
5      0.000000  1.000000  134.483871
13259  1.272331  0.016393    0.000000

It is also now easy to incorporate this results to your original data:

In []: result = pd.concat([df, increments], axis=1)

In []: result
Out[]:
0        cash                       trade              cash                 trade
1       54100          54103        52209             54100     54103       52209
2       02-21   03-21  02-21  03-21 02-21   03-21      YoY%      YoY%        YoY%
3           0       0   1500   1500 -28.1 -2476.8  0.000000  0.000000   87.142349
4           0     500  16500  16500    25      35  0.000000  0.000000    0.400000
5           0       0    300    600   3.1     420  0.000000  1.000000  134.483871
13259  1973.7  4484.9  18300  18600     0     750  1.272331  0.016393    0.000000

Finally, if you want to leave the table as in the example your provided, without headers (i.e. multiindex columns), just use this trick to reset the columns:

In []: result = result.columns.to_frame().T.append(result, ignore_index=True).T.reset_index(drop=True).T

In []: result
Out[]:
        0       1      2      3      4       5        6          7        8
0    cash    cash   cash   cash  trade   trade     cash       cash    trade
1   54100   54100  54103  54103  52209   52209    54100      54103    52209
2   02-21   03-21  02-21  03-21  02-21   03-21     YoY%       YoY%     YoY%
3       0       0   1500   1500  -28.1 -2476.8        0          0  87.1423
4       0     500  16500  16500     25      35        0          0      0.4
5       0       0    300    600    3.1     420        0          1  134.484
6  1973.7  4484.9  18300  18600      0     750  1.27233  0.0163934        0
like image 121
jabellcu Avatar answered Sep 26 '22 00:09

jabellcu