I have a data frame that comes probably from a pivot table, so it looks quite messy. It is not just an XY 2 dimension data, but a 4 or 5.
What I am trying to do is, if certain conditions are fulfilled, then add a new column and compute the YoY change starting from row 3 (row 0,1,2 are strings).
And in order to avoid the zero division error (I'll get errors if do the math directly) when computing, I have to add loops in loops (if
else
function to input 0 first). By using time()
function, I can tell that the program takes me almost 10 mins every time running it.
I wonder if there is any way smarter to program it. Appreciate your kind advice!
for i in range(1, 187):
k = i * 1000
if df.iloc[1, i] == df.iloc[1, i - 1] and df.iloc[1, i] != df.iloc[1, i+ 1]:
df[k] = 0
a = df.columns.get_loc[k]
df.iloc[0, a] = df.iloc[0, i]
df.iloc[1, a] = df.iloc[1, i]
df.iloc[2, a] = 'YoY%'
for m in range(3, 13259):
if df.iloc[m, i - 1] == 0:
df.iloc[m, a] = 0
else:
df.iloc[m, a] = (df.iloc[m, i] - df.iloc[m, i - 1]) / df.iloc[m, i - 1]
df = df.sort_values(by=[0, 1, 2], axis=1)
The fragment of the data frame before sorting would look like this:
0 cash cash cash cash trade trade cash cash trade
1 54100 54100 54103 54103 52209 52209 54100 54103 52209
2 02-21 03-21 02-21 03-21 02-21 03-21 YoY% YoY% YoY%
3 0 0 1500 1500 -28.1 -2476.8 0 0 87.14
4 0 500 16500 16500 25 35 0 0 0.4
5 0 0 300 600 3.1 420 0 1 134.5
.. .. .. .. .. .. .. .. ..
13258 1973.7 4484.9 18300 18600 0 750 1.3 0.02 0
It would be easier to answer your question if you provided a sample of your data in a format that is easy to reproduce and work with. I will try to help:
Let's assume you have the following data:
records = [
['cash','cash','cash','cash','trade','trade'],
[54100,54100,54103,54103,52209,52209],
['02-21','03-21','02-21','03-21','02-21','03-21'],
[0,0,1500,1500,-28.1,-2476.8],
[0,500,16500,16500,25,35],
[0,0,300,600,3.1,420],
[1973.7,4484.9,18300,18600,0,750],
]
index = [0,1,2,3,4,5,13259]
df = pd.DataFrame.from_records(records, index=index)
Which produces the following DataFrame:
0 1 2 3 4 5
0 cash cash cash cash trade trade
1 54100 54100 54103 54103 52209 52209
2 02-21 03-21 02-21 03-21 02-21 03-21
3 0 0 1500 1500 -28.1 -2476.8
4 0 500 16500 16500 25 35
5 0 0 300 600 3.1 420
13259 1973.7 4484.9 18300 18600 0 750
It seems as if the three first rows of your data are actually table headers. We can turn them into a muliindex to take full advantage of pandas' capabilities:
# Set columns index from first rows
df.columns = [df.iloc[i] for i in range(3)]
# Delete first rows used for column names
df = df.iloc[3:]
This produces the following dataframe (notice the first three rows are now the dataframe's columns):
0 cash trade
1 54100 54103 52209
2 02-21 03-21 02-21 03-21 02-21 03-21
3 0 0 1500 1500 -28.1 -2476.8
4 0 500 16500 16500 25 35
5 0 0 300 600 3.1 420
13259 1973.7 4484.9 18300 18600 0 750
Now you can use pandas to group along the columns. First things first: we need a function to apply to each pair of columns when we group them. Lets say we define the following funcion to calculate the increment between two months:
def calculate_incr(df):
''' Calculates the increment between a dataframe's first two colums.
Ignores division by zero errors.
'''
mask = df.iloc[:,0] != 0 # prevents divide by zero error
return df.loc[mask].iloc[:,1] / df.loc[mask].iloc[:,0] - 1
We can finally group our data and calculate the increments. It is a bit tricky, because pandas doesn't seem to like grouping on several column levels, so we are going to combine the column labels first, and the split them again in later steps to combine with the original data:
In []: df.groupby(by=lambda l: '_'.join(map(str, l[:2])), axis=1).agg(calculate_incr)
Out[]:
cash_54100 cash_54103 trade_52209
3 NaN 0.000000 87.142349
4 NaN 0.000000 0.400000
5 NaN 1.000000 134.483871
13259 1.272331 0.016393 NaN
Numbers seem to be right, but you probably want to integrate this result into your original table and replace the NaN
s for the division by zero errors by 0s, so let's claculate the increments again, but this time prettyfied, splitting the columns into a multiindex:
grouped = df.groupby(by=lambda l: '_'.join(map(str, l[:2])), axis=1)
increments = {k: calculate_incr(grp) for k, grp in grouped}
increments = pd.DataFrame(increments).fillna(0)
increments.columns = increments.columns.map(lambda c: tuple([*c.split('_'), 'YoY%']))
Aha! This looks nicer now:
In []: increments
Out[]:
cash trade
54100 54103 52209
YoY% YoY% YoY%
3 0.000000 0.000000 87.142349
4 0.000000 0.000000 0.400000
5 0.000000 1.000000 134.483871
13259 1.272331 0.016393 0.000000
It is also now easy to incorporate this results to your original data:
In []: result = pd.concat([df, increments], axis=1)
In []: result
Out[]:
0 cash trade cash trade
1 54100 54103 52209 54100 54103 52209
2 02-21 03-21 02-21 03-21 02-21 03-21 YoY% YoY% YoY%
3 0 0 1500 1500 -28.1 -2476.8 0.000000 0.000000 87.142349
4 0 500 16500 16500 25 35 0.000000 0.000000 0.400000
5 0 0 300 600 3.1 420 0.000000 1.000000 134.483871
13259 1973.7 4484.9 18300 18600 0 750 1.272331 0.016393 0.000000
Finally, if you want to leave the table as in the example your provided, without headers (i.e. multiindex columns), just use this trick to reset the columns:
In []: result = result.columns.to_frame().T.append(result, ignore_index=True).T.reset_index(drop=True).T
In []: result
Out[]:
0 1 2 3 4 5 6 7 8
0 cash cash cash cash trade trade cash cash trade
1 54100 54100 54103 54103 52209 52209 54100 54103 52209
2 02-21 03-21 02-21 03-21 02-21 03-21 YoY% YoY% YoY%
3 0 0 1500 1500 -28.1 -2476.8 0 0 87.1423
4 0 500 16500 16500 25 35 0 0 0.4
5 0 0 300 600 3.1 420 0 1 134.484
6 1973.7 4484.9 18300 18600 0 750 1.27233 0.0163934 0
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With