I have a dataset as such:
df0 = (pd.DataFrame({'year_minor_renovation': ['2023', '2025', np.nan, '2026'],
'year_intermediate_renovation': [np.nan, '2025', '2027', '2030'],
'year_major_renovation': ['2030', np.nan, np.nan, np.nan],
'costs_minor_renovation': [1000, 3000, np.nan, 2000],
'costs_intermediate_renovation': [np.nan, 5000, 5000, 10000],
'costs_major_renovation': [75000, np.nan, np.nan, np.nan]}))
year_minor_renovation | year_intermediate_renovation | year_major_renovation | costs_minor_renovation | costs_intermediate_renovation | costs_major_renovation | |
---|---|---|---|---|---|---|
0 | 2023 | NaN | 2030 | 1000.0 | NaN | 75000.0 |
1 | 2025 | 2025 | NaN | 3000.0 | 5000.0 | NaN |
2 | NaN | 2027 | NaN | NaN | 5000.0 | NaN |
3 | 2026 | 2030 | NaN | 2000.0 | 10000.0 | NaN |
Each line represents a building to renovate. It can be seen as two concatenated subsets with the same index:
df.iloc[:, :3]
for the years between 2023 and 2030 when one or multiple renovations need to be done on a specific building (the index)df.iloc[:, 3:]
is the costs correspondingSome buildings will need different renovation types at different years (ex: df.iloc[[1]]
).
I need to agreggate new columns, one per year, with the costs per building, independently of what the type of renovation is.
(pd.DataFrame({'2023': [1000, np.nan, np.nan, np.nan],
'2024': [np.nan, np.nan, np.nan, np.nan],
'2025': [np.nan, 8000, np.nan, np.nan],
'2026': [np.nan, np.nan, np.nan, 2000],
'2027': [np.nan, np.nan, 5000, np.nan],
'2028': [np.nan, np.nan, np.nan, np.nan],
'2029': [np.nan, np.nan, np.nan, np.nan],
'2030': [75000, np.nan, 5000, 10000]}))
2023 | 2024 | 2025 | 2026 | 2027 | 2028 | 2029 | 2030 | |
---|---|---|---|---|---|---|---|---|
0 | 1000.0 | NaN | NaN | NaN | NaN | NaN | NaN | 75000.0 |
1 | NaN | NaN | 8000.0 | NaN | NaN | NaN | NaN | NaN |
2 | NaN | NaN | NaN | NaN | 5000.0 | NaN | NaN | 5000.0 |
3 | NaN | NaN | NaN | 2000.0 | NaN | NaN | NaN | 10000.0 |
I tried to write a groupby function to create those new columns, but even if the result gives some data I'll need later, it's some kind of too much of a synthesis for what I want at that point:
def costs_per_year(df):
dfs = []
for i in ['year_minor_renovation',
'year_intermediate_renovation',
'year_major_renovation']:
j = 'costs' + str(i[4:])
df_ = (df.groupby(i)
.agg({j : 'sum' })
.reset_index()
.rename({i:'year'}, axis =1)
)
dfs.append(df_)
# merge the dataframes
merged_df = dfs[0]
for df_ in dfs[1:]:
merged_df = merged_df.merge(df_, on='year', how='outer')
merged_df = (merged_df
.set_index('year')
.transpose()
.reset_index()
)
return merged_df
year | index | 2023 | 2025 | 2026 | 2027 | 2030 |
---|---|---|---|---|---|---|
0 | costs_minor_renovation | 1000.0 | 3000.0 | 2000.0 | NaN | NaN |
1 | costs_intermediate_renovation | NaN | 5000.0 | NaN | 5000.0 | 10000.0 |
2 | costs_major_renovation | NaN | NaN | NaN | NaN | 750000.0 |
You can use pd.wide_to_long
:
out = (pd.wide_to_long(df0.reset_index(), stubnames=['year', 'costs'], i='index', j='var', sep='_', suffix='.*')
.dropna().astype({'year': int})
.pivot_table(index='index', columns='year', values='costs', aggfunc='sum')
.rename_axis(index=None, columns=None))
out = out.reindex(columns=range(out.columns.min(), out.columns.max()+1))
Output:
>>> out
2023 2024 2025 2026 2027 2028 2029 2030
0 1000.0 NaN NaN NaN NaN NaN NaN 75000.0
1 NaN NaN 8000.0 NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN 5000.0 NaN NaN NaN
3 NaN NaN NaN 2000.0 NaN NaN NaN 10000.0
Step by step to better understanding the transformation:
# Step 1: flatten your dataframe
>>> out = out = pd.wide_to_long(df0.reset_index(), stubnames=['year', 'costs'], i='index', j='var', sep='_', suffix='.*')
year costs
index var
0 minor_renovation 2023 1000.0
1 minor_renovation 2025 3000.0
2 minor_renovation NaN NaN
3 minor_renovation 2026 2000.0
0 intermediate_renovation NaN NaN
1 intermediate_renovation 2025 5000.0
2 intermediate_renovation 2027 5000.0
3 intermediate_renovation 2030 10000.0
0 major_renovation 2030 75000.0
1 major_renovation NaN NaN
2 major_renovation NaN NaN
3 major_renovation NaN NaN
# Step 2: cast year to int
>>> out = out.dropna().astype({'year': int})
year costs
index var
0 minor_renovation 2023 1000.0
1 minor_renovation 2025 3000.0
3 minor_renovation 2026 2000.0
1 intermediate_renovation 2025 5000.0
2 intermediate_renovation 2027 5000.0
3 intermediate_renovation 2030 10000.0
0 major_renovation 2030 75000.0
# Step 3: reshape your dataframe
>>> out = out.pivot_table(index='index', columns='year', values='costs', aggfunc='sum')
year 2023 2025 2026 2027 2030
index
0 1000.0 NaN NaN NaN 75000.0
1 NaN 8000.0 NaN NaN NaN
2 NaN NaN NaN 5000.0 NaN
3 NaN NaN 2000.0 NaN 10000.0
# Step 4: rename axis
>>> out = out.rename_axis(index=None, columns=None)
2023 2025 2026 2027 2030
0 1000.0 NaN NaN NaN 75000.0
1 NaN 8000.0 NaN NaN NaN
2 NaN NaN NaN 5000.0 NaN
3 NaN NaN 2000.0 NaN 10000.0
# Step 5: add missing columns
>>> out = out.reindex(columns=range(out.columns.min(), out.columns.max()+1))
2023 2024 2025 2026 2027 2028 2029 2030
0 1000.0 NaN NaN NaN NaN NaN NaN 75000.0
1 NaN NaN 8000.0 NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN 5000.0 NaN NaN NaN
3 NaN NaN NaN 2000.0 NaN NaN NaN 10000.0
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With