I have a dataset as such:
df0 = (pd.DataFrame({'year_minor_renovation': ['2023', '2025', np.nan, '2026'],
               'year_intermediate_renovation': [np.nan, '2025', '2027', '2030'],
               'year_major_renovation': ['2030', np.nan, np.nan, np.nan],
               'costs_minor_renovation': [1000, 3000, np.nan, 2000],
               'costs_intermediate_renovation': [np.nan, 5000, 5000, 10000],
               'costs_major_renovation': [75000, np.nan, np.nan, np.nan]}))
|   | year_minor_renovation | year_intermediate_renovation | year_major_renovation | costs_minor_renovation | costs_intermediate_renovation | costs_major_renovation |
|---|---|---|---|---|---|---|
| 0 | 2023 | NaN | 2030 | 1000.0 | NaN | 75000.0 |
| 1 | 2025 | 2025 | NaN | 3000.0 | 5000.0 | NaN |
| 2 | NaN | 2027 | NaN | NaN | 5000.0 | NaN |
| 3 | 2026 | 2030 | NaN | 2000.0 | 10000.0 | NaN |
Each row represents a building to renovate. The frame can be seen as two concatenated subsets sharing the same index:

- `df.iloc[:, :3]`: the years, between 2023 and 2030, when one or more renovations must be done on a given building (the index);
- `df.iloc[:, 3:]`: the corresponding costs.

Some buildings need different renovation types in different years (e.g. `df.iloc[[1]]`).

I need to aggregate new columns, one per year, holding the costs per building, regardless of the renovation type:
(pd.DataFrame({'2023': [1000, np.nan, np.nan, np.nan],
              '2024': [np.nan, np.nan, np.nan, np.nan],
              '2025': [np.nan, 8000, np.nan, np.nan],
              '2026': [np.nan, np.nan, np.nan, 2000],
              '2027': [np.nan, np.nan, 5000, np.nan],
              '2028': [np.nan, np.nan, np.nan, np.nan],
              '2029': [np.nan, np.nan, np.nan, np.nan],
              '2030': [75000, np.nan, np.nan, 10000]}))
|   | 2023 | 2024 | 2025 | 2026 | 2027 | 2028 | 2029 | 2030 |
|---|---|---|---|---|---|---|---|---|
| 0 | 1000.0 | NaN | NaN | NaN | NaN | NaN | NaN | 75000.0 |
| 1 | NaN | NaN | 8000.0 | NaN | NaN | NaN | NaN | NaN |
| 2 | NaN | NaN | NaN | NaN | 5000.0 | NaN | NaN | NaN |
| 3 | NaN | NaN | NaN | 2000.0 | NaN | NaN | NaN | 10000.0 |
I tried to write a groupby-based function to create those new columns, but even though the result contains data I will need later, it aggregates more than I want at this point:
def costs_per_year(df):
    dfs = []
    for i in ['year_minor_renovation',
              'year_intermediate_renovation',
              'year_major_renovation']:
        # matching costs column, e.g. 'costs_minor_renovation'
        j = 'costs' + i[4:]
        df_ = (df.groupby(i)
                 .agg({j: 'sum'})
                 .reset_index()
                 .rename({i: 'year'}, axis=1))
        dfs.append(df_)

    # merge the per-type dataframes on the year
    merged_df = dfs[0]
    for df_ in dfs[1:]:
        merged_df = merged_df.merge(df_, on='year', how='outer')

    merged_df = (merged_df
                 .set_index('year')
                 .transpose()
                 .reset_index())

    return merged_df
| year | index | 2023 | 2025 | 2026 | 2027 | 2030 |
|---|---|---|---|---|---|---|
| 0 | costs_minor_renovation | 1000.0 | 3000.0 | 2000.0 | NaN | NaN |
| 1 | costs_intermediate_renovation | NaN | 5000.0 | NaN | 5000.0 | 10000.0 |
| 2 | costs_major_renovation | NaN | NaN | NaN | NaN | 75000.0 |
You can use `pd.wide_to_long`:
out = (pd.wide_to_long(df0.reset_index(), stubnames=['year', 'costs'], i='index', j='var', sep='_', suffix='.*')
         .dropna().astype({'year': int})
         .pivot_table(index='index', columns='year', values='costs', aggfunc='sum')
         .rename_axis(index=None, columns=None))
out = out.reindex(columns=range(out.columns.min(), out.columns.max()+1))
Output:
>>> out
     2023  2024    2025    2026    2027  2028  2029     2030
0  1000.0   NaN     NaN     NaN     NaN   NaN   NaN  75000.0
1     NaN   NaN  8000.0     NaN     NaN   NaN   NaN      NaN
2     NaN   NaN     NaN     NaN  5000.0   NaN   NaN      NaN
3     NaN   NaN     NaN  2000.0     NaN   NaN   NaN  10000.0
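For comparison, the same reshape can be sketched without `wide_to_long`, by manually pairing each `year_*` column with its `costs_*` column and concatenating the pairs into long form (a minimal sketch assuming the exact column naming above, not part of the original answer):

```python
import numpy as np
import pandas as pd

df0 = pd.DataFrame({'year_minor_renovation': ['2023', '2025', np.nan, '2026'],
                    'year_intermediate_renovation': [np.nan, '2025', '2027', '2030'],
                    'year_major_renovation': ['2030', np.nan, np.nan, np.nan],
                    'costs_minor_renovation': [1000, 3000, np.nan, 2000],
                    'costs_intermediate_renovation': [np.nan, 5000, 5000, 10000],
                    'costs_major_renovation': [75000, np.nan, np.nan, np.nan]})

# Pair each year column with its costs column
pairs = [(f'year_{t}', f'costs_{t}')
         for t in ['minor_renovation', 'intermediate_renovation', 'major_renovation']]

# Stack the pairs into one long frame: building index, year, costs
long = (pd.concat([df0[[y, c]].set_axis(['year', 'costs'], axis=1)
                   for y, c in pairs])
          .rename_axis('building')
          .reset_index()
          .dropna())

# Sum costs per (building, year), then insert the missing years
out = (long.astype({'year': int})
           .pivot_table(index='building', columns='year', values='costs', aggfunc='sum'))
out = out.reindex(columns=range(out.columns.min(), out.columns.max() + 1))
```

This yields the same result; `wide_to_long` simply automates the pairing via `stubnames`, which is more convenient when there are many renovation types.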
Step by step, to better understand the transformation:
# Step 1: flatten your dataframe
>>> out = pd.wide_to_long(df0.reset_index(), stubnames=['year', 'costs'], i='index', j='var', sep='_', suffix='.*')
                               year    costs
index var                                   
0     minor_renovation         2023   1000.0
1     minor_renovation         2025   3000.0
2     minor_renovation          NaN      NaN
3     minor_renovation         2026   2000.0
0     intermediate_renovation   NaN      NaN
1     intermediate_renovation  2025   5000.0
2     intermediate_renovation  2027   5000.0
3     intermediate_renovation  2030  10000.0
0     major_renovation         2030  75000.0
1     major_renovation          NaN      NaN
2     major_renovation          NaN      NaN
3     major_renovation          NaN      NaN
# Step 2: cast year to int
>>> out = out.dropna().astype({'year': int})
                               year    costs
index var                                   
0     minor_renovation         2023   1000.0
1     minor_renovation         2025   3000.0
3     minor_renovation         2026   2000.0
1     intermediate_renovation  2025   5000.0
2     intermediate_renovation  2027   5000.0
3     intermediate_renovation  2030  10000.0
0     major_renovation         2030  75000.0
# Step 3: reshape your dataframe
>>> out = out.pivot_table(index='index', columns='year', values='costs', aggfunc='sum')
year     2023    2025    2026    2027     2030
index                                         
0      1000.0     NaN     NaN     NaN  75000.0
1         NaN  8000.0     NaN     NaN      NaN
2         NaN     NaN     NaN  5000.0      NaN
3         NaN     NaN  2000.0     NaN  10000.0
# Step 4: rename axis
>>> out = out.rename_axis(index=None, columns=None)
     2023    2025    2026    2027     2030
0  1000.0     NaN     NaN     NaN  75000.0
1     NaN  8000.0     NaN     NaN      NaN
2     NaN     NaN     NaN  5000.0      NaN
3     NaN     NaN  2000.0     NaN  10000.0
# Step 5: add missing columns
>>> out = out.reindex(columns=range(out.columns.min(), out.columns.max()+1))
     2023  2024    2025    2026    2027  2028  2029     2030
0  1000.0   NaN     NaN     NaN     NaN   NaN   NaN  75000.0
1     NaN   NaN  8000.0     NaN     NaN   NaN   NaN      NaN
2     NaN   NaN     NaN     NaN  5000.0   NaN   NaN      NaN
3     NaN   NaN     NaN  2000.0     NaN   NaN   NaN  10000.0
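A side note on Step 5 (my addition, not part of the original answer): `reindex` inserts the missing year columns as all-NaN by default, and its `fill_value` parameter supplies a default for the newly created cells only. A minimal sketch:

```python
import numpy as np
import pandas as pd

# Sparse per-year costs for two buildings; 2024 and 2025 are absent
df = pd.DataFrame({2023: [1000.0, np.nan], 2026: [np.nan, 2000.0]})

# reindex creates the missing years as all-NaN columns...
full = df.reindex(columns=range(2023, 2027))

# ...unless fill_value supplies a default for the inserted cells
zeros = df.reindex(columns=range(2023, 2027), fill_value=0)
```

Note that `fill_value` only affects the columns `reindex` creates: the pre-existing NaN in the 2023 column stays NaN in `zeros`.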