Aggregate 2 subsets of a dataframe - keep original index & use the first subset's values as column headers

I have the following dataset:

import numpy as np
import pandas as pd

df0 = pd.DataFrame({'year_minor_renovation': ['2023', '2025', np.nan, '2026'],
                    'year_intermediate_renovation': [np.nan, '2025', '2027', '2030'],
                    'year_major_renovation': ['2030', np.nan, np.nan, np.nan],
                    'costs_minor_renovation': [1000, 3000, np.nan, 2000],
                    'costs_intermediate_renovation': [np.nan, 5000, 5000, 10000],
                    'costs_major_renovation': [75000, np.nan, np.nan, np.nan]})
   year_minor_renovation  year_intermediate_renovation  year_major_renovation  costs_minor_renovation  costs_intermediate_renovation  costs_major_renovation
0                   2023                            NaN                   2030                  1000.0                            NaN                 75000.0
1                   2025                           2025                    NaN                  3000.0                         5000.0                     NaN
2                    NaN                           2027                    NaN                     NaN                         5000.0                     NaN
3                   2026                           2030                    NaN                  2000.0                        10000.0                     NaN

Each line represents a building to renovate. The frame can be seen as two concatenated subsets sharing the same index:

  • The left half, df.iloc[:, :3], holds the years (between 2023 and 2030) in which one or more renovations need to be done on a given building (the index)
  • The right half, df.iloc[:, 3:], holds the corresponding costs (see the small sketch below)
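
For reference, here is one way to pull the two halves apart by column prefix (just a convenience sketch, not needed for the solution):

years = df0.filter(like='year_')    # left half, df0.iloc[:, :3]
costs = df0.filter(like='costs_')   # right half, df0.iloc[:, 3:]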

What I want

Some buildings will need different renovation types in the same or in different years (e.g. df.iloc[[1]], which has both a minor and an intermediate renovation in 2025).

I need to aggregate new columns, one per year, holding the costs per building, regardless of the renovation type:

pd.DataFrame({'2023': [1000, np.nan, np.nan, np.nan],
              '2024': [np.nan, np.nan, np.nan, np.nan],
              '2025': [np.nan, 8000, np.nan, np.nan],
              '2026': [np.nan, np.nan, np.nan, 2000],
              '2027': [np.nan, np.nan, 5000, np.nan],
              '2028': [np.nan, np.nan, np.nan, np.nan],
              '2029': [np.nan, np.nan, np.nan, np.nan],
              '2030': [75000, np.nan, np.nan, 10000]})
     2023  2024    2025    2026    2027  2028  2029     2030
0  1000.0   NaN     NaN     NaN     NaN   NaN   NaN  75000.0
1     NaN   NaN  8000.0     NaN     NaN   NaN   NaN      NaN
2     NaN   NaN     NaN     NaN  5000.0   NaN   NaN      NaN
3     NaN   NaN     NaN  2000.0     NaN   NaN   NaN  10000.0

What I tried

I tried to write a groupby function to create those new columns. The result does give some data I'll need later, but it aggregates too much for this step: the costs are summed per year across all buildings, so the per-building index is lost:

def costs_per_year(df):
    dfs = []
    for i in ['year_minor_renovation',
              'year_intermediate_renovation',
              'year_major_renovation']:
        j = 'costs' + i[4:]   # matching costs column, e.g. 'costs_minor_renovation'
        df_ = (df.groupby(i)
                 .agg({j: 'sum'})
                 .reset_index()
                 .rename({i: 'year'}, axis=1))
        dfs.append(df_)

    # merge the per-type dataframes on the year
    merged_df = dfs[0]
    for df_ in dfs[1:]:
        merged_df = merged_df.merge(df_, on='year', how='outer')

    merged_df = (merged_df
                 .set_index('year')
                 .transpose()
                 .reset_index())

    return merged_df
year                       index    2023    2025    2026    2027     2030
0         costs_minor_renovation  1000.0  3000.0  2000.0     NaN      NaN
1  costs_intermediate_renovation     NaN  5000.0     NaN  5000.0  10000.0
2         costs_major_renovation     NaN     NaN     NaN     NaN  75000.0
asked Sep 19 '25 by DBO5

1 Answer

You can use pd.wide_to_long:

out = (pd.wide_to_long(df0.reset_index(), stubnames=['year', 'costs'], i='index', j='var', sep='_', suffix='.*')
         .dropna().astype({'year': int})
         .pivot_table(index='index', columns='year', values='costs', aggfunc='sum')
         .rename_axis(index=None, columns=None))

out = out.reindex(columns=range(out.columns.min(), out.columns.max()+1))

Output:

>>> out
     2023  2024    2025    2026    2027  2028  2029     2030
0  1000.0   NaN     NaN     NaN     NaN   NaN   NaN  75000.0
1     NaN   NaN  8000.0     NaN     NaN   NaN   NaN      NaN
2     NaN   NaN     NaN     NaN  5000.0   NaN   NaN      NaN
3     NaN   NaN     NaN  2000.0     NaN   NaN   NaN  10000.0
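
For comparison, here is a rough alternative sketch (not the approach used above) that builds the long table by hand, pairing each year_* column with its costs_* column; on this data it produces the same result. The step-by-step below walks through the wide_to_long version.

# Sketch: stack the (year, costs) pairs manually, then pivot per building.
pairs = ['minor_renovation', 'intermediate_renovation', 'major_renovation']
long = pd.concat(
    df0[[f'year_{p}', f'costs_{p}']].set_axis(['year', 'costs'], axis=1)
    for p in pairs
).dropna()                                # drop renovations that do not apply

alt = (long.astype({'year': int})
           .rename_axis('building')       # the original row label identifies the building
           .reset_index()
           .pivot_table(index='building', columns='year',
                        values='costs', aggfunc='sum')
           .rename_axis(index=None, columns=None))
alt = alt.reindex(columns=range(alt.columns.min(), alt.columns.max() + 1))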

Step by step, to better understand the transformation:

# Step 1: flatten your dataframe
>>> out = pd.wide_to_long(df0.reset_index(), stubnames=['year', 'costs'], i='index', j='var', sep='_', suffix='.*')
                               year    costs
index var                                   
0     minor_renovation         2023   1000.0
1     minor_renovation         2025   3000.0
2     minor_renovation          NaN      NaN
3     minor_renovation         2026   2000.0
0     intermediate_renovation   NaN      NaN
1     intermediate_renovation  2025   5000.0
2     intermediate_renovation  2027   5000.0
3     intermediate_renovation  2030  10000.0
0     major_renovation         2030  75000.0
1     major_renovation          NaN      NaN
2     major_renovation          NaN      NaN
3     major_renovation          NaN      NaN
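
(Note: suffix='.*' matters here, because by default wide_to_long only matches numeric suffixes ('\d+'), while the suffixes in this frame are strings like minor_renovation.)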

# Step 2: drop incomplete rows and cast year to int
>>> out = out.dropna().astype({'year': int})
                               year    costs
index var                                   
0     minor_renovation         2023   1000.0
1     minor_renovation         2025   3000.0
3     minor_renovation         2026   2000.0
1     intermediate_renovation  2025   5000.0
2     intermediate_renovation  2027   5000.0
3     intermediate_renovation  2030  10000.0
0     major_renovation         2030  75000.0

# Step 3: reshape your dataframe
>>> out = out.pivot_table(index='index', columns='year', values='costs', aggfunc='sum')
year     2023    2025    2026    2027     2030
index                                         
0      1000.0     NaN     NaN     NaN  75000.0
1         NaN  8000.0     NaN     NaN      NaN
2         NaN     NaN     NaN  5000.0      NaN
3         NaN     NaN  2000.0     NaN  10000.0

# Step 4: rename axis
>>> out = out.rename_axis(index=None, columns=None)
     2023    2025    2026    2027     2030
0  1000.0     NaN     NaN     NaN  75000.0
1     NaN  8000.0     NaN     NaN      NaN
2     NaN     NaN     NaN  5000.0      NaN
3     NaN     NaN  2000.0     NaN  10000.0

# Step 5: add missing columns
>>> out = out.reindex(columns=range(out.columns.min(), out.columns.max()+1))
     2023  2024    2025    2026    2027  2028  2029     2030
0  1000.0   NaN     NaN     NaN     NaN   NaN   NaN  75000.0
1     NaN   NaN  8000.0     NaN     NaN   NaN   NaN      NaN
2     NaN   NaN     NaN     NaN  5000.0   NaN   NaN      NaN
3     NaN   NaN     NaN  2000.0     NaN   NaN   NaN  10000.0
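
If a fixed planning horizon is wanted no matter which years appear in the data, an explicit range can be passed instead (the 2023-2030 bounds below are simply the ones from the question):

out = out.reindex(columns=range(2023, 2031))    # fixed 2023-2030 horizon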
answered Sep 21 '25 by Corralien