Aggregate 2 subsets of a dataframe - keep original index & use the first subset's values as column headers

Question

I have a dataset like this:

import numpy as np
import pandas as pd

df0 = pd.DataFrame({'year_minor_renovation': ['2023', '2025', np.nan, '2026'],
                    'year_intermediate_renovation': [np.nan, '2025', '2027', '2030'],
                    'year_major_renovation': ['2030', np.nan, np.nan, np.nan],
                    'costs_minor_renovation': [1000, 3000, np.nan, 2000],
                    'costs_intermediate_renovation': [np.nan, 5000, 5000, 10000],
                    'costs_major_renovation': [75000, np.nan, np.nan, np.nan]})

	year_minor_renovation	year_intermediate_renovation	year_major_renovation	costs_minor_renovation	costs_intermediate_renovation	costs_major_renovation
0	2023	NaN	2030	1000.0	NaN	75000.0
1	2025	2025	NaN	3000.0	5000.0	NaN
2	NaN	2027	NaN	NaN	5000.0	NaN
3	2026	2030	NaN	2000.0	10000.0	NaN

Each row represents a building to renovate. The dataframe can be seen as two concatenated subsets sharing the same index:

Left half, df0.iloc[:, :3]: the years, between 2023 and 2030, in which one or more renovations must be done on a given building (the index)
Right half, df0.iloc[:, 3:]: the corresponding costs
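
Because the two halves follow a naming pattern, they can also be selected by column prefix instead of by position. This is a minimal sketch, assuming the year_ and costs_ prefixes shown above; the variable names years and costs are illustrative only.

```python
import numpy as np
import pandas as pd

df0 = pd.DataFrame({'year_minor_renovation': ['2023', '2025', np.nan, '2026'],
                    'year_intermediate_renovation': [np.nan, '2025', '2027', '2030'],
                    'year_major_renovation': ['2030', np.nan, np.nan, np.nan],
                    'costs_minor_renovation': [1000, 3000, np.nan, 2000],
                    'costs_intermediate_renovation': [np.nan, 5000, 5000, 10000],
                    'costs_major_renovation': [75000, np.nan, np.nan, np.nan]})

# Select each half by its column prefix rather than by position
years = df0.filter(regex=r'^year_')
costs = df0.filter(regex=r'^costs_')

# The renovation types (the part after the prefix) should line up pairwise
year_kinds = [c[len('year_'):] for c in years.columns]
cost_kinds = [c[len('costs_'):] for c in costs.columns]
print(year_kinds == cost_kinds)
```

Selecting by prefix keeps working if the column order changes, which positional slicing with iloc does not guarantee.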

What I want

Some buildings will need different renovation types in different years (e.g. df0.iloc[[1]]).

I need to aggregate the costs into new columns, one per year, with the total cost per building, regardless of the renovation type.

(pd.DataFrame({'2023': [1000, np.nan, np.nan, np.nan],
              '2024': [np.nan, np.nan, np.nan, np.nan],
              '2025': [np.nan, 8000, np.nan, np.nan],
              '2026': [np.nan, np.nan, np.nan, 2000],
              '2027': [np.nan, np.nan, 5000, np.nan],
              '2028': [np.nan, np.nan, np.nan, np.nan],
              '2029': [np.nan, np.nan, np.nan, np.nan],
              '2030': [75000, np.nan, np.nan, 10000]}))

	2023	2024	2025	2026	2027	2028	2029	2030
0	1000.0	NaN	NaN	NaN	NaN	NaN	NaN	75000.0
1	NaN	NaN	8000.0	NaN	NaN	NaN	NaN	NaN
2	NaN	NaN	NaN	NaN	5000.0	NaN	NaN	NaN
3	NaN	NaN	NaN	2000.0	NaN	NaN	NaN	10000.0

What I tried

I tried to write a groupby function to create those new columns. The result contains some data I'll need later, but it is aggregated too far for what I want at this point, because the building index is lost:

def costs_per_year(df):
    dfs = []
    for i in ['year_minor_renovation',
              'year_intermediate_renovation',
              'year_major_renovation']:
        # matching costs column, e.g. 'costs_minor_renovation'
        j = 'costs' + i[4:]
        df_ = (df.groupby(i)
               .agg({j: 'sum'})
               .reset_index()
               .rename({i: 'year'}, axis=1)
              )
        dfs.append(df_)

    # merge the per-type dataframes on the year
    merged_df = dfs[0]
    for df_ in dfs[1:]:
        merged_df = merged_df.merge(df_, on='year', how='outer')

    merged_df = (merged_df
                 .set_index('year')
                 .transpose()
                 .reset_index()
                )

    return merged_df

year	index	2023	2025	2026	2027	2030
0	costs_minor_renovation	1000.0	3000.0	2000.0	NaN	NaN
1	costs_intermediate_renovation	NaN	5000.0	NaN	5000.0	10000.0
2	costs_major_renovation	NaN	NaN	NaN	NaN	75000.0

Corralien · Accepted Answer

You can use pd.wide_to_long:

out = (pd.wide_to_long(df0.reset_index(), stubnames=['year', 'costs'], i='index', j='var', sep='_', suffix='.*')
         .dropna().astype({'year': int})
         .pivot_table(index='index', columns='year', values='costs', aggfunc='sum')
         .rename_axis(index=None, columns=None))

out = out.reindex(columns=range(out.columns.min(), out.columns.max()+1))

Output:

>>> out
     2023  2024    2025    2026    2027  2028  2029     2030
0  1000.0   NaN     NaN     NaN     NaN   NaN   NaN  75000.0
1     NaN   NaN  8000.0     NaN     NaN   NaN   NaN      NaN
2     NaN   NaN     NaN     NaN  5000.0   NaN   NaN      NaN
3     NaN   NaN     NaN  2000.0     NaN   NaN   NaN  10000.0
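
For comparison, the same table can be built without pd.wide_to_long by stacking one (year, costs) pair per renovation type and grouping on building and year. This is only a sketch of an alternative, not part of the answer above; the names kinds, building, long_df and alt are hypothetical, and the 2023 to 2030 range is taken from the question.

```python
import numpy as np
import pandas as pd

df0 = pd.DataFrame({'year_minor_renovation': ['2023', '2025', np.nan, '2026'],
                    'year_intermediate_renovation': [np.nan, '2025', '2027', '2030'],
                    'year_major_renovation': ['2030', np.nan, np.nan, np.nan],
                    'costs_minor_renovation': [1000, 3000, np.nan, 2000],
                    'costs_intermediate_renovation': [np.nan, 5000, 5000, 10000],
                    'costs_major_renovation': [75000, np.nan, np.nan, np.nan]})

# Renovation types, as they appear in the column names (assumed from the question)
kinds = ['minor', 'intermediate', 'major']

# One long frame per renovation type: building id, year, costs
parts = [
    pd.DataFrame({'building': df0.index,
                  'year': df0[f'year_{kind}_renovation'],
                  'costs': df0[f'costs_{kind}_renovation']})
    for kind in kinds
]

# Stack the parts, drop incomplete rows, and make the years numeric
long_df = pd.concat(parts, ignore_index=True).dropna().astype({'year': int})

# Sum costs per building and year, then spread the years into columns
alt = (long_df.groupby(['building', 'year'])['costs'].sum()
              .unstack('year')
              .rename_axis(index=None, columns=None)
              .reindex(columns=range(2023, 2031)))
print(alt)
```

Grouping on both the building and the year is what preserves the original index; grouping on the year alone, as in the question's attempt, collapses all buildings together.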

Step by step, to better understand the pd.wide_to_long transformation:

# Step 1: flatten your dataframe
>>> out = pd.wide_to_long(df0.reset_index(), stubnames=['year', 'costs'], i='index', j='var', sep='_', suffix='.*')
                               year    costs
index var                                   
0     minor_renovation         2023   1000.0
1     minor_renovation         2025   3000.0
2     minor_renovation          NaN      NaN
3     minor_renovation         2026   2000.0
0     intermediate_renovation   NaN      NaN
1     intermediate_renovation  2025   5000.0
2     intermediate_renovation  2027   5000.0
3     intermediate_renovation  2030  10000.0
0     major_renovation         2030  75000.0
1     major_renovation          NaN      NaN
2     major_renovation          NaN      NaN
3     major_renovation          NaN      NaN

# Step 2: drop rows without a renovation and cast year to int
>>> out = out.dropna().astype({'year': int})
                               year    costs
index var                                   
0     minor_renovation         2023   1000.0
1     minor_renovation         2025   3000.0
3     minor_renovation         2026   2000.0
1     intermediate_renovation  2025   5000.0
2     intermediate_renovation  2027   5000.0
3     intermediate_renovation  2030  10000.0
0     major_renovation         2030  75000.0

# Step 3: reshape your dataframe
>>> out = out.pivot_table(index='index', columns='year', values='costs', aggfunc='sum')
year     2023    2025    2026    2027     2030
index                                         
0      1000.0     NaN     NaN     NaN  75000.0
1         NaN  8000.0     NaN     NaN      NaN
2         NaN     NaN     NaN  5000.0      NaN
3         NaN     NaN  2000.0     NaN  10000.0

# Step 4: rename axis
>>> out = out.rename_axis(index=None, columns=None)
     2023    2025    2026    2027     2030
0  1000.0     NaN     NaN     NaN  75000.0
1     NaN  8000.0     NaN     NaN      NaN
2     NaN     NaN     NaN  5000.0      NaN
3     NaN     NaN  2000.0     NaN  10000.0

# Step 5: add missing columns
>>> out = out.reindex(columns=range(out.columns.min(), out.columns.max()+1))
     2023  2024    2025    2026    2027  2028  2029     2030
0  1000.0   NaN     NaN     NaN     NaN   NaN   NaN  75000.0
1     NaN   NaN  8000.0     NaN     NaN   NaN   NaN      NaN
2     NaN   NaN     NaN     NaN  5000.0   NaN   NaN      NaN
3     NaN   NaN     NaN  2000.0     NaN   NaN   NaN  10000.0
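
The five steps can be wrapped in one function. The sketch below makes two additions that are assumptions and not part of the answer: the year range can be passed explicitly (the question mentions 2023 to 2030), and rows are reindexed on the original index so that a building with no renovation at all is not dropped by the pivot. The function name costs_by_year and its parameters are hypothetical.

```python
import numpy as np
import pandas as pd

def costs_by_year(df, first_year=None, last_year=None):
    """Total renovation costs per building (rows) and per year (columns)."""
    # Steps 1-2: wide to long, drop empty pairs, make the years numeric
    long_df = (pd.wide_to_long(df.reset_index(), stubnames=['year', 'costs'],
                               i='index', j='var', sep='_', suffix='.*')
                 .dropna().astype({'year': int}))
    # Steps 3-4: one column per year, costs summed per building
    wide = (long_df.pivot_table(index='index', columns='year',
                                values='costs', aggfunc='sum')
                   .rename_axis(index=None, columns=None))
    # Step 5: fill in missing years; fall back to the observed range
    lo = wide.columns.min() if first_year is None else first_year
    hi = wide.columns.max() if last_year is None else last_year
    # Assumption: also restore buildings that have no renovation at all
    return wide.reindex(index=df.index, columns=range(lo, hi + 1))

df0 = pd.DataFrame({'year_minor_renovation': ['2023', '2025', np.nan, '2026'],
                    'year_intermediate_renovation': [np.nan, '2025', '2027', '2030'],
                    'year_major_renovation': ['2030', np.nan, np.nan, np.nan],
                    'costs_minor_renovation': [1000, 3000, np.nan, 2000],
                    'costs_intermediate_renovation': [np.nan, 5000, 5000, 10000],
                    'costs_major_renovation': [75000, np.nan, np.nan, np.nan]})

out = costs_by_year(df0, first_year=2023, last_year=2030)
print(out)
```

Passing the range explicitly keeps the columns stable even when no building happens to have a renovation in the first or last year.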
