I have a data with a country, year and value column. the max(year) is 1985 and min(year) is 2016 but not all country have data of all year(1985-2016). So to plot comparable bar plot between countries, I want to add value 0 for missing year for each country.
for example:
df -->
country year value
0 India 2040 354
1 India 2041 357
2 India 2042 454
3 USA 2040 454
4 USA 2041 436
As USA don't have 2042 data so adding it result into:
country year value
0 India 2040 354
1 India 2041 357
2 India 2042 454
3 USA 2040 454
4 USA 2041 436
5 USA 2042 0
How to do it for each country in my data?
We can convert "year" to a Categorical column and then allow pandas GroupBy to do the heavy lifting:
df['year'] = pd.Categorical(df['year'], categories=df['year'].unique())
df.groupby(['country','year'], as_index=False).first()
country year value
0 India 2040 354.0
1 India 2041 357.0
2 India 2042 454.0
3 USA 2040 454.0
4 USA 2041 436.0
5 USA 2042 NaN
Another idea is reindexing:
mux = pd.MultiIndex.from_product([df['country'].unique(), df['year'].unique()])
(df.set_index(['country', 'year'])
.reindex(mux)
.reset_index()
.set_axis(df.columns, axis=1))
country year value
0 India 2040 354.0
1 India 2041 357.0
2 India 2042 454.0
3 USA 2040 454.0
4 USA 2041 436.0
5 USA 2042 NaN
Important caveat: Neither of these solutions will handle duplicate rows well. You will need to dedupe the rows by adding a uniquely identifying column, possibly using GroupBy.cumcount.
Let us try pivot then stack
out = df.pivot(*df).stack(dropna=False).reset_index(name='value')
country year value
0 India 2040 354.0
1 India 2041 357.0
2 India 2042 454.0
3 USA 2040 454.0
4 USA 2041 436.0
5 USA 2042 NaN
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With