I have the following pandas dataframe :
df = pd.DataFrame([
['A', 2017, 1],
['A', 2019, 1],
['B', 2017, 1],
['B', 2018, 1],
['C', 2016, 1],
['C', 2019, 1],
], columns=['ID', 'year', 'number'])
and am looking for the most efficient way to fill the missing years with a default value of 0 for the column number
The expected output is:
ID year number
0 A 2017 1
1 A 2018 0
2 A 2019 1
3 B 2017 1
4 B 2018 1
5 C 2016 1
6 C 2017 0
7 C 2018 0
8 C 2019 1
The dataframe that I have is relatively big, so I am looking for an efficient solution.
Edit:
This is the code that I have so far:
min_max_dict = df[['ID', 'year']].groupby('ID').agg([min, max]).to_dict('index')
new_ix = [[], []]
for id_ in df['ID'].unique():
for year in range(min_max_dict[id_][('year', 'min')], min_max_dict[id_][('year', 'max')]+1):
new_ix[0].append(id_)
new_ix[1].append(year)
df.set_index(['ID', 'year'], inplace=True)
df = df.reindex(new_ix, fill_value=0).reset_index()
Result
ID year number
0 A 2017 1
1 A 2018 0
2 A 2019 1
3 B 2017 1
4 B 2018 1
5 C 2016 1
6 C 2017 0
7 C 2018 0
8 C 2019 1
A slightly faster approach rather than using explode
is to use pd.Series constructor. And you can use .iloc if years are already sorted from earliest to latest.
idx = df.groupby('ID')['year'].apply(lambda x: pd.Series(np.arange(x.iloc[0], x.iloc[-1]+1))).reset_index()
df.set_index(['ID','year']).reindex(pd.MultiIndex.from_arrays([idx['ID'], idx['year']]), fill_value=0).reset_index()
Output:
ID year number
0 A 2017 1
1 A 2018 0
2 A 2019 1
3 B 2017 1
4 B 2018 1
5 C 2016 1
6 C 2017 0
7 C 2018 0
8 C 2019 1
Here is another approach with reindex
u = df.groupby('ID')['year'].apply(lambda x: range(x.min(),x.max()+1)).explode()
out = (df.set_index(['ID','year']).reindex(u.reset_index().to_numpy(),fill_value=0)
.reset_index())
ID year number
0 A 2017 1
1 A 2018 0
2 A 2019 1
3 B 2017 1
4 B 2018 1
5 C 2016 1
6 C 2017 0
7 C 2018 0
8 C 2019 1
t = df.groupby('ID')['year'].agg(['min','max']).reset_index()
t['missing'] = t.transform(lambda x: [y for y in range(x['min'], x['max']+1) if y not in x.values], axis=1)
t = t[['ID','missing']].explode('missing').dropna()
t['number'] = 0
t.columns = ['ID','year','number']
pd.concat([df,t]).sort_values(by=['ID','year'])
Output
ID year number
0 A 2017 1
0 A 2018 0
1 A 2019 1
2 B 2017 1
3 B 2018 1
4 C 2016 1
2 C 2017 0
2 C 2018 0
5 C 2019 1
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With