I have a pandas dataframe that looks like this:
| Index | Year | Month | Fruit | Count |
|---|---|---|---|---|
| 0 | 2021 | 03 | Apple | 2 |
| 1 | 2021 | 03 | Orange | 3 |
| 2 | 2021 | 03 | Mango | 4 |
| 3 | 2021 | 04 | Apple | 1 |
| 4 | 2021 | 04 | Mango | 2 |
| 5 | 2021 | 05 | Apple | 1 |
Let's say that Apple, Orange and Mango are the only fruits I am interested in.
The dataframe is formed such that if the count for a fruit (in a month and year) is 0, it won't show up as a record. I want to add the missing fruit row to the dataframe with count column value = 0. In short, I want each of the 3 fruits to show up for the year and month irrespective of the count being 0. So I want it to look like this:
| Index | Year | Month | Fruit | Count |
|---|---|---|---|---|
| 0 | 2021 | 03 | Apple | 2 |
| 1 | 2021 | 03 | Orange | 3 |
| 2 | 2021 | 03 | Mango | 4 |
| 3 | 2021 | 04 | Apple | 1 |
| x | 2021 | 04 | Orange | 0 |
| 4 | 2021 | 04 | Mango | 2 |
| 5 | 2021 | 05 | Apple | 1 |
| x | 2021 | 05 | Orange | 0 |
| x | 2021 | 05 | Mango | 0 |
Appreciate any suggestions on how to approach this.
You can use pyjanitor's complete to fill the missing rows:
import io
import pandas as pd
import janitor
data = '''Index Year Month Fruit Count
0 2021 03 Apple 2
1 2021 03 Orange 3
2 2021 03 Mango 4
3 2021 04 Apple 1
4 2021 04 Mango 2
5 2021 05 Apple 1'''
df = pd.read_csv(io.StringIO(data), sep='\t', index_col='Index')
df = df.complete('Fruit', ['Year', 'Month']).fillna(0, downcast='infer').sort_values(['Year', 'Month', 'Fruit'])
Output:
| Year | Month | Fruit | Count | |
|---|---|---|---|---|
| 0 | 2021 | 3 | Apple | 2 |
| 2 | 2021 | 3 | Mango | 4 |
| 1 | 2021 | 3 | Orange | 3 |
| 3 | 2021 | 4 | Apple | 1 |
| 4 | 2021 | 4 | Mango | 2 |
| 6 | 2021 | 4 | Orange | 0 |
| 5 | 2021 | 5 | Apple | 1 |
| 8 | 2021 | 5 | Mango | 0 |
| 7 | 2021 | 5 | Orange | 0 |
You can use Categorical to add Fruit as a category
Year = [2021, 2021, 2021, 2021, 2021, 2021]
Month = ["03","03", "03","04","04","05"]
Fruit = ["Apple", "Orange", "Mango", "Apple", "Mango", "Apple"]
Count = [2,3,4,1,2,1]
df = pd.DataFrame({"Year":Year, "Month":Month, "Fruit":Fruit, "Count":Count})
df['Fruit'] = pd.Categorical(df['Fruit'], categories=df['Fruit'].unique())
df = df.groupby(['Year','Month','Fruit'], as_index=False).first()
df['Count'] = df.Count.fillna(0).astype(int)
df
Output:
Year Month Fruit Count
0 2021 03 Apple 2
1 2021 03 Orange 3
2 2021 03 Mango 4
3 2021 04 Apple 1
4 2021 04 Orange 0
5 2021 04 Mango 2
6 2021 05 Apple 1
7 2021 05 Orange 0
8 2021 05 Mango 0
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With