I wonder if somebody could give a few pointers on how to proceed with the following. Being a newbie to Pandas, I feel at the moment my overall knowledge and skill level is not sufficient at the moment to be able to process the request I outline below.
I have a pandas dataframe which has a list of some 2000+ part numbers. For each part there are years of sale for the parts, a month number, a quantity sold and a sales value. For each year, there are likely to be occasional missing months. In the example data shown below for year 2007,month 11 is missing as there were no sales during that month. Similarly for 2008, months 11 & 12 are missing. What I would like to do is to insert the missing months for each year and insert a row containing the appropriate year, month and a zero value for the Qty and Sales within each part_id group.
In total the data is approx. 60200, rows with approx. 2000 part id's.
I do not mind spending time on developing a solution but could do with a few pointers to aid my education.
index Part_ID Year Month Qty Sales
60182 ZZSSL 2007 5 11.0 724.85
60183 ZZSSL 2007 6 7.0 537.94
60184 ZZSSL 2007 7 17.0 1165.02
60185 ZZSSL 2007 8 3.0 159.56
60186 ZZSSL 2007 9 67.0 4331.28
60187 ZZSSL 2007 10 72.0 4582.98
60188 ZZSSL 2007 12 42.0 2651.42
60189 ZZSSL 2008 1 22.0 1422.32
60190 ZZSSL 2008 2 16.0 1178.98
60191 ZZSSL 2008 3 20.0 1276.60
60192 ZZSSL 2008 4 28.0 2120.84
60193 ZZSSL 2008 5 2.0 83.03
60194 ZZSSL 2008 6 16.0 1250.24
60195 ZZSSL 2008 9 17.0 1323.34
60196 ZZSSL 2008 10 2.0 197.98
60197 ZZSSL 2009 1 21.0 1719.30
60198 ZZSSL 2009 2 1.0 78.15
60199 ZZSSL 2009 3 3.0 281.34
60200 ZZSSL 2009 4 25.0 2214.25
60201 ZZSSL 2009 5 10.0 833.60
60202 ZZSSL 2009 6 1.0 83.36
60203 ZZSSL 2009 7 1.0 83.36
I think you need first set_index
, then unstack
and reindex
columns by MultiIndex
created from from_product
with stack
:
mux = pd.MultiIndex.from_product([['Qty','Sales'],np.arange(1,13)])
print (df.set_index(['Part_ID','Year', 'Month'])
.unstack(fill_value=0)
.reindex(columns=mux, fill_value=0)
.stack()
.rename_axis(['Part_ID','Year','Month'])
.reset_index())
Part_ID Year Month Qty Sales
0 ZZSSL 2007 1 0.0 0.00
1 ZZSSL 2007 2 0.0 0.00
2 ZZSSL 2007 3 0.0 0.00
3 ZZSSL 2007 4 0.0 0.00
4 ZZSSL 2007 5 11.0 724.85
5 ZZSSL 2007 6 7.0 537.94
6 ZZSSL 2007 7 17.0 1165.02
7 ZZSSL 2007 8 3.0 159.56
8 ZZSSL 2007 9 67.0 4331.28
9 ZZSSL 2007 10 72.0 4582.98
10 ZZSSL 2007 11 0.0 0.00
11 ZZSSL 2007 12 42.0 2651.42
12 ZZSSL 2008 1 22.0 1422.32
13 ZZSSL 2008 2 16.0 1178.98
14 ZZSSL 2008 3 20.0 1276.60
15 ZZSSL 2008 4 28.0 2120.84
16 ZZSSL 2008 5 2.0 83.03
17 ZZSSL 2008 6 16.0 1250.24
18 ZZSSL 2008 7 0.0 0.00
19 ZZSSL 2008 8 0.0 0.00
20 ZZSSL 2008 9 17.0 1323.34
21 ZZSSL 2008 10 2.0 197.98
22 ZZSSL 2008 11 0.0 0.00
23 ZZSSL 2008 12 0.0 0.00
24 ZZSSL 2009 1 21.0 1719.30
25 ZZSSL 2009 2 1.0 78.15
26 ZZSSL 2009 3 3.0 281.34
27 ZZSSL 2009 4 25.0 2214.25
28 ZZSSL 2009 5 10.0 833.60
29 ZZSSL 2009 6 1.0 83.36
30 ZZSSL 2009 7 1.0 83.36
31 ZZSSL 2009 8 0.0 0.00
32 ZZSSL 2009 9 0.0 0.00
33 ZZSSL 2009 10 0.0 0.00
34 ZZSSL 2009 11 0.0 0.00
35 ZZSSL 2009 12 0.0 0.00
If need only missing values between start and end Month
for each year
:
df['Month'] = pd.to_datetime(df.Month.astype(str) + '-01-'
+ df.Year.astype(str))
df = df.set_index('Month')
.groupby(['Part_ID','Year'])
.resample('MS')
.asfreq()
.fillna(0)
.drop(['Part_ID','Year'], axis=1)
.reset_index()
df['Month'] = df['Month'].dt.month
print (df)
Part_ID Year Month Qty Sales
0 ZZSSL 2007 5 11.0 724.85
1 ZZSSL 2007 6 7.0 537.94
2 ZZSSL 2007 7 17.0 1165.02
3 ZZSSL 2007 8 3.0 159.56
4 ZZSSL 2007 9 67.0 4331.28
5 ZZSSL 2007 10 72.0 4582.98
6 ZZSSL 2007 11 0.0 0.00
7 ZZSSL 2007 12 42.0 2651.42
8 ZZSSL 2008 1 22.0 1422.32
9 ZZSSL 2008 2 16.0 1178.98
10 ZZSSL 2008 3 20.0 1276.60
11 ZZSSL 2008 4 28.0 2120.84
12 ZZSSL 2008 5 2.0 83.03
13 ZZSSL 2008 6 16.0 1250.24
14 ZZSSL 2008 7 0.0 0.00
15 ZZSSL 2008 8 0.0 0.00
16 ZZSSL 2008 9 17.0 1323.34
17 ZZSSL 2008 10 2.0 197.98
18 ZZSSL 2009 1 21.0 1719.30
19 ZZSSL 2009 2 1.0 78.15
20 ZZSSL 2009 3 3.0 281.34
21 ZZSSL 2009 4 25.0 2214.25
22 ZZSSL 2009 5 10.0 833.60
23 ZZSSL 2009 6 1.0 83.36
24 ZZSSL 2009 7 1.0 83.36
try this:
In [220]: r = (df.reset_index()
.....: .set_index(pd.to_datetime(df.Year.map(str) + '-' + df.Month.map(str).str.zfill(2) + '-01'))
.....: .resample('MS')
.....: )
In [221]: new = r.pad().drop(['Qty','Sales'],1).join(r.mean().replace(np.nan, 0)[['Qty','Sales']])
In [222]: new.Month = new.index.month
In [223]: new.reset_index(drop=True)
Out[223]:
index Part_ID Year Month Qty Sales
0 60182 ZZSSL 2007 5 11.0 724.85
1 60183 ZZSSL 2007 6 7.0 537.94
2 60184 ZZSSL 2007 7 17.0 1165.02
3 60185 ZZSSL 2007 8 3.0 159.56
4 60186 ZZSSL 2007 9 67.0 4331.28
5 60187 ZZSSL 2007 10 72.0 4582.98
6 60187 ZZSSL 2007 11 0.0 0.00
7 60188 ZZSSL 2007 12 42.0 2651.42
8 60189 ZZSSL 2008 1 22.0 1422.32
9 60190 ZZSSL 2008 2 16.0 1178.98
10 60191 ZZSSL 2008 3 20.0 1276.60
11 60192 ZZSSL 2008 4 28.0 2120.84
12 60193 ZZSSL 2008 5 2.0 83.03
13 60194 ZZSSL 2008 6 16.0 1250.24
14 60194 ZZSSL 2008 7 0.0 0.00
15 60194 ZZSSL 2008 8 0.0 0.00
16 60195 ZZSSL 2008 9 17.0 1323.34
17 60196 ZZSSL 2008 10 2.0 197.98
18 60196 ZZSSL 2008 11 0.0 0.00
19 60196 ZZSSL 2008 12 0.0 0.00
20 60197 ZZSSL 2009 1 21.0 1719.30
21 60198 ZZSSL 2009 2 1.0 78.15
22 60199 ZZSSL 2009 3 3.0 281.34
23 60200 ZZSSL 2009 4 25.0 2214.25
24 60201 ZZSSL 2009 5 10.0 833.60
25 60202 ZZSSL 2009 6 1.0 83.36
26 60203 ZZSSL 2009 7 1.0 83.36
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With