Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Insert rows and add missing data

I wonder if somebody could give a few pointers on how to proceed with the following. Being a newbie to Pandas, I feel at the moment my overall knowledge and skill level is not sufficient at the moment to be able to process the request I outline below.

I have a pandas dataframe which has a list of some 2000+ part numbers. For each part there are years of sale for the parts, a month number, a quantity sold and a sales value. For each year, there are likely to be occasional missing months. In the example data shown below for year 2007,month 11 is missing as there were no sales during that month. Similarly for 2008, months 11 & 12 are missing. What I would like to do is to insert the missing months for each year and insert a row containing the appropriate year, month and a zero value for the Qty and Sales within each part_id group.
In total the data is approx. 60200, rows with approx. 2000 part id's. I do not mind spending time on developing a solution but could do with a few pointers to aid my education.

index                     Part_ID  Year     Month    Qty           Sales
60182                       ZZSSL  2007      5       11.0          724.85   
60183                       ZZSSL  2007      6        7.0          537.94   
60184                       ZZSSL  2007      7       17.0         1165.02   
60185                       ZZSSL  2007      8        3.0          159.56   
60186                       ZZSSL  2007      9       67.0         4331.28   
60187                       ZZSSL  2007     10       72.0         4582.98   
60188                       ZZSSL  2007     12       42.0         2651.42   
60189                       ZZSSL  2008      1       22.0         1422.32   
60190                       ZZSSL  2008      2       16.0         1178.98   
60191                       ZZSSL  2008      3       20.0         1276.60   
60192                       ZZSSL  2008      4       28.0         2120.84   
60193                       ZZSSL  2008      5        2.0           83.03   
60194                       ZZSSL  2008      6       16.0         1250.24   
60195                       ZZSSL  2008      9       17.0         1323.34   
60196                       ZZSSL  2008     10        2.0          197.98   
60197                       ZZSSL  2009      1       21.0         1719.30   
60198                       ZZSSL  2009      2        1.0           78.15   
60199                       ZZSSL  2009      3        3.0          281.34   
60200                       ZZSSL  2009      4       25.0         2214.25   
60201                       ZZSSL  2009      5       10.0          833.60   
60202                       ZZSSL  2009      6        1.0           83.36   
60203                       ZZSSL  2009      7        1.0           83.36
like image 825
carlmorter Avatar asked Oct 19 '22 02:10

carlmorter


2 Answers

I think you need first set_index, then unstack and reindex columns by MultiIndex created from from_product with stack:

mux = pd.MultiIndex.from_product([['Qty','Sales'],np.arange(1,13)])

print (df.set_index(['Part_ID','Year', 'Month'])
         .unstack(fill_value=0)
         .reindex(columns=mux, fill_value=0)
         .stack()
         .rename_axis(['Part_ID','Year','Month'])
         .reset_index())
   Part_ID  Year  Month   Qty    Sales
0    ZZSSL  2007      1   0.0     0.00
1    ZZSSL  2007      2   0.0     0.00
2    ZZSSL  2007      3   0.0     0.00
3    ZZSSL  2007      4   0.0     0.00
4    ZZSSL  2007      5  11.0   724.85
5    ZZSSL  2007      6   7.0   537.94
6    ZZSSL  2007      7  17.0  1165.02
7    ZZSSL  2007      8   3.0   159.56
8    ZZSSL  2007      9  67.0  4331.28
9    ZZSSL  2007     10  72.0  4582.98
10   ZZSSL  2007     11   0.0     0.00
11   ZZSSL  2007     12  42.0  2651.42
12   ZZSSL  2008      1  22.0  1422.32
13   ZZSSL  2008      2  16.0  1178.98
14   ZZSSL  2008      3  20.0  1276.60
15   ZZSSL  2008      4  28.0  2120.84
16   ZZSSL  2008      5   2.0    83.03
17   ZZSSL  2008      6  16.0  1250.24
18   ZZSSL  2008      7   0.0     0.00
19   ZZSSL  2008      8   0.0     0.00
20   ZZSSL  2008      9  17.0  1323.34
21   ZZSSL  2008     10   2.0   197.98
22   ZZSSL  2008     11   0.0     0.00
23   ZZSSL  2008     12   0.0     0.00
24   ZZSSL  2009      1  21.0  1719.30
25   ZZSSL  2009      2   1.0    78.15
26   ZZSSL  2009      3   3.0   281.34
27   ZZSSL  2009      4  25.0  2214.25
28   ZZSSL  2009      5  10.0   833.60
29   ZZSSL  2009      6   1.0    83.36
30   ZZSSL  2009      7   1.0    83.36
31   ZZSSL  2009      8   0.0     0.00
32   ZZSSL  2009      9   0.0     0.00
33   ZZSSL  2009     10   0.0     0.00
34   ZZSSL  2009     11   0.0     0.00
35   ZZSSL  2009     12   0.0     0.00

If need only missing values between start and end Month for each year:

df['Month'] = pd.to_datetime(df.Month.astype(str) + '-01-' 
                                                  + df.Year.astype(str))
df = df.set_index('Month')
       .groupby(['Part_ID','Year'])
       .resample('MS')
       .asfreq()
       .fillna(0)
       .drop(['Part_ID','Year'], axis=1)
       .reset_index()
df['Month'] = df['Month'].dt.month 
print (df)
   Part_ID  Year  Month   Qty    Sales
0    ZZSSL  2007      5  11.0   724.85
1    ZZSSL  2007      6   7.0   537.94
2    ZZSSL  2007      7  17.0  1165.02
3    ZZSSL  2007      8   3.0   159.56
4    ZZSSL  2007      9  67.0  4331.28
5    ZZSSL  2007     10  72.0  4582.98
6    ZZSSL  2007     11   0.0     0.00
7    ZZSSL  2007     12  42.0  2651.42
8    ZZSSL  2008      1  22.0  1422.32
9    ZZSSL  2008      2  16.0  1178.98
10   ZZSSL  2008      3  20.0  1276.60
11   ZZSSL  2008      4  28.0  2120.84
12   ZZSSL  2008      5   2.0    83.03
13   ZZSSL  2008      6  16.0  1250.24
14   ZZSSL  2008      7   0.0     0.00
15   ZZSSL  2008      8   0.0     0.00
16   ZZSSL  2008      9  17.0  1323.34
17   ZZSSL  2008     10   2.0   197.98
18   ZZSSL  2009      1  21.0  1719.30
19   ZZSSL  2009      2   1.0    78.15
20   ZZSSL  2009      3   3.0   281.34
21   ZZSSL  2009      4  25.0  2214.25
22   ZZSSL  2009      5  10.0   833.60
23   ZZSSL  2009      6   1.0    83.36
24   ZZSSL  2009      7   1.0    83.36
like image 86
jezrael Avatar answered Oct 27 '22 15:10

jezrael


try this:

In [220]: r = (df.reset_index()
   .....:        .set_index(pd.to_datetime(df.Year.map(str) + '-' + df.Month.map(str).str.zfill(2) + '-01'))
   .....:        .resample('MS')
   .....: )

In [221]: new = r.pad().drop(['Qty','Sales'],1).join(r.mean().replace(np.nan, 0)[['Qty','Sales']])

In [222]: new.Month = new.index.month

In [223]: new.reset_index(drop=True)
Out[223]:
    index Part_ID  Year  Month   Qty    Sales
0   60182   ZZSSL  2007      5  11.0   724.85
1   60183   ZZSSL  2007      6   7.0   537.94
2   60184   ZZSSL  2007      7  17.0  1165.02
3   60185   ZZSSL  2007      8   3.0   159.56
4   60186   ZZSSL  2007      9  67.0  4331.28
5   60187   ZZSSL  2007     10  72.0  4582.98
6   60187   ZZSSL  2007     11   0.0     0.00
7   60188   ZZSSL  2007     12  42.0  2651.42
8   60189   ZZSSL  2008      1  22.0  1422.32
9   60190   ZZSSL  2008      2  16.0  1178.98
10  60191   ZZSSL  2008      3  20.0  1276.60
11  60192   ZZSSL  2008      4  28.0  2120.84
12  60193   ZZSSL  2008      5   2.0    83.03
13  60194   ZZSSL  2008      6  16.0  1250.24
14  60194   ZZSSL  2008      7   0.0     0.00
15  60194   ZZSSL  2008      8   0.0     0.00
16  60195   ZZSSL  2008      9  17.0  1323.34
17  60196   ZZSSL  2008     10   2.0   197.98
18  60196   ZZSSL  2008     11   0.0     0.00
19  60196   ZZSSL  2008     12   0.0     0.00
20  60197   ZZSSL  2009      1  21.0  1719.30
21  60198   ZZSSL  2009      2   1.0    78.15
22  60199   ZZSSL  2009      3   3.0   281.34
23  60200   ZZSSL  2009      4  25.0  2214.25
24  60201   ZZSSL  2009      5  10.0   833.60
25  60202   ZZSSL  2009      6   1.0    83.36
26  60203   ZZSSL  2009      7   1.0    83.36
like image 27
MaxU - stop WAR against UA Avatar answered Oct 27 '22 17:10

MaxU - stop WAR against UA