
Adding values for missing data combinations in Pandas

Tags: python, pandas

I've got a pandas data frame containing something like the following:

person_id   status    year    count
0           'pass'    1980    4
0           'fail'    1982    1
1           'pass'    1981    2
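
To reproduce this, the frame above could be built roughly as follows. This is a minimal sketch that assumes the status values are plain strings (the quotes in the table are taken as display only) and that the frame is named df:

```python
import pandas as pd

# Example frame from the table above; status values are assumed to be
# plain strings without the surrounding quotes.
df = pd.DataFrame({
    'person_id': [0, 0, 1],
    'status': ['pass', 'fail', 'pass'],
    'year': [1980, 1982, 1981],
    'count': [4, 1, 2],
})
```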

If I know that all possible values for each field are:

all_person_ids = [0, 1, 2]
all_statuses = ['pass', 'fail']
all_years = [1980, 1981, 1982]

I'd like to populate the original data frame with count=0 for missing data combinations (of person_id, status, and year), i.e. I'd like the new data frame to contain:

person_id   status    year    count
0           'pass'    1980    4
0           'pass'    1981    0
0           'pass'    1982    0
0           'fail'    1980    0
0           'fail'    1981    0
0           'fail'    1982    1
1           'pass'    1980    0
1           'pass'    1981    2
1           'pass'    1982    0
1           'fail'    1980    0
1           'fail'    1981    0
1           'fail'    1982    0
2           'pass'    1980    0
2           'pass'    1981    0
2           'pass'    1982    0
2           'fail'    1980    0
2           'fail'    1981    0
2           'fail'    1982    0

Is there an efficient way to achieve this in pandas?

Dave Challis asked Aug 03 '15 12:08


1 Answer

You can use itertools.product to generate every combination of the three fields and build a DataFrame from the result. Left-merge that with your original df so every combination is kept, then use fillna to set the missing count values to 0:

In [77]:
import itertools
import pandas as pd

all_person_ids = [0, 1, 2]
all_statuses = ['pass', 'fail']
all_years = [1980, 1981, 1982]
combined = [all_person_ids, all_statuses, all_years]
# One row per (person_id, status, year) combination
df1 = pd.DataFrame(columns=['person_id', 'status', 'year'],
                   data=list(itertools.product(*combined)))
df1

Out[77]:
    person_id status  year
0           0   pass  1980
1           0   pass  1981
2           0   pass  1982
3           0   fail  1980
4           0   fail  1981
5           0   fail  1982
6           1   pass  1980
7           1   pass  1981
8           1   pass  1982
9           1   fail  1980
10          1   fail  1981
11          1   fail  1982
12          2   pass  1980
13          2   pass  1981
14          2   pass  1982
15          2   fail  1980
16          2   fail  1981
17          2   fail  1982

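In [82]:
# The left merge keeps every combination; counts with no match become NaN
df1 = df1.merge(df, how='left', on=['person_id', 'status', 'year'])
# Replace NaN with 0 and cast back to int (NaN forces the column to float)
df1['count'] = df1['count'].fillna(0).astype(int)
df1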

Out[82]:
    person_id status  year  count
0           0   pass  1980      4
1           0   pass  1981      0
2           0   pass  1982      0
3           0   fail  1980      0
4           0   fail  1981      0
5           0   fail  1982      1
6           1   pass  1980      0
7           1   pass  1981      2
8           1   pass  1982      0
9           1   fail  1980      0
10          1   fail  1981      0
11          1   fail  1982      0
12          2   pass  1980      0
13          2   pass  1981      0
14          2   pass  1982      0
15          2   fail  1980      0
16          2   fail  1981      0
17          2   fail  1982      0
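
As an alternative sketch that is not part of the answer above, the same result can be obtained by reindexing onto a MultiIndex built with pd.MultiIndex.from_product. The names keys, full_index and result are illustrative. It assumes each (person_id, status, year) combination appears at most once in df, since reindexing needs a unique index:

```python
import pandas as pd

all_person_ids = [0, 1, 2]
all_statuses = ['pass', 'fail']
all_years = [1980, 1981, 1982]

# Example frame from the question (status assumed to be plain strings).
df = pd.DataFrame({
    'person_id': [0, 0, 1],
    'status': ['pass', 'fail', 'pass'],
    'year': [1980, 1982, 1981],
    'count': [4, 1, 2],
})

keys = ['person_id', 'status', 'year']

# Full Cartesian product of the known values, as a MultiIndex.
full_index = pd.MultiIndex.from_product(
    [all_person_ids, all_statuses, all_years], names=keys
)

# Reindex onto the full product; combinations absent from df get count=0.
# Using fill_value avoids introducing NaN, so count stays an integer column.
result = (
    df.set_index(keys)
      .reindex(full_index, fill_value=0)
      .reset_index()
)
```

Because the missing rows are filled directly during the reindex, no separate fillna or integer cast is needed here. If df could contain duplicate key combinations, they would have to be aggregated first (for example with groupby(keys).sum()).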

EdChum answered Nov 14 '22 12:11