Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas crosstab - How to print rows/columns for values that don't exist in the data sets?

I am a beginner with pandas at best and I couldn't find a solution to this problem anywhere.

Let's say I have two variables: variable1, variable2.

They can have the following predefined values:

variable1 = ['1', '4', '9', '15', '20']
variable2 = ['2', '5', '6']

However, the current data set only has some of those values:

df = pd.DataFrame({variable1 : ['1', '9', '20'],
                  variable2 : ['2', '2', '6']})

When crossing the variables:

pd.crosstab(df.variable1, df.variable2)

I get:

variable2  2  6
variable1      
1          1  0
20         0  1
9          1  0

Is there a way to put all the possible categorical values in both the columns and the rows even if the current data set does not have all of them? The goal is to have a table of the same size when running the script with an updated data set which may have the values that were not present in the previous data set.

like image 242
somethingsomethingdangerzone Avatar asked Oct 27 '16 09:10

somethingsomethingdangerzone


2 Answers

Use DataFrame.reindex:

variable1 = ['1', '4', '9', '15', '20']
variable2 = ['2', '5', '6']


df = pd.DataFrame({'variable1' : ['1', '9', '20'],
                  'variable2' : ['2', '2', '6']})

print (df)                  
  variable1 variable2
0         1         2
1         9         2
2        20         6

df = pd.crosstab(df.variable1, df.variable2)
df = df.reindex(index=variable1, columns=variable2, fill_value=0)
print (df)
variable2  2  5  6
variable1         
1          1  0  0
4          0  0  0
9          1  0  0
15         0  0  0
20         0  0  1

from collections import OrderedDict


valuelabels = OrderedDict([('S8', [['1', 'Medical oncology'], 
                                   ['2', 'Hematology'], 
                                   ['3', 'Hematology/Oncology'], 
                                   ['4', 'Other']]), 
                           ('S9', [['1', 'Academic / Teaching Hospital'], 
                                   ['2', 'Community-Based Solo Private Practice'], 
                                   ['3', 'Community-Based Group Private Practice (record practice size )'], ['4', 'Community Non-Teaching Hospital'], 
                                   ['5', 'Comprehensive Cancer Center'], 
                                   ['6', 'Other (specify)']])])
#print (valuelabels)


df = pd.DataFrame({'variable1' : ['1', '2', '4'],
                  'variable2' : ['2', '3', '1']})

table = pd.crosstab(df.variable1, df.variable2)      
print (table)
variable2  1  2  3
variable1         
1          0  1  0
2          0  0  1
4          1  0  0
d1 = dict(list(zip([a[0] for a in valuelabels['S8']], [a[1] for a in valuelabels['S8']])))
print (d1)
{'4': 'Other', '1': 'Medical oncology', '2': 'Hematology', '3': 'Hematology/Oncology'}

d2 = dict(list(zip([a[0] for a in valuelabels['S9']], [a[1] for a in valuelabels['S9']])))
print (d2)
{'1': 'Academic / Teaching Hospital', 
'3': 'Community-Based Group Private Practice (record practice size )', 
'4': 'Community Non-Teaching Hospital', 
'6': 'Other (specify)', 
'2': 'Community-Based Solo Private Practice', 
'5': 'Comprehensive Cancer Center'}
table = table.reindex(index=[a[0] for a in valuelabels['S8']], 
                      columns=[a[0] for a in valuelabels['S9'], fill_value=0)
print (table)
variable2  1  2  3  4  5  6
variable1                  
1          0  1  0  0  0  0
2          0  0  1  0  0  0
3          0  0  0  0  0  0
4          1  0  0  0  0  0

table.index = table.index.to_series().map(d1).values
table.columns = table.columns.to_series().map(d2).values
print (table)
                     Academic / Teaching Hospital  \
Medical oncology                                0   
Hematology                                      0   
Hematology/Oncology                             0   
Other                                           1   

                     Community-Based Solo Private Practice  \
Medical oncology                                         1   
Hematology                                               0   
Hematology/Oncology                                      0   
Other                                                    0   

                     Community-Based Group Private Practice (record practice size )  \
Medical oncology                                                     0                
Hematology                                                           1                
Hematology/Oncology                                                  0                
Other                                                                0                

                     Community Non-Teaching Hospital  \
Medical oncology                                   0   
Hematology                                         0   
Hematology/Oncology                                0   
Other                                              0   

                     Comprehensive Cancer Center  Other (specify)  
Medical oncology                               0                0  
Hematology                                     0                0  
Hematology/Oncology                            0                0  
Other                                          0                0  
like image 66
jezrael Avatar answered Nov 13 '22 21:11

jezrael


You can use reindex:

ct = pd.crosstab(df.variable1, df.variable2)
ct.reindex(index=variable1, columns=variable2).fillna(0).astype('int')
Out: 
variable2  2  5  6
variable1         
1          1  0  0
4          0  0  0
9          1  0  0
15         0  0  0
20         0  0  1
like image 45
ayhan Avatar answered Nov 13 '22 19:11

ayhan