I'm using Python 3.4.1 with numpy 0.10.1 and pandas 0.17.0. I have a large dataframe that lists species and gender of individual animals. It's a real-world dataset and there are, inevitably, missing values represented by NaN. A simplified version of the data can be generated as:
import numpy as np
import pandas as pd
tempDF = pd.DataFrame({ 'id': [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20],
'species': ["dog","dog",np.nan,"dog","dog","cat","cat","cat","dog","cat","cat","dog","dog","dog","dog",np.nan,"cat","cat","dog","dog"],
'gender': ["male","female","female","male","male","female","female",np.nan,"male","male","female","male","female","female","male","female","male","female",np.nan,"male"]})
Printing the dataframe gives:
gender id species
0 male 1 dog
1 female 2 dog
2 female 3 NaN
3 male 4 dog
4 male 5 dog
5 female 6 cat
6 female 7 cat
7 NaN 8 cat
8 male 9 dog
9 male 10 cat
10 female 11 cat
11 male 12 dog
12 female 13 dog
13 female 14 dog
14 male 15 dog
15 female 16 NaN
16 male 17 cat
17 female 18 cat
18 NaN 19 dog
19 male 20 dog
I want to generate a cross-tabulated table to show number of males and females in each species using the following:
pd.crosstab(tempDF['species'],tempDF['gender'])
This produces the following table:
gender female male
species
cat 4 2
dog 3 7
Which is what I'd expect. However, if I include the margins=True option, it produces:
pd.crosstab(tempDF['species'],tempDF['gender'],margins=True)
gender female male All
species
cat 4 2 7
dog 3 7 11
All 9 9 20
As you can see, the marginal totals appear to be incorrect, presumably caused by the missing data in the dataframe. Is this intended behaviour? In my mind, it seems very confusing. Surely marginal totals should be totals of rows and columns as they appear in the table and not include any missing data that isn't represented in the table. Including dropna=False does not affect the outcome.
I can delete any row with a NaN before creating the table but that seems to be a lot of extra work and a lot of extra things to think about when doing an analysis. Should I report this as a bug?
The crosstab() function is used to compute a simple cross tabulation of two (or more) factors. By default computes a frequency table of the factors unless an array of values and an aggregation function are passed.
NaN means missing data Missing data is labelled NaN. Note that np. nan is not equal to Python None.
I suppose one workaround would be to convert the NaNs to 'missing' before creating the table and then the cross-tubulation will include columns and rows specifically for missing values:
pd.crosstab(tempDF['species'].fillna('missing'),tempDF['gender'].fillna('missing'),margins=True)
gender female male missing All
species
cat 4 2 1 7
dog 3 7 1 11
missing 2 0 0 2
All 9 9 2 20
Personally, I would like to see that the default behaviour so I wouldn't have to remember to replace all the NaNs in every crosstab calculation.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With