Confusing behaviour of Pandas crosstab() function with dataframe containing NaN values

I'm using Python 3.4.1 with numpy 0.10.1 and pandas 0.17.0. I have a large dataframe that lists species and gender of individual animals. It's a real-world dataset and there are, inevitably, missing values represented by NaN. A simplified version of the data can be generated as:

import numpy as np
import pandas as pd
tempDF = pd.DataFrame({ 'id': [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20],
                        'species': ["dog","dog",np.nan,"dog","dog","cat","cat","cat","dog","cat","cat","dog","dog","dog","dog",np.nan,"cat","cat","dog","dog"],
                        'gender': ["male","female","female","male","male","female","female",np.nan,"male","male","female","male","female","female","male","female","male","female",np.nan,"male"]})

Printing the dataframe gives:

    gender  id species
0     male   1     dog
1   female   2     dog
2   female   3     NaN
3     male   4     dog
4     male   5     dog
5   female   6     cat
6   female   7     cat
7      NaN   8     cat
8     male   9     dog
9     male  10     cat
10  female  11     cat
11    male  12     dog
12  female  13     dog
13  female  14     dog
14    male  15     dog
15  female  16     NaN
16    male  17     cat
17  female  18     cat
18     NaN  19     dog
19    male  20     dog

I want to generate a cross-tabulated table showing the number of males and females in each species, using the following:

pd.crosstab(tempDF['species'],tempDF['gender'])

This produces the following table:

gender   female  male
species              
cat           4     2
dog           3     7

Which is what I'd expect. However, if I include the margins=True option, it produces:

pd.crosstab(tempDF['species'],tempDF['gender'],margins=True)

gender   female  male  All
species                   
cat           4     2    7
dog           3     7   11
All           9     9   20

As you can see, the marginal totals appear to be incorrect, presumably caused by the missing data in the dataframe. Is this intended behaviour? In my mind, it seems very confusing. Surely marginal totals should be totals of rows and columns as they appear in the table and not include any missing data that isn't represented in the table. Including dropna=False does not affect the outcome.
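
To make concrete what I mean by totals "as they appear in the table", here is a rough sketch that builds the margins by hand from the plain crosstab. It uses the same species and gender values as above; the variable names are just for illustration, and this is only one way to do it, not something crosstab() offers directly as far as I know.

```python
import numpy as np
import pandas as pd

species = ["dog", "dog", np.nan, "dog", "dog", "cat", "cat", "cat", "dog", "cat",
           "cat", "dog", "dog", "dog", "dog", np.nan, "cat", "cat", "dog", "dog"]
gender = ["male", "female", "female", "male", "male", "female", "female", np.nan, "male", "male",
          "female", "male", "female", "female", "male", "female", "male", "female", np.nan, "male"]
df = pd.DataFrame({"species": species, "gender": gender})

# Cells only count rows where both species and gender are present
table = pd.crosstab(df["species"], df["gender"])

# Build the margins from the visible cells only
table["All"] = table.sum(axis=1)      # row totals
table.loc["All"] = table.sum(axis=0)  # column totals, including the grand total

print(table)
```

With this approach the row totals are 6 for cat and 10 for dog, and the grand total is 16, which is what the cells actually add up to.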

I can delete any row with a NaN before creating the table but that seems to be a lot of extra work and a lot of extra things to think about when doing an analysis. Should I report this as a bug?
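
For reference, the row-dropping step I have in mind looks roughly like this. It is a minimal sketch on a small made-up frame (not the real dataset), and the names `df` and `complete` are only illustrative.

```python
import numpy as np
import pandas as pd

# Small illustrative frame, not the full dataset from the question
df = pd.DataFrame({
    "species": ["dog", "dog", np.nan, "cat", "cat", "cat"],
    "gender": ["male", "female", "female", "female", np.nan, "male"],
})

# Keep only the rows where both columns used in the table are present
complete = df.dropna(subset=["species", "gender"])

# With no NaN left, the margins are simply the sums of the visible cells
table = pd.crosstab(complete["species"], complete["gender"], margins=True)
print(table)
```

It works, but it has to be repeated for every pair of columns being tabulated, which is the extra bookkeeping I would like to avoid.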

asked Oct 23 '15 13:10

user1718097

1 Answer

I suppose one workaround would be to convert the NaNs to 'missing' before creating the table, so that the cross-tabulation includes columns and rows specifically for missing values:

pd.crosstab(tempDF['species'].fillna('missing'),tempDF['gender'].fillna('missing'),margins=True)

gender   female  male  missing  All
species                            
cat           4     2        1    7
dog           3     7        1   11
missing       2     0        0    2
All           9     9        2   20

Personally, I would like to see that as the default behaviour, so that I wouldn't have to remember to replace all the NaNs in every crosstab calculation.
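
Until then, one way to avoid repeating the fillna() calls is a small wrapper. This is only a sketch: `crosstab_with_missing` and its `label` parameter are names I made up, not part of pandas, and it assumes both inputs are Series.

```python
import numpy as np
import pandas as pd

def crosstab_with_missing(rows, cols, label="missing", **kwargs):
    """Cross-tabulate two Series, counting NaN under an explicit label.

    Hypothetical helper: extra keyword arguments (e.g. margins=True)
    are passed straight through to pd.crosstab.
    """
    return pd.crosstab(rows.fillna(label), cols.fillna(label), **kwargs)

# Small illustrative frame, not the full dataset from the question
df = pd.DataFrame({
    "species": ["dog", "dog", np.nan, "cat", "cat", "cat"],
    "gender": ["male", "female", "female", "female", np.nan, "male"],
})

table = crosstab_with_missing(df["species"], df["gender"], margins=True)
print(table)
```

Because every row now lands in some cell, the margins add up to the number of rows in the frame.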

answered Sep 27 '22 22:09

user1718097

Tags: python, pandas, dataframe, nan, crosstab
