Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Can I use pandas' pivot_table to aggregate over a column with missing values?

Tags:

Can I use pandas pivot_table to aggregate over a column with missing values and have those missing values included as separate category?

In:
df = pd.DataFrame({'a': pd.Series(['X', 'X', 'Y', 'Y', 'N', 'N'], dtype='category'), 
                   'b': pd.Series([None, None, 'd', 'd', 'd', 'd'], dtype='category')})

Out:
    a   b
0   X   NaN
1   X   NaN
2   Y   d
3   Y   d
4   N   d
5   N   d

In:
df.groupby('a')['b'].apply(lambda x: x.value_counts(dropna=False)).unstack(1)

Out:
    NaN d
a       
N   NaN 2.0
X   2.0 0.0
Y   NaN 2.0

Can I achieve the same result using pandas pivot_table? If yes than how? Thanks.

like image 216
Roman Shevtsiv Avatar asked Nov 23 '20 20:11

Roman Shevtsiv


People also ask

How do Pandas deal with missing values?

Checking for missing values using isnull() and notnull() In order to check missing values in Pandas DataFrame, we use a function isnull() and notnull(). Both function help in checking whether a value is NaN or not. These function can also be used in Pandas Series in order to find null values in a series.

How do you get the sum of missing values in Pandas?

When applied to a dataframe, Pandas isna() function return boolean dataframe with True with the element is missing value and False when it is not a missing value. We can use Pandas' sum() function to get the counts of missing values per each column in the dataframe. By default, Pandas sum() adds across columns.

What is pivot_table in Pandas?

Pivot table in pandas is an excellent tool to summarize one or more numeric variable based on two other categorical variables. Pivot tables in pandas are popularly seen in MS Excel files. In python, Pivot tables of pandas dataframes can be created using the command: pandas. pivot_table .


1 Answers

For some unknown reason, dtype="category" does not work with pivot_table() when counting NaN values. Casting them to regular strings enables regular pivot_table(aggfunc="size").

df.astype(str).pivot_table(index="a", columns="b", aggfunc="size")    

Result

b    d  nan
a          
N  2.0  NaN
X  NaN  2.0
Y  2.0  NaN

One can optionally do .fillna(0) to replace nans with 0s

like image 112
Bill Huang Avatar answered Sep 30 '22 19:09

Bill Huang