Pandas value_counts(normalize=True) fails when an extension dtype is used. For example, an int8 Series containing pd.NA would typically use the Int8 extension dtype, but calling value_counts(normalize=True) on it raises an error: AttributeError: 'IntegerArray' object has no attribute 'sum'. What's the workaround?
pd.Series([1,pd.NA],dtype='Int8').value_counts(normalize=True)
This is believed to be a regression; see GH33317. The good news is that it is fixed in pandas 1.1.
pd.__version__
# '1.1.0.dev0+2004.g8d10bfb6f'
pd.Series([1, pd.NA], dtype='Int8').value_counts(normalize=True)
1 1.0
dtype: float64
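If code has to run on both older and newer installations, one option is to check the installed version before relying on normalize=True. The following is only a rough sketch: has_value_counts_fix is a hypothetical helper name, and the 1.1 threshold is taken from the statement above rather than verified against every release.

```python
import pandas as pd


def has_value_counts_fix(version: str = pd.__version__) -> bool:
    """Return True if the version string is 1.1 or newer.

    Hypothetical helper: it only inspects the leading "major.minor"
    components, so dev suffixes such as '1.1.0.dev0+2004...' are ignored.
    """
    major, minor = (int(part) for part in version.split(".")[:2])
    return (major, minor) >= (1, 1)


print(has_value_counts_fix())
```

On a version where this returns False, one of the workarounds listed further down would be needed instead.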
More Examples
s = pd.Series([1, 1, 1, 2, 2, 3, pd.NA], dtype='Int8')
s.value_counts()
1 3
2 2
3 1
dtype: Int64
s.value_counts(normalize=True)
1 0.500000
2 0.333333
3 0.166667
dtype: float64
s.value_counts(normalize=True, dropna=False)
1 0.428571
2 0.285714
NaN 0.142857
3 0.142857
dtype: float64
Each of the following can be used to work around the issue:
# 1) works if you're ok with dropping NA
pd.Series([1,pd.NA],dtype='Int8').dropna().astype(int).value_counts(normalize=True)
# 2) works if you're ok with switching to a non-extension datatype such as float
pd.Series([1,pd.NA],dtype='Int8').astype(float).value_counts(normalize=True)
# 3) The issue may be fixed in a future version of pandas. Try using a pandas version >= 1.1
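Another possibility, not taken from the answers above, is to skip normalize=True entirely and divide the plain counts by their total. This is a minimal sketch that assumes value_counts() without normalize works on the extension dtype (as the question implies); normalized_counts is a hypothetical helper name.

```python
import pandas as pd


def normalized_counts(s: pd.Series, dropna: bool = True) -> pd.Series:
    """Relative frequencies computed by hand instead of normalize=True."""
    counts = s.value_counts(dropna=dropna)
    # Cast the counts to plain float64 before dividing, so the result
    # does not depend on extension-array arithmetic.
    return counts.astype("float64") / float(counts.sum())


s = pd.Series([1, 1, 1, 2, 2, 3, pd.NA], dtype="Int8")
freq = normalized_counts(s)
print(freq)
```

Unlike workarounds 1 and 2, this keeps the original Series untouched and lets dropna=False include the missing values in the denominator.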