
Pandas category shows different behaviour when equating dtype = 'float64' and dtype = 'category'

Tags: python, pandas

I am trying to use a loop to run some operations on the numeric and categorical columns of a pandas DataFrame.

import seaborn as sns

df = sns.load_dataset('diamonds')
print(df.dtypes, '\n')

carat       float64
cut        category
color      category
clarity    category
depth       float64
table       float64
price         int64
x           float64
y           float64
z           float64
dtype: object 

In the following code, I simply copied 'float64' and 'category' from the output above.

for i in df.columns:
    if df[i].dtypes in ['float64']:
        print(i)
        
for i in df.columns:
    if df[i].dtypes in ['category']:
        print(i)

This works for 'float64' but raises an error for 'category'.

Why is this? Thanks very much!

carat
depth
table
x
y
z
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-74-8e6aa9d4726e> in <module>
      4 
      5 for i in df.columns:
----> 6     if df[i].dtypes in ['category']:
      7         print(i)

TypeError: data type 'category' not understood
EBDS asked Oct 14 '22 20:10


1 Answer

Solution

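Try using pd.api.types.is_categorical_dtype (note that this function is deprecated in recent pandas 2.x releases, which recommend isinstance(dtype, pd.CategoricalDtype) instead):

import pandas as pd

for i in df.columns:
    if pd.api.types.is_categorical_dtype(df[i]):
        print(i)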

Or check the dtype name:

for i in df.columns:
    if df[i].dtype.name == 'category':
        print(i)

Output:

cut
color
clarity
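
If the goal is only to collect the categorical column names, the loop can be avoided altogether. Here is a minimal sketch using DataFrame.select_dtypes. The frame `sample` is a hypothetical hand-built stand-in for a few columns of the diamonds data, used so the snippet runs without downloading anything:

```python
import pandas as pd

# Hypothetical stand-in for a few columns of the diamonds dataset.
sample = pd.DataFrame({
    "carat": [0.23, 0.21, 0.23],
    "cut": pd.Categorical(["Ideal", "Premium", "Good"]),
    "color": pd.Categorical(["E", "E", "E"]),
    "price": [326, 326, 327],
})

# select_dtypes filters columns by dtype, so no dtype is compared by hand.
categorical_cols = sample.select_dtypes(include="category").columns.tolist()
float_cols = sample.select_dtypes(include="float64").columns.tolist()

print(categorical_cols)
print(float_cols)
```

On the real data the same call would be df.select_dtypes(include='category').columns.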

Explanation:

This is a known issue in pandas. The GitHub issue describing it says:

df.dtypes[colname] == 'category' evaluates as True for categorical columns and raises TypeError: data type "category" not understood for np.float64 columns.

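So the comparison itself does work for categorical columns, where it returns True. The problem is the float64 columns: their dtype is a NumPy dtype, and NumPy's dtype comparison does not know about pandas-only dtype names such as 'category', so it raises an error instead of returning False.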
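
The asymmetry can be reproduced without any DataFrame by comparing the dtype objects directly. The sketch below uses a hypothetical helper, `compare_dtype`, that catches the TypeError so all three comparisons can be shown side by side. How the last comparison behaves is an assumption that depends on the installed NumPy version: it is expected either to hit the TypeError branch, as in the traceback in the question, or (on some newer NumPy versions) to return False.

```python
import numpy as np
import pandas as pd

def compare_dtype(dtype, name):
    """Return the result of `dtype == name`, or the string 'TypeError' if it raises."""
    try:
        return bool(dtype == name)
    except TypeError:
        return "TypeError"

# pandas dtype vs. its own name: handled by pandas.
print(compare_dtype(pd.CategoricalDtype(), "category"))

# NumPy dtype vs. a name NumPy understands.
print(compare_dtype(np.dtype("float64"), "float64"))

# NumPy dtype vs. a pandas-only name: NumPy has to interpret 'category' itself.
print(compare_dtype(np.dtype("float64"), "category"))
```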

If you order the columns differently, so that the three categorical columns come first, the loop prints those column names and only raises the error once it reaches a float column:

>>> df = df.iloc[:, 1:]
>>> df
             cut color clarity  depth  table  price     x     y     z
0          Ideal     E     SI2   61.5   55.0    326  3.95  3.98  2.43
1        Premium     E     SI1   59.8   61.0    326  3.89  3.84  2.31
2           Good     E     VS1   56.9   65.0    327  4.05  4.07  2.31
3        Premium     I     VS2   62.4   58.0    334  4.20  4.23  2.63
4           Good     J     SI2   63.3   58.0    335  4.34  4.35  2.75
...          ...   ...     ...    ...    ...    ...   ...   ...   ...
53935      Ideal     D     SI1   60.8   57.0   2757  5.75  5.76  3.50
53936       Good     D     SI1   63.1   55.0   2757  5.69  5.75  3.61
53937  Very Good     D     SI1   62.8   60.0   2757  5.66  5.68  3.56
53938    Premium     H     SI2   61.0   58.0   2757  6.15  6.12  3.74
53939      Ideal     D     SI2   62.2   55.0   2757  5.83  5.87  3.64

[53940 rows x 9 columns]
>>> for i in df.columns:
    if df[i].dtypes in ['category']:
        print(i)

        
cut
color
clarity
Traceback (most recent call last):
  File "<pyshell#138>", line 2, in <module>
    if df[i].dtypes in ['category']:
TypeError: data type 'category' not understood
>>> 

As you can see, the categorical columns are printed, but as soon as the loop reaches an np.float64 column, NumPy's dtype __eq__ method tries to interpret 'category' as a NumPy dtype and raises the TypeError.
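
One way to make the loop independent of column order is to check the class of the dtype object instead of comparing it with a string. This is a minimal sketch, not the only fix: `sample` is a hypothetical frame that deliberately puts a float column first, and pd.CategoricalDtype is the class pandas uses for categorical dtypes.

```python
import pandas as pd

# Hypothetical frame with a float column first, as in the question.
sample = pd.DataFrame({
    "carat": [0.23, 0.21],
    "cut": pd.Categorical(["Ideal", "Premium"]),
    "depth": [61.5, 59.8],
    "clarity": pd.Categorical(["SI2", "SI1"]),
})

categorical_cols = []
for col in sample.columns:
    # isinstance only inspects the class of the dtype object, so no string
    # is ever handed to NumPy for interpretation.
    if isinstance(sample[col].dtype, pd.CategoricalDtype):
        categorical_cols.append(col)

print(categorical_cols)
```

Because no == comparison against a string takes place, the float columns are skipped silently wherever they appear.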

U12-Forward answered Oct 21 '22 09:10