I am new to Pandas. I want a simple and generic way to find which columns are categorical in my DataFrame when I don't manually specify each column type, unlike in this SO question. The df is created with:
import pandas as pd
df = pd.read_csv("test.csv", header=None)
e.g.
          0         1         2         3       4
0  1.539240  0.423437 -0.687014   Chicago  Safari
1  0.815336  0.913623  1.800160    Boston  Safari
2  0.821214 -0.824839  0.483724  New York  Safari
UPDATE (2018/02/04): The question assumes numerical columns are NOT categorical; @Zero's accepted answer solves that case.
BE CAREFUL: as @Sagarkar's comment points out, that assumption is not always true. The difficulty is that data types and categorical/ordinal/nominal types are orthogonal concepts, so mapping between them isn't straightforward. @Jeff's answer below shows precisely how to do the mapping manually.
Real-world data often includes text columns whose values repeat. Features such as gender, country, and codes are typically repetitive. These are examples of categorical data: a categorical variable can take on only a limited, and usually fixed, number of possible values.
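Because the dtype alone does not say whether a column is categorical, one option is to mark the columns by hand and then select them by the 'category' dtype. The following is only a sketch: the column names (zip_code, city, score) and the values are made up for illustration and are not from the question's data.

```python
import pandas as pd

# Hypothetical data: "zip_code" is stored as integers but is really a label,
# and "city" is a repetitive text column.
df = pd.DataFrame({
    "zip_code": [60601, 2108, 60601, 10001],
    "city": ["Chicago", "Boston", "Chicago", "New York"],
    "score": [1.5, 0.8, 0.3, -0.7],
})

# Manually mark the columns we consider categorical.
df = df.astype({"zip_code": "category", "city": "category"})

# Once marked, the columns can be selected by dtype.
categorical_cols = df.select_dtypes(include=["category"]).columns.tolist()
print(categorical_cols)                  # ['zip_code', 'city']

# Each categorical column exposes its fixed set of possible values.
print(list(df["city"].cat.categories))   # ['Boston', 'Chicago', 'New York']
```

Note that zip_code is numeric on disk yet ends up in the categorical list, which is exactly the case the numeric/non-numeric split misses.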
You could use df._get_numeric_data() to get the numeric columns, and then treat the remaining columns as categorical:
In [66]: cols = df.columns
In [67]: num_cols = df._get_numeric_data().columns
In [68]: num_cols
Out[68]: Index([u'0', u'1', u'2'], dtype='object')
In [69]: list(set(cols) - set(num_cols))
Out[69]: ['3', '4']
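The leading underscore marks _get_numeric_data() as a private method, so it may change between pandas versions. A sketch of the same idea using only the public select_dtypes method might look like this; the DataFrame below is a small stand-in for test.csv built from the sample rows in the question, and the variable name non_num_cols is illustrative.

```python
import pandas as pd

# Stand-in for the question's test.csv (read with header=None, so the
# column labels are the integers 0..4).
df = pd.DataFrame({
    0: [1.539240, 0.815336, 0.821214],
    1: [0.423437, 0.913623, -0.824839],
    2: [-0.687014, 1.800160, 0.483724],
    3: ["Chicago", "Boston", "New York"],
    4: ["Safari", "Safari", "Safari"],
})

# Public equivalent of "get the numeric columns".
num_cols = set(df.select_dtypes(include=["number"]).columns)

# Everything that is not numeric, kept in the original column order
# (a plain set difference does not guarantee order).
non_num_cols = [col for col in df.columns if col not in num_cols]
print(non_num_cols)  # [3, 4]
```

As the update in the question warns, this only finds non-numeric columns; a numeric column that encodes categories would not be detected.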
The way I found was updating to Pandas v0.16.0, then excluding number dtypes with:
df.select_dtypes(exclude=["number","bool_","object_"])
This works provided no dtypes are changed and no new ones are added to NumPy. In the question's comments, @Jeff suggests include=["category"], but that didn't seem to work.
NumPy Types: link
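One plausible reason include=["category"] returned nothing: read_csv does not create 'category' columns unless asked to, so there is nothing of that dtype to select until a column is converted. The sketch below assumes the question's three sample rows as CSV text (read from an in-memory string rather than test.csv) and is not taken from the original answer.

```python
import io

import pandas as pd

# The question's sample rows, assumed to be what test.csv contains.
csv_text = (
    "1.539240,0.423437,-0.687014,Chicago,Safari\n"
    "0.815336,0.913623,1.800160,Boston,Safari\n"
    "0.821214,-0.824839,0.483724,New York,Safari\n"
)
df = pd.read_csv(io.StringIO(csv_text), header=None)

# Straight after read_csv, no column has the 'category' dtype.
before = df.select_dtypes(include=["category"]).columns.tolist()
print(before)      # []

# Excluding only the numeric dtypes leaves the text columns.
text_cols = df.select_dtypes(exclude=["number"]).columns.tolist()
print(text_cols)   # [3, 4]

# After an explicit conversion, include=["category"] finds them.
df = df.astype({col: "category" for col in text_cols})
after = df.select_dtypes(include=["category"]).columns.tolist()
print(after)       # [3, 4]
```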
For posterity: the canonical method to select columns by dtype is .select_dtypes. You can specify an actual NumPy dtype or something convertible to one, or 'category', which is not a NumPy dtype.
# Assumes: import numpy as np; import pandas as pd
In [1]: df = pd.DataFrame({'A': pd.Series(range(3)).astype('category'), 'B': range(3), 'C': list('abc'), 'D': np.random.randn(3)})
In [2]: df
Out[2]:
A B C D
0 0 0 a 0.141296
1 1 1 b 0.939059
2 2 2 c -2.305019
In [3]: df.select_dtypes(include=['category'])
Out[3]:
A
0 0
1 1
2 2
In [4]: df.select_dtypes(include=['object'])
Out[4]:
C
0 a
1 b
2 c
In [5]: df.select_dtypes(include=['object']).dtypes
Out[5]:
C object
dtype: object
In [6]: df.select_dtypes(include=['category','int']).dtypes
Out[6]:
A category
B int64
dtype: object
In [7]: df.select_dtypes(include=['category','int','float']).dtypes
Out[7]:
A category
B int64
D float64
dtype: object