how to check the dtype of a column in python pandas

1. Comparing types directly via `==` (accepted answer).

Despite the fact that this is accepted answer and has most upvotes count, I think this method should not be used at all. Because in fact this approach is discouraged in python as mentioned several times here.
But if one still want to use it - should be aware of some pandas-specific dtypes like pd.CategoricalDType, pd.PeriodDtype, or pd.IntervalDtype. Here one have to use extra type( ) in order to recognize dtype correctly:

s = pd.Series([pd.Period('2002-03','D'), pd.Period('2012-02-01', 'D')])
s
s.dtype == pd.PeriodDtype   # Not working
type(s.dtype) == pd.PeriodDtype # working 

>>> 0    2002-03-01
>>> 1    2012-02-01
>>> dtype: period[D]
>>> False
>>> True

Another caveat here is that type should be pointed out precisely:

s = pd.Series([1,2])
s
s.dtype == np.int64 # Working
s.dtype == np.int32 # Not working

>>> 0    1
>>> 1    2
>>> dtype: int64
>>> True
>>> False

2. `isinstance()` approach.

This method has not been mentioned in answers so far.

So if direct comparing of types is not a good idea - lets try built-in python function for this purpose, namely - isinstance().
It fails just in the beginning, because assumes that we have some objects, but pd.Series or pd.DataFrame may be used as just empty containers with predefined dtype but no objects in it:

s = pd.Series([], dtype=bool)
s

>>> Series([], dtype: bool)

But if one somehow overcome this issue, and wants to access each object, for example, in the first row and checks its dtype like something like that:

df = pd.DataFrame({'int': [12, 2], 'dt': [pd.Timestamp('2013-01-02'), pd.Timestamp('2016-10-20')]},
                  index = ['A', 'B'])
for col in df.columns:
    df[col].dtype, 'is_int64 = %s' % isinstance(df.loc['A', col], np.int64)

>>> (dtype('int64'), 'is_int64 = True')
>>> (dtype('<M8[ns]'), 'is_int64 = False')

It will be misleading in the case of mixed type of data in single column:

df2 = pd.DataFrame({'data': [12, pd.Timestamp('2013-01-02')]},
                  index = ['A', 'B'])
for col in df2.columns:
    df2[col].dtype, 'is_int64 = %s' % isinstance(df2.loc['A', col], np.int64)

>>> (dtype('O'), 'is_int64 = False')

And last but not least - this method cannot directly recognize Category dtype. As stated in docs:

Returning a single item from categorical data will also return the value, not a categorical of length “1”.

df['int'] = df['int'].astype('category')
for col in df.columns:
    df[col].dtype, 'is_int64 = %s' % isinstance(df.loc['A', col], np.int64)

>>> (CategoricalDtype(categories=[2, 12], ordered=False), 'is_int64 = True')
>>> (dtype('<M8[ns]'), 'is_int64 = False')

So this method is also almost inapplicable.

3. `df.dtype.kind` approach.

This method yet may work with empty pd.Series or pd.DataFrames but has another problems.

First - it is unable to differ some dtypes:

df = pd.DataFrame({'prd'  :[pd.Period('2002-03','D'), pd.Period('2012-02-01', 'D')],
                   'str'  :['s1', 's2'],
                   'cat'  :[1, -1]})
df['cat'] = df['cat'].astype('category')
for col in df:
    # kind will define all columns as 'Object'
    print (df[col].dtype, df[col].dtype.kind)

>>> period[D] O
>>> object O
>>> category O

Second, what is actually still unclear for me, it even returns on some dtypes None.

4. `df.select_dtypes` approach.

This is almost what we want. This method designed inside pandas so it handles most corner cases mentioned earlier - empty DataFrames, differs numpy or pandas-specific dtypes well. It works well with single dtype like .select_dtypes('bool'). It may be used even for selecting groups of columns based on dtype:

test = pd.DataFrame({'bool' :[False, True], 'int64':[-1,2], 'int32':[-1,2],'float': [-2.5, 3.4],
                     'compl':np.array([1-1j, 5]),
                     'dt'   :[pd.Timestamp('2013-01-02'), pd.Timestamp('2016-10-20')],
                     'td'   :[pd.Timestamp('2012-03-02')- pd.Timestamp('2016-10-20'),
                              pd.Timestamp('2010-07-12')- pd.Timestamp('2000-11-10')],
                     'prd'  :[pd.Period('2002-03','D'), pd.Period('2012-02-01', 'D')],
                     'intrv':pd.arrays.IntervalArray([pd.Interval(0, 0.1), pd.Interval(1, 5)]),
                     'str'  :['s1', 's2'],
                     'cat'  :[1, -1],
                     'obj'  :[[1,2,3], [5435,35,-52,14]]
                    })
test['int32'] = test['int32'].astype(np.int32)
test['cat'] = test['cat'].astype('category')

Like so, as stated in the docs:

test.select_dtypes('number')

>>>     int64   int32   float   compl   td
>>> 0      -1      -1   -2.5    (1-1j)  -1693 days
>>> 1       2       2    3.4    (5+0j)   3531 days

On may think that here we see first unexpected (at used to be for me: question) results - TimeDelta is included into output DataFrame. But as answered in contrary it should be so, but one have to be aware of it. Note that bool dtype is skipped, that may be also undesired for someone, but it's due to bool and number are in different "subtrees" of numpy dtypes. In case with bool, we may use test.select_dtypes(['bool']) here.

Next restriction of this method is that for current version of pandas (0.24.2), this code: test.select_dtypes('period') will raise NotImplementedError.

And another thing is that it's unable to differ strings from other objects:

test.select_dtypes('object')

>>>     str     obj
>>> 0    s1     [1, 2, 3]
>>> 1    s2     [5435, 35, -52, 14]

But this is, first - already mentioned in the docs. And second - is not the problem of this method, rather the way strings are stored in DataFrame. But anyway this case have to have some post processing.

5. `df.api.types.is_XXX_dtype` approach.

This one is intended to be most robust and native way to achieve dtype recognition (path of the module where functions resides says by itself) as i suppose. And it works almost perfectly, but still have at least one caveat and still have to somehow distinguish string columns.

Besides, this may be subjective, but this approach also has more 'human-understandable' number dtypes group processing comparing with .select_dtypes('number'):

for col in test.columns:
    if pd.api.types.is_numeric_dtype(test[col]):
        print (test[col].dtype)

>>> bool
>>> int64
>>> int32
>>> float64
>>> complex128

No timedelta and bool is included. Perfect.

My pipeline exploits exactly this functionality at this moment of time, plus a bit of post hand processing.

Output.

Hope I was able to argument the main point - that all discussed approaches may be used, but only pd.DataFrame.select_dtypes() and pd.api.types.is_XXX_dtype should be really considered as the applicable ones.

If you want to mark the type of a dataframe column as a string, you can do:

df['A'].dtype.kind

An example:

In [8]: df = pd.DataFrame([[1,'a',1.2],[2,'b',2.3]])
In [9]: df[0].dtype.kind, df[1].dtype.kind, df[2].dtype.kind
Out[9]: ('i', 'O', 'f')

The answer for your code:

for y in agg.columns:
    if(agg[y].dtype.kind == 'f' or agg[y].dtype.kind == 'i'):
          treat_numeric(agg[y])
    else:
          treat_str(agg[y])

Note:

uint and UInt are of kind u, not kind i.
Consider the dtype introspection utility functions, e.g. pd.api.types.is_integer_dtype.

To pretty print the column data types

To check the data types after, for example, an import from a file

def printColumnInfo(df):
    template="%-8s %-30s %s"
    print(template % ("Type", "Column Name", "Example Value"))
    print("-"*53)
    for c in df.columns:
        print(template % (df[c].dtype, c, df[c].iloc[1]) )

Illustrative output:

Type     Column Name                    Example Value
-----------------------------------------------------
int64    Age                            49
object   Attrition                      No
object   BusinessTravel                 Travel_Frequently
float64  DailyRate                      279.0

Related questions
                            
                                How to create an object for a Django model with a many to many field?
                            
                                Select 50 items from list at random to write to file
                            
                                "Inner exception" (with traceback) in Python?
                            
                                Remove characters except digits from string using Python?
                            
                                How to remove the left part of a string?
                            
                                python numpy ValueError: operands could not be broadcast together with shapes
                            
                                Peak-finding algorithm for Python/SciPy
                            
                                Python3 integer division [duplicate]
                            
                                Converting list to *args when calling function [duplicate]
                            
                                what is the difference between 'transform' and 'fit_transform' in sklearn
                            
                                Accessing items in an collections.OrderedDict by index
                            
                                Re-raise exception with a different type and message, preserving existing information
                            
                                Convert a 1D array to a 2D array in numpy
                            
                                How do I revert to a previous package in Anaconda?
                            
                                Replace values in list using Python [duplicate]
                            
                                Python Matplotlib figure title overlaps axes label when using twiny
                            
                                List all base classes in a hierarchy of given class?
                            
                                Check if key exists and iterate the JSON array using Python
                            
                                Is there a "do ... until" in Python? [duplicate]
                            
                                How do you create different variable names while in a loop? [duplicate]

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

how to check the dtype of a column in python pandas

Tags:

python

pandas

People also ask

1. Comparing types directly via `==` (accepted answer).

2. `isinstance()` approach.

3. `df.dtype.kind` approach.

4. `df.select_dtypes` approach.

5. `df.api.types.is_XXX_dtype` approach.

Output.

To pretty print the column data types

Recent Activity

Donate For Us

how to check the dtype of a column in python pandas

Tags:

python

pandas

People also ask

1. Comparing types directly via == (accepted answer).

2. isinstance() approach.

3. df.dtype.kind approach.

4. df.select_dtypes approach.

5. df.api.types.is_XXX_dtype approach.

Output.

To pretty print the column data types

Related questions

Recent Activity

Donate For Us

1. Comparing types directly via `==` (accepted answer).

2. `isinstance()` approach.

3. `df.dtype.kind` approach.

4. `df.select_dtypes` approach.

5. `df.api.types.is_XXX_dtype` approach.