Use DataFrame.dtypes to get the data types of the columns in a DataFrame. In Python's pandas module, the DataFrame class provides the dtypes attribute, which returns a Series object containing the data type of each column.
A column in a DataFrame can only have one data type. The data type of a single DataFrame column can be checked using dtype. Make conscious decisions about how to manage missing data. A DataFrame can be saved to a CSV file using the to_csv function.
To check the data type of a Series, pandas provides a dedicated attribute, dtype. It returns a dtype object that represents the data type of the given Series.
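As a minimal sketch of the two attributes described above (the sample DataFrame and its column names are made up for illustration), the following shows dtypes on a whole DataFrame and dtype on a single column:

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
df = pd.DataFrame({
    "name": ["Ann", "Bob"],
    "age": [31, 45],
    "score": [88.5, 92.0],
})

# DataFrame.dtypes returns a Series indexed by column name.
all_dtypes = df.dtypes
print(all_dtypes)

# Series.dtype returns the dtype object of one column.
age_dtype = df["age"].dtype
score_dtype = df["score"].dtype
print(age_dtype, score_dtype)
```

The dtype reported for the text column depends on the pandas version, so the sketch does not rely on it.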
You can access the data type of a column with dtype:
import numpy as np

for y in agg.columns:
    # Dispatch on the exact dtype of each column
    if agg[y].dtype == np.float64 or agg[y].dtype == np.int64:
        treat_numeric(agg[y])
    else:
        treat_str(agg[y])
In pandas 0.20.2 you can do:
from pandas.api.types import is_string_dtype
from pandas.api.types import is_numeric_dtype
is_string_dtype(df['A'])
>>>> True
is_numeric_dtype(df['B'])
>>>> True
So your code becomes:
for y in agg.columns:
    if is_string_dtype(agg[y]):
        treat_str(agg[y])
    elif is_numeric_dtype(agg[y]):
        treat_numeric(agg[y])
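To try the loop above end to end, here is a small self-contained sketch. The handlers treat_numeric and treat_str are hypothetical stand-ins (the original does not define them), and the agg data is invented:

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype, is_string_dtype


def treat_numeric(col):
    # Hypothetical handler: report the column mean.
    return ("numeric", col.mean())


def treat_str(col):
    # Hypothetical handler: report the length of the longest string.
    return ("string", col.str.len().max())


# Invented sample data.
agg = pd.DataFrame({"A": ["x", "yy"], "B": [1, 2], "C": [0.5, 1.5]})

results = {}
for name in agg.columns:
    col = agg[name]
    if is_numeric_dtype(col):
        results[name] = treat_numeric(col)
    elif is_string_dtype(col):
        results[name] = treat_str(col)

print(results)
```

Columns that match neither check are simply skipped here; a real pipeline would probably want an explicit else branch.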
I know this is a bit of an old thread, but with pandas 0.19.2 you can do:
df.select_dtypes(include=['float64']).apply(your_function)
df.select_dtypes(exclude=['string','object']).apply(your_other_function)
http://pandas.pydata.org/pandas-docs/version/0.19.2/generated/pandas.DataFrame.select_dtypes.html
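A short sketch of select_dtypes with both include and exclude, assuming an invented three-column DataFrame. Using np.number as the selector is one option that avoids naming every numeric dtype:

```python
import numpy as np
import pandas as pd

# Invented sample data.
df = pd.DataFrame({
    "f": [1.5, 2.5],
    "i": [1, 2],
    "s": ["a", "b"],
})

float_cols = df.select_dtypes(include=["float64"])
numeric_cols = df.select_dtypes(include=[np.number])
non_numeric_cols = df.select_dtypes(exclude=[np.number])

# apply() runs the function once per selected column.
doubled = float_cols.apply(lambda col: col * 2)
print(doubled)
```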
The title of the question is general, but the author's use case stated in the body of the question is specific, so any of the other answers may be used for it.
But in order to fully answer the title question, it should be clarified that all of the approaches may fail in some cases and require some rework. I reviewed all of them (and some additional ones) in order of increasing reliability (in my opinion):
== (accepted answer). Although this is the accepted answer and has the most upvotes, I think this method should not be used at all, because this kind of direct type comparison is discouraged in Python, as mentioned several times here.
But if one still wants to use it, one should be aware of some pandas-specific dtypes like pd.CategoricalDtype, pd.PeriodDtype, or pd.IntervalDtype. Here one has to use an extra type() call in order to recognize the dtype correctly:
s = pd.Series([pd.Period('2002-03','D'), pd.Period('2012-02-01', 'D')])
s
s.dtype == pd.PeriodDtype # Not working
type(s.dtype) == pd.PeriodDtype # working
>>> 0 2002-03-01
>>> 1 2012-02-01
>>> dtype: period[D]
>>> False
>>> True
Another caveat here is that the type has to be specified precisely:
s = pd.Series([1,2])
s
s.dtype == np.int64 # Working
s.dtype == np.int32 # Not working
>>> 0 1
>>> 1 2
>>> dtype: int64
>>> True
>>> False
isinstance() approach. This method has not been mentioned in the answers so far.
So if comparing types directly is not a good idea, let's try the built-in Python function for this purpose, namely isinstance().
It fails right at the start, because it assumes that we have some objects, but pd.Series or pd.DataFrame may be used as just empty containers with a predefined dtype but no objects in them:
s = pd.Series([], dtype=bool)
s
>>> Series([], dtype: bool)
But if one somehow overcomes this issue and wants to access each object, for example in the first row, and check its type like this:
df = pd.DataFrame({'int': [12, 2], 'dt': [pd.Timestamp('2013-01-02'), pd.Timestamp('2016-10-20')]},
                  index = ['A', 'B'])
for col in df.columns:
    df[col].dtype, 'is_int64 = %s' % isinstance(df.loc['A', col], np.int64)
>>> (dtype('int64'), 'is_int64 = True')
>>> (dtype('<M8[ns]'), 'is_int64 = False')
It will be misleading when a single column holds data of mixed types:
df2 = pd.DataFrame({'data': [12, pd.Timestamp('2013-01-02')]},
                   index = ['A', 'B'])
for col in df2.columns:
    df2[col].dtype, 'is_int64 = %s' % isinstance(df2.loc['A', col], np.int64)
>>> (dtype('O'), 'is_int64 = False')
And last but not least, this method cannot directly recognize the Category dtype. As stated in the docs:
Returning a single item from categorical data will also return the value, not a categorical of length “1”.
df['int'] = df['int'].astype('category')
for col in df.columns:
    df[col].dtype, 'is_int64 = %s' % isinstance(df.loc['A', col], np.int64)
>>> (CategoricalDtype(categories=[2, 12], ordered=False), 'is_int64 = True')
>>> (dtype('<M8[ns]'), 'is_int64 = False')
So this method is also almost inapplicable.
df.dtype.kind approach. This method may still work with an empty pd.Series or pd.DataFrame, but it has other problems.
First, it is unable to distinguish some dtypes:
df = pd.DataFrame({'prd' :[pd.Period('2002-03','D'), pd.Period('2012-02-01', 'D')],
                   'str' :['s1', 's2'],
                   'cat' :[1, -1]})
df['cat'] = df['cat'].astype('category')
for col in df:
    # kind will define all columns as 'Object'
    print (df[col].dtype, df[col].dtype.kind)
>>> period[D] O
>>> object O
>>> category O
Second, and this is still unclear to me, it even returns None for some dtypes.
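One possible workaround for the first problem, sketched here as an assumption rather than as part of the original answer, is to test the dtype object itself with isinstance before falling back to kind. The helper name describe and its labels are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({
    "prd": [pd.Period("2002-03", "D"), pd.Period("2012-02-01", "D")],
    "str": ["s1", "s2"],
    "cat": [1, -1],
})
df["cat"] = df["cat"].astype("category")


def describe(dtype):
    # Hypothetical helper: check pandas extension dtypes first,
    # then fall back to the one-letter kind code.
    if isinstance(dtype, pd.PeriodDtype):
        return "period"
    if isinstance(dtype, pd.CategoricalDtype):
        return "category"
    return "kind=" + dtype.kind


labels = {name: describe(df[name].dtype) for name in df.columns}
print(labels)
```

This checks the dtype instance rather than the values, so it also works on empty columns.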
df.select_dtypes approach. This is almost what we want. This method is designed inside pandas, so it handles most of the corner cases mentioned earlier: it works on empty DataFrames and distinguishes numpy and pandas-specific dtypes well. It works well with a single dtype like .select_dtypes('bool'). It may even be used for selecting groups of columns based on dtype:
test = pd.DataFrame({'bool' :[False, True], 'int64':[-1,2], 'int32':[-1,2],'float': [-2.5, 3.4],
                     'compl':np.array([1-1j, 5]),
                     'dt'   :[pd.Timestamp('2013-01-02'), pd.Timestamp('2016-10-20')],
                     'td'   :[pd.Timestamp('2012-03-02')- pd.Timestamp('2016-10-20'),
                              pd.Timestamp('2010-07-12')- pd.Timestamp('2000-11-10')],
                     'prd'  :[pd.Period('2002-03','D'), pd.Period('2012-02-01', 'D')],
                     'intrv':pd.arrays.IntervalArray([pd.Interval(0, 0.1), pd.Interval(1, 5)]),
                     'str'  :['s1', 's2'],
                     'cat'  :[1, -1],
                     'obj'  :[[1,2,3], [5435,35,-52,14]]
                     })
test['int32'] = test['int32'].astype(np.int32)
test['cat'] = test['cat'].astype('category')
Like so, as stated in the docs:
test.select_dtypes('number')
>>> int64 int32 float compl td
>>> 0 -1 -1 -2.5 (1-1j) -1693 days
>>> 1 2 2 3.4 (5+0j) 3531 days
One may think that here we see the first unexpected result (it used to be a question for me): TimeDelta is included in the output DataFrame. But, as was answered there, this is how it should be; one just has to be aware of it. Note that the bool dtype is skipped, which may also be undesired for some, but this is because bool and number are in different "subtrees" of the numpy dtype hierarchy. For bool, we may use test.select_dtypes(['bool']) here.
The next restriction of this method is that, for the current version of pandas (0.24.2), this code: test.select_dtypes('period') will raise NotImplementedError.
Another thing is that it is unable to distinguish strings from other objects:
test.select_dtypes('object')
>>> str obj
>>> 0 s1 [1, 2, 3]
>>> 1 s2 [5435, 35, -52, 14]
But this is, first, already mentioned in the docs. And second, it is not a problem of this method, but rather of the way strings are stored in a DataFrame. In any case, this situation requires some post-processing.
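One way such post-processing might look, offered as an assumption and not as the author's method, is to run pandas.api.types.infer_dtype over each object column and keep only those inferred as "string". The columns are built with an explicit object dtype so the sketch does not depend on how a given pandas version stores strings by default:

```python
import pandas as pd
from pandas.api.types import infer_dtype

# Explicit object dtype for both columns, mirroring the 'str' and 'obj'
# columns of the test frame above.
test = pd.DataFrame({
    "str": pd.Series(["s1", "s2"], dtype=object),
    "obj": pd.Series([[1, 2, 3], [5435, 35, -52, 14]], dtype=object),
})

object_cols = test.select_dtypes(include=["object"])

# infer_dtype inspects the values, so it can tell strings from lists.
string_cols = [
    name for name in object_cols.columns
    if infer_dtype(object_cols[name]) == "string"
]
print(string_cols)
```

Because infer_dtype looks at the actual values, it cannot classify an empty column as string, which is the same limitation noted for isinstance() above.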
pd.api.types.is_XXX_dtype approach. I suppose this one is intended to be the most robust and native way to achieve dtype recognition (the path of the module where the functions reside speaks for itself). It works almost perfectly, but it still has at least one caveat, and one still has to somehow distinguish string columns.
Besides, and this may be subjective, this approach also handles the group of 'number' dtypes in a more human-understandable way compared with .select_dtypes('number'):
for col in test.columns:
    if pd.api.types.is_numeric_dtype(test[col]):
        print (test[col].dtype)
>>> bool
>>> int64
>>> int32
>>> float64
>>> complex128
timedelta is not included, and bool is. Perfect.
My pipeline relies on exactly this functionality at the moment, plus a bit of manual post-processing.
I hope I was able to argue the main point: all of the discussed approaches may be used, but only pd.DataFrame.select_dtypes() and pd.api.types.is_XXX_dtype should really be considered applicable.
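To illustrate how the pd.api.types helpers could be combined into a single dispatcher, here is a hedged sketch. The function classify and its labels are hypothetical, and the bool check is placed before the numeric check deliberately, because is_numeric_dtype also accepts bool columns:

```python
import pandas as pd
from pandas.api import types as ptypes


def classify(col):
    # Hypothetical helper returning a coarse, illustrative label.
    if ptypes.is_bool_dtype(col):
        return "bool"
    if ptypes.is_numeric_dtype(col):
        return "numeric"
    if ptypes.is_datetime64_any_dtype(col):
        return "datetime"
    if ptypes.is_timedelta64_dtype(col):
        return "timedelta"
    if isinstance(col.dtype, pd.CategoricalDtype):
        return "category"
    return "other"


# A reduced version of the test frame used above.
test = pd.DataFrame({
    "bool": [False, True],
    "int64": [-1, 2],
    "float": [-2.5, 3.4],
    "dt": [pd.Timestamp("2013-01-02"), pd.Timestamp("2016-10-20")],
    "td": [pd.Timedelta(days=1), pd.Timedelta(days=2)],
    "cat": pd.Series([1, -1]).astype("category"),
    "str": ["s1", "s2"],
})

labels = {name: classify(test[name]) for name in test.columns}
print(labels)
```

String columns end up in "other" here and still need the kind of post-processing discussed for select_dtypes.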
If you want to get the type of a DataFrame column as a short string code, you can use:
df['A'].dtype.kind
An example:
In [8]: df = pd.DataFrame([[1,'a',1.2],[2,'b',2.3]])
In [9]: df[0].dtype.kind, df[1].dtype.kind, df[2].dtype.kind
Out[9]: ('i', 'O', 'f')
The answer for your code:
for y in agg.columns:
    if agg[y].dtype.kind == 'f' or agg[y].dtype.kind == 'i':
        treat_numeric(agg[y])
    else:
        treat_str(agg[y])
Note: uint and UInt are of kind u, not kind i. pd.api.types.is_integer_dtype accepts both signed and unsigned integer dtypes.
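The difference between the kind codes and is_integer_dtype can be seen in a small sketch (the column names are invented):

```python
import pandas as pd
from pandas.api.types import is_integer_dtype

df = pd.DataFrame({
    "signed": pd.Series([1, 2], dtype="int64"),
    "unsigned": pd.Series([1, 2], dtype="uint8"),
    "nullable": pd.Series([1, None], dtype="UInt8"),
})

# kind distinguishes signed ('i') from unsigned ('u') integers.
kinds = {name: df[name].dtype.kind for name in df.columns}

# is_integer_dtype treats all three columns as integer.
is_int = {name: is_integer_dtype(df[name]) for name in df.columns}

print(kinds)
print(is_int)
```

So a check such as kind == 'i' silently sends unsigned columns to the non-numeric branch, whereas kind in 'iuf' or is_integer_dtype would not.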
To check the data types after, for example, an import from a file:
def printColumnInfo(df):
    template="%-8s %-30s %s"
    print(template % ("Type", "Column Name", "Example Value"))
    print("-"*53)
    for c in df.columns:
        # The second row (iloc[1]) is used as the example value
        print(template % (df[c].dtype, c, df[c].iloc[1]) )
Illustrative output:
Type     Column Name                    Example Value
-----------------------------------------------------
int64    Age                            49
object   Attrition                      No
object   BusinessTravel                 Travel_Frequently
float64  DailyRate                      279.0