
Asserting column(s) data type in Pandas

I'm trying to find a better way to assert the data types of specific columns in a given pandas DataFrame.

For example:

import pandas as pd
t = pd.DataFrame({'a':[1,2,3], 'b':[2,6,0.75], 'c':['foo','bar','beer']})

I would like to assert that specific columns in the data frame are numeric. Here's what I have:

numeric_cols = ['a', 'b']  # These will be given
assert all(x in ['int64','float'] for x in [t[y].dtype for y in numeric_cols])

This last assert line doesn't feel very pythonic. Maybe it is, and I'm just cramming it all into one hard-to-read line. Is there a better way? I would like to write something like:

assert t[numeric_cols].dtype.isnumeric()

I can't seem to find something like that though.

nfmcclure asked Feb 19 '15 00:02




2 Answers

You could use ptypes.is_numeric_dtype to identify numeric columns, ptypes.is_string_dtype to identify string-like columns, and ptypes.is_datetime64_any_dtype to identify datetime64 columns:

import pandas as pd
import pandas.api.types as ptypes

t = pd.DataFrame({'a':[1,2,3], 'b':[2,6,0.75], 'c':['foo','bar','beer'],
              'd':pd.date_range('2000-1-1', periods=3)})
cols_to_check = ['a', 'b']

assert all(ptypes.is_numeric_dtype(t[col]) for col in cols_to_check)
# True
assert ptypes.is_string_dtype(t['c'])
# True
assert ptypes.is_datetime64_any_dtype(t['d'])
# True

The pandas.api.types module (which I aliased to ptypes) has both an is_datetime64_any_dtype and an is_datetime64_dtype function. They differ in how they treat timezone-aware array-likes: is_datetime64_any_dtype accepts them, while is_datetime64_dtype does not:

In [239]: ptypes.is_datetime64_any_dtype(pd.DatetimeIndex([1, 2, 3], tz="US/Eastern"))
Out[239]: True

In [240]: ptypes.is_datetime64_dtype(pd.DatetimeIndex([1, 2, 3], tz="US/Eastern"))
Out[240]: False
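
If the assertion fails, a bare assert all(...) does not say which column was the problem. A minimal sketch of a wrapper that names the offending columns is below; assert_numeric_columns is a hypothetical helper name of my own, not part of pandas, and it relies only on ptypes.is_numeric_dtype as used above:

```python
import pandas as pd
import pandas.api.types as ptypes


def assert_numeric_columns(df, cols):
    """Raise AssertionError naming every column in `cols` that is not numeric.

    Hypothetical helper for illustration; not a pandas function.
    """
    # Collect each failing column together with its actual dtype.
    bad = {col: str(df[col].dtype) for col in cols
           if not ptypes.is_numeric_dtype(df[col])}
    assert not bad, f"non-numeric columns: {bad}"


t = pd.DataFrame({'a': [1, 2, 3], 'b': [2, 6, 0.75], 'c': ['foo', 'bar', 'beer']})
assert_numeric_columns(t, ['a', 'b'])  # passes silently
```

Calling assert_numeric_columns(t, ['a', 'c']) would raise an AssertionError whose message mentions column 'c' and its dtype.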

unutbu answered Sep 21 '22 12:09


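You can compare each column's dtype against an explicit list of allowed dtypes (t and numeric_cols are as defined in the question):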

import numpy as np
numeric_dtypes = [np.dtype('int64'), np.dtype('float64')]
# or whatever types you want

assert t[numeric_cols].apply(lambda c: c.dtype).isin(numeric_dtypes).all()
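
One thing to keep in mind with an explicit list is that it is strict: a column stored as int32 is numeric but is not in the list above. The sketch below illustrates this on a made-up frame (df, x and y are hypothetical names for illustration) and compares the result with pandas.api.types.is_numeric_dtype:

```python
import numpy as np
import pandas as pd
import pandas.api.types as ptypes

numeric_dtypes = [np.dtype('int64'), np.dtype('float64')]

# Hypothetical example frame: 'x' is int32, 'y' is float64.
df = pd.DataFrame({'x': np.array([1, 2, 3], dtype='int32'),
                   'y': [0.5, 1.5, 2.5]})

# Exact membership test against the allowed dtypes.
in_list = [df[col].dtype in numeric_dtypes for col in df.columns]

# Broader check: any numeric dtype counts.
is_numeric = [ptypes.is_numeric_dtype(df[col]) for col in df.columns]

print(in_list)     # 'x' fails the exact-list check
print(is_numeric)  # both columns are numeric
```

Whether the strict or the broad behaviour is preferable depends on what you want the assertion to guarantee.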

ely answered Sep 22 '22 12:09