I'm trying to find a better way to assert the data types of specific columns in a given pandas DataFrame.
For example:
import pandas as pd
t = pd.DataFrame({'a':[1,2,3], 'b':[2,6,0.75], 'c':['foo','bar','beer']})
I would like to assert that specific columns in the data frame are numeric. Here's what I have:
numeric_cols = ['a', 'b'] # These will be given
assert all(t[y].dtype in ['int64', 'float64'] for y in numeric_cols)
This last assert line doesn't feel very Pythonic. Maybe it is and I'm just cramming it all into one hard-to-read line. Is there a better way? I would like to write something like:
assert t[numeric_cols].dtype.isnumeric()
I can't seem to find something like that though.
To check data types in a pandas DataFrame, use the dtypes attribute. It returns a Series whose index holds the DataFrame's column names and whose values are the corresponding data types.
Each column in a DataFrame has exactly one data type, which you can check with the column's dtype attribute.
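As a quick illustration of inspecting column types, here is a minimal sketch using the example frame from the question. The variable names col_types and a_type are only for illustration.

```python
import pandas as pd

t = pd.DataFrame({'a': [1, 2, 3], 'b': [2, 6, 0.75], 'c': ['foo', 'bar', 'beer']})

# DataFrame.dtypes is a Series: the index holds column names,
# the values hold each column's dtype.
col_types = t.dtypes
print(col_types)

# A single column (a Series) exposes exactly one dtype.
a_type = t['a'].dtype
print(a_type)
```

The exact integer width printed for column a may depend on the platform, so it is usually safer to test the kind of dtype than to compare against one specific name.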
You could use ptypes.is_numeric_dtype to identify numeric columns, ptypes.is_string_dtype to identify string-like columns, and ptypes.is_datetime64_any_dtype to identify datetime64 columns:
import pandas as pd
import pandas.api.types as ptypes
t = pd.DataFrame({'a':[1,2,3], 'b':[2,6,0.75], 'c':['foo','bar','beer'],
'd':pd.date_range('2000-1-1', periods=3)})
cols_to_check = ['a', 'b']
assert all(ptypes.is_numeric_dtype(t[col]) for col in cols_to_check)
# True
assert ptypes.is_string_dtype(t['c'])
# True
assert ptypes.is_datetime64_any_dtype(t['d'])
# True
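If a bare assert is too terse when it fails, the same check can be wrapped in a small helper that names the offending columns. This is only a sketch: assert_numeric is a hypothetical helper, not part of pandas. Note also that is_numeric_dtype treats boolean columns as numeric, which may or may not be what you want.

```python
import pandas as pd
import pandas.api.types as ptypes

def assert_numeric(df, cols):
    """Hypothetical helper: fail with the names of any non-numeric columns."""
    bad = [col for col in cols if not ptypes.is_numeric_dtype(df[col])]
    assert not bad, f"non-numeric columns: {bad}"

t = pd.DataFrame({'a': [1, 2, 3], 'b': [2, 6, 0.75], 'c': ['foo', 'bar', 'beer']})

# Passes silently because both columns are numeric.
assert_numeric(t, ['a', 'b'])
```

Calling assert_numeric(t, ['a', 'c']) would raise an AssertionError whose message lists column c.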
The pandas.api.types module (which I aliased to ptypes) has both an is_datetime64_any_dtype and an is_datetime64_dtype function. The difference is in how they treat timezone-aware array-likes:
In [239]: ptypes.is_datetime64_any_dtype(pd.DatetimeIndex([1, 2, 3], tz="US/Eastern"))
Out[239]: True
In [240]: ptypes.is_datetime64_dtype(pd.DatetimeIndex([1, 2, 3], tz="US/Eastern"))
Out[240]: False
You can do this:
import numpy as np
numeric_dtypes = [np.dtype('int64'), np.dtype('float64')]
# or whatever types you want
assert t[numeric_cols].apply(lambda c: c.dtype).isin(numeric_dtypes).all()
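A related option, if you would rather not list the accepted dtypes by hand, is to let select_dtypes pick out the numeric columns and compare the result with what you expected. This is a sketch under the assumption that "any numeric dtype" is the intended check. The names kept and non_numeric are illustrative.

```python
import pandas as pd

t = pd.DataFrame({'a': [1, 2, 3], 'b': [2, 6, 0.75], 'c': ['foo', 'bar', 'beer']})
numeric_cols = ['a', 'b']

# select_dtypes(include='number') keeps only the columns with a numeric dtype.
kept = t[numeric_cols].select_dtypes(include='number').columns

# Any requested column that was dropped is not numeric.
non_numeric = [col for col in numeric_cols if col not in kept]
assert not non_numeric, f"non-numeric columns: {non_numeric}"
```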