Guided by this answer, I started building a pipeline for processing DataFrame columns based on their dtypes. But after getting some unexpected output and some debugging, I ended up with this test DataFrame and dtype checking:
# Creating test dataframe
import numpy as np
import pandas as pd

test = pd.DataFrame({'bool' : [False, True],
                     'int'  : [-1, 2],
                     'float': [-2.5, 3.4],
                     'compl': np.array([1 - 1j, 5]),
                     'dt'   : [pd.Timestamp('2013-01-02'), pd.Timestamp('2016-10-20')],
                     'td'   : [pd.Timestamp('2012-03-02') - pd.Timestamp('2016-10-20'),
                               pd.Timestamp('2010-07-12') - pd.Timestamp('2000-11-10')],
                     'prd'  : [pd.Period('2002-03', 'D'), pd.Period('2012-02-01', 'D')],
                     'intrv': pd.arrays.IntervalArray([pd.Interval(0, 0.1), pd.Interval(1, 5)]),
                     'str'  : ['s1', 's2'],
                     'cat'  : [1, -1],
                     'obj'  : [[1, 2, 3], [5435, 35, -52, 14]]
                     })
test['cat'] = test['cat'].astype('category')
test
test.dtypes
# Testing types
types = list(test.columns)
df_types = pd.DataFrame(np.zeros((len(types), len(types)), dtype=bool),
                        index=['is_' + el for el in types],
                        columns=types)
for col in test.columns:
    df_types.at['is_bool',  col] = pd.api.types.is_bool_dtype(test[col])
    df_types.at['is_int',   col] = pd.api.types.is_integer_dtype(test[col])
    df_types.at['is_float', col] = pd.api.types.is_float_dtype(test[col])
    df_types.at['is_compl', col] = pd.api.types.is_complex_dtype(test[col])
    df_types.at['is_dt',    col] = pd.api.types.is_datetime64_dtype(test[col])
    df_types.at['is_td',    col] = pd.api.types.is_timedelta64_dtype(test[col])
    df_types.at['is_prd',   col] = pd.api.types.is_period_dtype(test[col])
    df_types.at['is_intrv', col] = pd.api.types.is_interval_dtype(test[col])
    df_types.at['is_str',   col] = pd.api.types.is_string_dtype(test[col])
    df_types.at['is_cat',   col] = pd.api.types.is_categorical_dtype(test[col])
    df_types.at['is_obj',   col] = pd.api.types.is_object_dtype(test[col])
# Styling func
def coloring(df):
    clr_g = 'color : green'
    clr_r = 'color : red'
    # Green where the result matches the expected diagonal, red otherwise
    mask = ~np.logical_xor(df.values, np.eye(df.shape[0], dtype=bool))
    return pd.DataFrame(np.where(mask, clr_g, clr_r),
                        index=df.index,
                        columns=df.columns)

# OUTPUT colored
df_types.style.apply(coloring, axis=None)
OUTPUT:
bool                 bool
int                 int64
float             float64
compl          complex128
dt         datetime64[ns]
td        timedelta64[ns]
prd             period[D]
intrv   interval[float64]
str                object
cat              category
obj                object
Almost everything is good, but this test code raises two questions:

1. pd.api.types.is_string_dtype fires on the category dtype. Why is that? Should this be treated as expected behavior?
2. is_string_dtype and is_object_dtype fire on each other's columns. This is somewhat expected, because even in .dtypes both types are reported as object, but it would be good if someone could clarify it step by step.

P.S.: Bonus question - am I right in thinking that pandas has internal tests that must pass when building a new release (like df_types from the test code, but recording info about errors rather than coloring in red)?
EDIT: pandas version 0.24.2.
This comes down to is_string_dtype being a fairly loose check; the implementation even has a TODO note to make it stricter, linking to Issue #15585.

The reason this check is not strict is that there is no dedicated string dtype in pandas. Instead, strings are simply stored with object dtype, which can hold anything, so a stricter check would likely introduce a performance overhead.
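To see that there is no dedicated string dtype, here is a quick illustration (my own example, not from the question's code): a column of strings and a column of arbitrary Python objects end up with the very same dtype.

```python
import pandas as pd

# Strings and arbitrary Python objects end up with the same dtype: object
s_str = pd.Series(['a', 'b'], dtype=object)
s_mixed = pd.Series([1, 'a', [2, 3]], dtype=object)

print(s_str.dtype)                   # object
print(s_mixed.dtype)                 # object
print(s_str.dtype == s_mixed.dtype)  # True
```

Since the dtype alone cannot distinguish these two cases, any dtype-level string check either stays loose or has to inspect the actual values, which is where the performance concern comes from.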
To answer your questions:
This is a result of CategoricalDtype.kind being set to 'O', which is one of the loose checks is_string_dtype performs. Given the TODO note, this could change in the future, so it's not behavior I'd rely on.
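The 'O' kind of the categorical dtype can be verified directly (illustrative snippet, not from the question's code):

```python
import pandas as pd

cat = pd.Series([1, -1]).astype('category')

# CategoricalDtype reports its kind as 'O' (object), which is exactly
# the attribute the loose kind-based check in is_string_dtype (as of
# pandas 0.24.2) looks at
print(cat.dtype)                   # category
print(cat.dtype.kind)              # O
print(pd.CategoricalDtype().kind)  # O
```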
Since strings are stored as object dtype, it makes sense for is_object_dtype to fire on strings, and I'd consider this behavior reliable, as the implementation will almost certainly not change in the immediate future. The reverse direction is due to the reliance on dtype.kind in is_string_dtype, which has the same caveats as described for categoricals above.
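Both directions can be seen side by side in a small check (my own illustration, using the same pd.api.types helpers as the question):

```python
import pandas as pd

s = pd.Series(['s1', 's2'], dtype=object)

# Strings live in an object-dtype column, so is_object_dtype fires on
# them; this is the reliable direction
print(s.dtype)                          # object
print(pd.api.types.is_object_dtype(s))  # True

# The column's dtype.kind is 'O', which is what the loose check in
# is_string_dtype (as of 0.24.2) keys on, hence the reverse firing
print(s.dtype.kind)                     # O
```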
Yes, pandas has a test suite that runs automatically on various CI services for every PR that's created. The test suite includes checks similar to what you're doing.
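The idea of turning a table like df_types into automated checks can be sketched as follows: instead of coloring mismatches red, a test suite simply asserts the expected result for each dtype (a minimal illustration in the spirit of such tests, not pandas' actual test code):

```python
import pandas as pd

# A tiny frame with known dtypes
test = pd.DataFrame({'int': [-1, 2], 'float': [-2.5, 3.4]})

# A test suite records failures via assertions rather than coloring
assert pd.api.types.is_integer_dtype(test['int'])
assert pd.api.types.is_float_dtype(test['float'])
assert not pd.api.types.is_integer_dtype(test['float'])
print('all dtype checks passed')
```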
One tangentially related note: there is a library called fletcher that uses Apache Arrow to implement a more native string type in a way that's compatible with pandas. It's still under development and probably doesn't yet support all the string operations that pandas does.