Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Caveats while checking dtype in pandas DataFrame

Guided by this answer I started to build up pipe for processing columns of dataframe based on its dtype. But after getting some unexpected output and some debugging i ended up with test dataframe and test dtype checking:

# Creating test dataframe
test = pd.DataFrame({'bool' :[False, True], 'int':[-1,2],'float': [-2.5, 3.4],
                     'compl':np.array([1-1j, 5]),
                     'dt'   :[pd.Timestamp('2013-01-02'), pd.Timestamp('2016-10-20')],
                     'td'   :[pd.Timestamp('2012-03-02')- pd.Timestamp('2016-10-20'),
                              pd.Timestamp('2010-07-12')- pd.Timestamp('2000-11-10')],
                     'prd'  :[pd.Period('2002-03','D'), pd.Period('2012-02-01', 'D')],
                     'intrv':pd.arrays.IntervalArray([pd.Interval(0, 0.1), pd.Interval(1, 5)]),
                     'str'  :['s1', 's2'],
                     'cat'  :[1, -1],
                     'obj'  :[[1,2,3], [5435,35,-52,14]]
                    })
test['cat'] = test['cat'].astype('category')
test
test.dtypes

# Testing types
types = list(test.columns)
df_types = pd.DataFrame(np.zeros((len(types),len(types)), dtype=bool),
                        index = ['is_'+el for el in types],
                        columns = types)
for col in test.columns:
    df_types.at['is_bool', col] = pd.api.types.is_bool_dtype(test[col])
    df_types.at['is_int' , col] = pd.api.types.is_integer_dtype(test[col])
    df_types.at['is_float',col] = pd.api.types.is_float_dtype(test[col])
    df_types.at['is_compl',col] = pd.api.types.is_complex_dtype(test[col])
    df_types.at['is_dt'  , col] = pd.api.types.is_datetime64_dtype(test[col])
    df_types.at['is_td'  , col] = pd.api.types.is_timedelta64_dtype(test[col])
    df_types.at['is_prd' , col] = pd.api.types.is_period_dtype(test[col])
    df_types.at['is_intrv',col] = pd.api.types.is_interval_dtype(test[col])
    df_types.at['is_str' , col] = pd.api.types.is_string_dtype(test[col])
    df_types.at['is_cat' , col] = pd.api.types.is_categorical_dtype(test[col])
    df_types.at['is_obj' , col] = pd.api.types.is_object_dtype(test[col])

# Styling func
def coloring(df):
    clr_g = 'color : green'
    clr_r = 'color : red'
    mask = ~np.logical_xor(df.values, np.eye(df.shape[0], dtype=bool))
    # OUTPUT
    return pd.DataFrame(np.where(mask, clr_g, clr_r),
                        index = df.index,
                        columns = df.columns)

# OUTPUT colored
df_types.style.apply(coloring, axis=None)

OUTPUT: enter image description here

bool                  bool
int                  int64
float              float64
compl           complex128
dt          datetime64[ns]
td         timedelta64[ns]
prd              period[D]
intrv    interval[float64]
str                 object
cat               category
obj                 object

enter image description here

Almost everything is good, but this test code produces two questions:

  1. The most strange here is that pd.api.types.is_string_dtype fires on category dtype. Why is that? Should it be treated as 'expected' behavior?
  2. Why is_string_dtype and is_object_dtype fires on each other? This is a bit expected, because even in .dtypes both types are noted as object, but it would be better if someone clarify it step by step.

P.s.: Bonus question - am i right when thinking that pandas has its internal tests that should be passed when building new release (like df_types from test code, but not with 'coloring in red' rather 'recording info about errors')?

EDIT: pandas version 0.24.2.

like image 266
BeforeFlight Avatar asked May 30 '19 15:05

BeforeFlight


1 Answers

This comes down to is_string_dtype being a fairly loose check, with the implementation even having a TODO note to make it more strict, linking to Issue #15585.

The reason this check is not strict is because there isn't a dedicated string dtype in pandas, and instead strings are just stored with object dtype, which could really store anything. As such, a more strict check would likely introduce a performance overhead.

To answer your questions:

  1. This is a result of CategoricalDtype.kind being set to 'O', which is one of the loose checks is_string_dtype does. This could probably change in the future given the TODO note, so it's not something I'd rely upon.

  2. Since strings are stored as object dtype it makes sense for is_object_dtype to fire on strings, and I'd consider this behavior to be reliable as the implementation will almost certainly not change in the immediate future. The reverse is true due to the reliance on dtype.kind in is_string_dtype, which has the same caveats as with categoricals described above.

  3. Yes, pandas has a test suite that will run automatically on various CI services for every PR that's created. The test suite includes checks similar to what you're doing.

One tangentially related note to add: there is a library called fletcher that uses Apache Arrow to implement a more native string type in a way that's compatible with pandas. It's still under development and probably doesn't currently have support for all the string operations that pandas does.

like image 90
root Avatar answered Oct 22 '22 00:10

root