Given a Pandas DataFrame df with mixed data types, df.select_dtypes can be very useful for keeping only the desired columns, or for dropping unwanted columns, for a particular application. However, there seems to be no way of addressing string dtypes with this method.
From the docs (emphasis mine):
ValueError is raised:
- If both of include and exclude are empty
- If include and exclude have overlapping elements
- If any kind of string dtype is passed in.
and
To select strings you must use the object dtype, but note that this will return all object dtype columns
Indeed, using df.select_dtypes(exclude=['str']) raises an error (although a TypeError rather than the ValueError the docs claim), and using df.select_dtypes(exclude=['object']) removes all object columns, not just string columns.
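Both behaviours can be seen with a minimal, self-contained sketch (the sample frame below is hypothetical, mirroring the one defined later in the question):

```python
import pandas as pd

# hypothetical sample frame with an int column and two object columns
df = pd.DataFrame({'int_col': [0, 1, 2],
                   'dict_col': [dict() for i in range(3)],
                   'str_col': list('abc')})

# exclude=['object'] drops every object column, dicts included
kept = df.select_dtypes(exclude=['object'])
print(kept.columns.tolist())  # ['int_col']

# exclude=['str'] raises on the pandas versions this answer discusses
try:
    df.select_dtypes(exclude=['str'])
except (TypeError, ValueError) as exc:
    print(type(exc).__name__)
```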
Given a df like this:

import pandas as pd

df = pd.DataFrame({'int_col': [0, 1, 2, 3, 4],
                   'dict_col': [dict() for i in range(5)],
                   'str_col': list('abcde')})
and considering that df.dtypes is object for both str_col and dict_col:

What is the best way of excluding or including all string columns?
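For context, a quick self-contained check confirms that both columns come back as the catch-all object dtype:

```python
import pandas as pd

df = pd.DataFrame({'int_col': [0, 1, 2, 3, 4],
                   'dict_col': [dict() for i in range(5)],
                   'str_col': list('abcde')})

# both the dict column and the str column report dtype object
print(df.dtypes)
```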
Option 1
Using df.applymap and type, and equating to str:
In [377]: (df.applymap(type) == str).all(0)
Out[377]:
dict_col False
int_col False
str_col True
dtype: bool
Each element in each column is converted to its type and then equated to str. After that, just call .all(0) (or, equivalently, .min(0)) to get a per-column verdict.
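Option 1 end to end, as a self-contained sketch. One caveat worth hedging: applymap was renamed to DataFrame.map in pandas 2.1, so the sketch picks whichever method the installed version offers (the helper name elementwise is my own):

```python
import pandas as pd

df = pd.DataFrame({'int_col': [0, 1, 2, 3, 4],
                   'dict_col': [dict() for i in range(5)],
                   'str_col': list('abcde')})

# applymap was renamed to DataFrame.map in pandas 2.1; use whichever exists
elementwise = df.map if hasattr(df, 'map') else df.applymap

# a column qualifies only if every element's type is exactly str
is_str_col = (elementwise(type) == str).all(0)
print(is_str_col)
```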
Option 2
Using df.applymap and isinstance:
In [342]: df.applymap(lambda x: isinstance(x, str)).all(0)
Out[342]:
dict_col False
int_col False
str_col True
To include these string columns, you can boolean-index on the columns:

idx = ... # one of the two methods above
df_new = df[df.columns[idx]]

Exclusion would be:

df_new = df[df.columns[~idx]]
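Putting it all together with Option 2, as a sketch (the same hasattr fallback as above covers the applymap-to-map rename in pandas 2.1; elementwise is a name of my choosing):

```python
import pandas as pd

df = pd.DataFrame({'int_col': [0, 1, 2, 3, 4],
                   'dict_col': [dict() for i in range(5)],
                   'str_col': list('abcde')})

elementwise = df.map if hasattr(df, 'map') else df.applymap
idx = elementwise(lambda x: isinstance(x, str)).all(0)

str_only = df[df.columns[idx]]   # inclusion: just the string columns
no_str = df[df.columns[~idx]]    # exclusion: everything else

print(str_only.columns.tolist())  # ['str_col']
print(no_str.columns.tolist())    # ['int_col', 'dict_col']
```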