I have a pandas DataFrame (df) that I need to search for a semicolon. I first tried
semicolon_check = df.to_string().__contains__(';')
but it is very slow, and with large DataFrames I run into a MemoryError. I then tried looping over the columns with the .str accessor, but not all columns contain strings, so whenever I reached a numeric column I got the error
AttributeError: Can only use .str accessor with string values, which use np.object_ dtype in pandas
so I ended up with this code:
for col in df.columns:
    if df[col].dtype == 'O':
        if df[col].str.contains(';').any():
            print('found in ' + col)
Is there an easier way to achieve this? The above works as expected, but it feels like too much effort for such an elementary task as searching for a value.
You can filter to just the string columns using select_dtypes, then call apply with a lambda that calls str.contains followed by any:
In [33]:
# create a test df
import numpy as np
import pandas as pd

df = pd.DataFrame({'int': np.arange(5), 'str': ['a', 'a;a', ';', 'b', 'c'], 'flt': np.random.randn(5), 'other str': list('abcde')})
df
Out[33]:
flt int other str str
0 1.020561 0 a a
1 0.022842 1 b a;a
2 -1.207961 2 c ;
3 1.092960 3 d b
4 -1.560300 4 e c
In [35]:
# filter on dtype
test = df.select_dtypes(include=['object']).apply(lambda x: x.str.contains(';').any())
test
Out[35]:
other str False
str True
dtype: bool
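If all you need is the single boolean the question originally asked for, you can collapse this Series one step further; a small sketch building on the test result above:
# reduce the per-column result to one scalar: True if any string column contains ';'
semicolon_check = test.any()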
We can then use the columns Index from the filtered df, together with the boolean mask above, to pick out the matching columns:
In [36]:
# we can use the above to mask the columns
str_cols = df.select_dtypes(include=['object']).columns
str_cols[test]
Out[36]:
Index(['str'], dtype='object')
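Putting the pieces together, here is a minimal, self-contained sketch of the same approach as a reusable helper. The function name find_cols_containing is my own, and I've added na=False so that missing values in object columns simply count as non-matches rather than propagating NaN:

import pandas as pd

def find_cols_containing(df, pattern):
    # keep only the object-dtype (string-like) columns
    str_df = df.select_dtypes(include=['object'])
    # per-column check; na=False treats NaN entries as "no match"
    mask = str_df.apply(lambda s: s.str.contains(pattern, na=False).any())
    # return the names of the columns that matched
    return str_df.columns[mask]

find_cols_containing(df, ';')  # Index(['str'], dtype='object') for the test df above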