I have a pandas DataFrame (df) that I need to search for a semicolon. I first tried
semicolon_check = df.to_string().__contains__(';')
but it is very slow, and with large DataFrames I run into a MemoryError. I then tried looping over the columns with the .str accessor, but not all columns contain strings, so whenever I reached a numeric column I got the error
AttributeError: Can only use .str accessor with string values, which use np.object_ dtype in pandas
so I ended up with this code:
for col in df.columns:
    if df[col].dtype == 'O':
        if df[col].str.contains(';').any():
            print('found in ' + col)
Is there an easier way to achieve this? The above works as expected, but it feels like too much effort for such an elementary task as searching for a value.
You can filter to just the string columns using select_dtypes, then call apply with a lambda that calls str.contains followed by any:
In [33]:
# create a test df
import numpy as np
import pandas as pd

df = pd.DataFrame({'int': np.arange(5), 'str': ['a', 'a;a', ';', 'b', 'c'], 'flt': np.random.randn(5), 'other str': list('abcde')})
df
Out[33]:
flt int other str str
0 1.020561 0 a a
1 0.022842 1 b a;a
2 -1.207961 2 c ;
3 1.092960 3 d b
4 -1.560300 4 e c
In [35]:
# filter on dtype
test = df.select_dtypes(include=['object']).apply(lambda x: x.str.contains(';').any())
test
Out[35]:
other str False
str True
dtype: bool
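If all you need is the single boolean the question originally asked for, you can collapse this Series one step further; a small sketch building on the test result above:
# reduce the per-column result to one scalar: True if any string column contains ';'
semicolon_check = test.any()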
We can then use the columns Index from the filtered df, together with the boolean mask above, to pick out the matching columns:
In [36]:
# we can use the above to mask the columns
str_cols = df.select_dtypes(include=['object']).columns
str_cols[test]
Out[36]:
Index(['str'], dtype='object')
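Putting the pieces together, here is a minimal, self-contained sketch of the same approach as a reusable helper. The function name find_cols_containing is my own, and I've added na=False so that missing values in object columns simply count as non-matches rather than propagating NaN:

import pandas as pd

def find_cols_containing(df, pattern):
    # keep only the object-dtype (string-like) columns
    str_df = df.select_dtypes(include=['object'])
    # per-column check; na=False treats NaN entries as "no match"
    mask = str_df.apply(lambda s: s.str.contains(pattern, na=False).any())
    # return the names of the columns that matched
    return str_df.columns[mask]

find_cols_containing(df, ';')  # Index(['str'], dtype='object') for the test df above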