To filter the rows whose column value contains a given string, something like data.sample_id.str.contains('hph') can be used
(answered before: check if string in pandas dataframe column is in list, or Check if string is in a pandas dataframe).
However, my lookup column contains empty cells. Therefore, str.contains() yields NaN values and I get a ValueError upon indexing:
ValueError: cannot index with vector containing NA / NaN values
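For reference, a minimal sketch that reproduces the error (the sample values here are made up):
import numpy as np
import pandas as pd

data = pd.DataFrame({'sample_id': ['hph_1', np.nan, 'zent_2']})
# with NaN in the column, str.contains returns an object Series: True, NaN, False
mask = data.sample_id.str.contains('hph')
data[mask]  # raises the ValueError above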
What works:
# get the integer positions of all rows whose sample_id contains 'zent'
mask = [index for index, item in enumerate(data.sample_id.values) if 'zent' in str(item)]
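Since the comprehension collects integer positions, the matching rows can then be pulled with iloc:
data.iloc[mask]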
Is there a more elegant and faster method (similar to str.contains()) than this one?
You can set the parameter na in str.contains to False:
print (df.a.str.contains('hph', na=False))
Using EdChum's sample:
df = pd.DataFrame({'a':['hph', np.NaN, 'sadhphsad', 'hello']})
print (df)
           a
0        hph
1        NaN
2  sadhphsad
3      hello
print (df.a.str.contains('hph', na=False))
0     True
1    False
2     True
3    False
Name: a, dtype: bool
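The mask can then be passed straight to the indexer to select only the matching rows, for example:
print (df[df.a.str.contains('hph', na=False)])
           a
0        hph
2  sadhphsad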
IIUC you can also filter those rows out:
data['sample'].dropna().str.contains('hph')
Example:
In [38]:
df = pd.DataFrame({'a':['hph', np.NaN, 'sadhphsad', 'hello']})
df
Out[38]:
           a
0        hph
1        NaN
2  sadhphsad
3      hello
In [39]:
df['a'].dropna().str.contains('hph')
Out[39]:
0     True
2     True
3    False
Name: a, dtype: bool
So by calling dropna first you can then safely use str.contains on the Series, as there will be no NaN values.
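Note that the dropna mask is shorter than df, so it cannot be passed to df[...] directly; one way (a sketch, not part of the original answer) to recover the matching rows is to select the True index labels:
mask = df['a'].dropna().str.contains('hph')
df.loc[mask[mask].index]
           a
0        hph
2  sadhphsad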
Another way to handle the null values would be to use notnull:
In [43]:
(df['a'].notnull()) & (df['a'].str.contains('hph'))
Out[43]:
0 True
1 False
2 True
3 False
Name: a, dtype: bool
but I think passing na=False would be cleaner (@jezrael's answer).
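For completeness, the combined mask from Out[43] filters rows the same way:
df[(df['a'].notnull()) & (df['a'].str.contains('hph'))]
           a
0        hph
2  sadhphsad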