If you would like to filter the rows for which a given string occurs in a column value, it is possible to use something like data.sample_id.str.contains('hph') (answered before: check if string in pandas dataframe column is in list, or Check if string is in a pandas dataframe).
However, my lookup column contains empty cells. Therefore, str.contains() yields NaN values and I get a ValueError upon indexing:
ValueError: cannot index with vector containing NA / NaN values
What works:
# get all runs: collect the indices of rows whose sample_id contains 'zent'
mask = [index for index, item in enumerate(data.sample_id.values) if 'zent' in str(item)]
Is there a more elegant and faster method (similar to str.contains()) than this one?
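For reference, a self-contained sketch reproducing the problem (the sample values here are invented for illustration; the exact error message may vary between pandas versions):
import numpy as np
import pandas as pd

# a lookup column with an empty (NaN) cell
data = pd.DataFrame({'sample_id': ['hph_1', np.nan, 'zent_2']})

# str.contains propagates the NaN, so the mask is not purely boolean
mask = data.sample_id.str.contains('hph')
# data[mask]  # raises ValueError: cannot index with vector containing NA / NaN values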
You can set the parameter na in str.contains to False:
print (df.a.str.contains('hph', na=False))
Using EdChum's sample:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': ['hph', np.nan, 'sadhphsad', 'hello']})
print (df)
           a
0        hph
1        NaN
2  sadhphsad
3      hello
print (df.a.str.contains('hph', na=False))
0     True
1    False
2     True
3    False
Name: a, dtype: bool
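To actually filter the rows rather than just print the mask, the boolean Series can be passed straight to the indexer (a small usage sketch, not part of the original answer):
print (df[df.a.str.contains('hph', na=False)])
           a
0        hph
2  sadhphsad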
IIUC you can also filter those rows out:
data['sample_id'].dropna().str.contains('hph')
Example:
In [38]:
df = pd.DataFrame({'a': ['hph', np.nan, 'sadhphsad', 'hello']})
df
Out[38]:
           a
0        hph
1        NaN
2  sadhphsad
3      hello
In [39]:
df['a'].dropna().str.contains('hph')
Out[39]:
0     True
2     True
3    False
Name: a, dtype: bool
So by calling dropna first, you can then safely use str.contains on the Series, as there will be no NaN values.
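If you then need the matching rows of the original frame, note that dropna keeps the original index labels, so one option (a sketch, not from the answer above) is to select the labels where the mask is True:
# mask is indexed 0, 2, 3 because the NaN row was dropped
mask = df['a'].dropna().str.contains('hph')
# select the surviving labels where the mask is True -> rows 0 and 2
print (df.loc[mask[mask].index])
           a
0        hph
2  sadhphsad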
Another way to handle the null values would be to use notnull:
In [43]:
(df['a'].notnull()) & (df['a'].str.contains('hph'))
Out[43]:
0     True
1    False
2     True
3    False
Name: a, dtype: bool
But I think passing na=False would be cleaner (@jezrael's answer).
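As a further alternative (not shown in the answers above), you can replace the missing values with empty strings first; an empty string never matches the pattern, so the NaN rows simply come back False:
print (df['a'].fillna('').str.contains('hph'))
0     True
1    False
2     True
3    False
Name: a, dtype: bool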