Say I have a DataFrame my_df with a column 'brand'. I would like to drop any rows where brand is either toyota or bmw.
I thought the following would do it:
import re

my_regex = re.compile('^(bmw$|toyota$).*$')
my_function = lambda x: my_regex.match(x.lower())
my_df[~my_df['brand'].apply(my_function)]
but I get the error:
ValueError: cannot index with vector containing NA / NaN values
Why? How can I filter my DataFrame using a regex?
I think re.match returns None when there is no match, and that breaks the boolean indexing. Below is an alternative solution using pandas' vectorized string methods; note that they handle null values:
>>> import re
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame({'brand': ['BMW', 'FORD', np.nan, None, 'TOYOTA', 'AUDI']})
>>> df
brand
0 BMW
1 FORD
2 NaN
3 None
4 TOYOTA
5 AUDI
[6 rows x 1 columns]
>>> idx = df.brand.str.contains('^bmw$|^toyota$',
...                             flags=re.IGNORECASE, regex=True, na=False)
>>> idx
0 True
1 False
2 False
3 False
4 True
5 False
Name: brand, dtype: bool
>>> df[~idx]
brand
1 FORD
2 NaN
3 None
5 AUDI
[4 rows x 1 columns]
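If you'd rather keep the apply approach from the question, here is a rough sketch (using the same df as above, with the regex simplified to anchor the whole string) that coerces the match result to a boolean and skips non-string values:

>>> my_regex = re.compile('^(bmw|toyota)$')
>>> mask = df['brand'].apply(lambda x: bool(my_regex.match(x.lower())) if isinstance(x, str) else False)
>>> df[~mask]

That should select the same four rows as df[~idx]. And if an exact, case-insensitive match is all you need, df[~df['brand'].str.lower().isin(['bmw', 'toyota'])] should give the same result without a regex.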