I have a Pandas dataframe called df with the following 3 columns: id, creation_date and email.
I want to return all rows where the email column contains a strictly numeric combination (digits only), followed by a 'plus' sign, and then followed by anything.
For example:
- [email protected], [email protected] will meet my criteria.
- [email protected] and [email protected] will not, because they contain non-numeric characters before the 'plus' sign.
I know df.email.str.contains('\+') won't work because it returns every row that contains a 'plus' sign. I tried df.filter(['email'], regex=r'([^0-9])' % '\+', axis=0), but it raised TypeError: not all arguments converted during string formatting.
Can anyone advise?
Thanks very much!
You can use contains, but match should be sufficient:
# example data
import pandas as pd

data = ["[email protected]", "[email protected]",
        "[email protected]", "[email protected]"]
df = pd.DataFrame(data, columns=["email"])
df
email
0 [email protected]
1 [email protected]
2 [email protected]
3 [email protected]
Now use match:
df.email.str.match(r"\d+\+.*")
0 True
1 True
2 False
3 False
Name: email, dtype: bool
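To actually return the rows rather than just the boolean Series, the mask can be used to index the dataframe. A minimal sketch, assuming made-up placeholder addresses (not the ones from the question) and a possibly missing email; the `na=False` argument is there so that missing values count as non-matches:

```python
import pandas as pd

# Hypothetical data for illustration only; not the asker's addresses
df = pd.DataFrame({
    "id": [1, 2, 3],
    "email": ["123+@example.com", "abc123+@example.com", None],
})

# str.match anchors the pattern at the start of each string.
# na=False treats missing emails as non-matches, so the mask is purely boolean.
mask = df["email"].str.match(r"\d+\+", na=False)

# Boolean indexing keeps only the rows where the mask is True
matching_rows = df[mask]
matching_ids = matching_rows["id"].tolist()
print(matching_rows)
```

The trailing `.*` is not needed with match, since it only requires the pattern to match at the beginning of the string.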
Note the difference between contains and match, from the docs:
contains: analogous, but less strict, relying on re.search instead of re.match
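The same difference shows up with the standard library's re module directly. This is only a sketch with made-up placeholder addresses, not the ones from the question:

```python
import re

pattern = r"\d+\+"  # one or more digits followed by a literal plus sign

# Hypothetical addresses, for illustration only
emails = ["123+@example.com", "abc123+@example.com", "123abc+@example.com"]

# re.match only succeeds if the pattern matches at the start of the string
match_hits = [bool(re.match(pattern, e)) for e in emails]

# re.search succeeds if the pattern matches anywhere in the string
search_hits = [bool(re.search(pattern, e)) for e in emails]

print(match_hits)
print(search_hits)
```

The second address is the interesting one: search finds "123+" in the middle of it, while match rejects it because the string starts with letters.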
Try this:
df.email.str.contains(r'^\d+\+\@')
Breaking down the regular expression:
^ anchors the match at the beginning of the email string
\d+ matches one or more digit characters
\+ escapes the plus sign so that it matches a literal +
\@ matches a literal @ (the backslash is optional, since @ has no special meaning in a regex) and requires the plus sign to come immediately before the @, at the end of the part of the email that precedes the @
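Note that requiring the @ right after the plus sign is stricter than "followed by anything". A small sketch of the two variants side by side, assuming made-up placeholder data (the addresses and ids are hypothetical, and creation_date is left out for brevity):

```python
import pandas as pd

# Hypothetical data for illustration only; not the asker's addresses
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "email": ["123+@example.com", "456+tag@example.com",
              "abc123+@example.com", "123abc+@example.com"],
})

# Digits, a plus sign, then the @ immediately (@ needs no escaping)
strict = df[df["email"].str.contains(r"^\d+\+@")]

# Digits, a plus sign, then anything at all
loose = df[df["email"].str.contains(r"^\d+\+")]

strict_ids = strict["id"].tolist()
loose_ids = loose["id"].tolist()
print(strict_ids, loose_ids)
```

If text between the plus sign and the @ should be allowed, dropping the @ from the pattern (the second variant) is probably what you want.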