I have a Pandas dataframe called df with the following 3 columns: id, creation_date and email.
I want to return all rows where the email column contains any strictly numeric combination (must be strictly numbers) followed by a 'plus' sign and then followed by anything.
For example:
- [email protected], [email protected] will meet my criteria.
- [email protected] and [email protected] will not, because they contain non-numeric characters before the 'plus' sign.
I know df.email.str.contains('\+') won't work because it will return everything that contains a 'plus' sign. I tried df.filter(['email'], regex=r'([^0-9])' % '\+', axis=0), but it threw an error message that read TypeError: not all arguments converted during string formatting.
Can anyone advise?
Thanks very much!
You can use contains, but match should be sufficient:
import pandas as pd

# example data
data = ["[email protected]", "[email protected]",
        "[email protected]", "[email protected]"]
df = pd.DataFrame(data, columns=["email"])
df
email
0 [email protected]
1 [email protected]
2 [email protected]
3 [email protected]
Now use match:
df.email.str.match(r"\d+\+.*")
0 True
1 True
2 False
3 False
Name: email, dtype: bool
Note the difference between contains and match, from the docs:
contains: analogous, but less strict, relying on re.search instead of re.match
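To make the difference concrete, here is a minimal sketch. The sample addresses and the id values are hypothetical (they are not the ones from the question), and the trailing .* is dropped because it does not change which rows match. It also shows how the boolean mask can be used to return the matching rows.

```python
import pandas as pd

# Hypothetical sample data, chosen only to illustrate the behaviour
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "email": [
        "123+promo@example.com",     # digits only before the plus
        "4567+x@example.com",        # digits only before the plus
        "abc123+promo@example.com",  # letters, then digits, then plus
        "12ab+promo@example.com",    # digits, then letters, then plus
    ],
})

pattern = r"\d+\+"

# str.match anchors at the start of each string (like re.match)
match_mask = df["email"].str.match(pattern)

# str.contains looks anywhere in the string (like re.search),
# so "abc123+..." also passes because "123+" appears inside it
contains_mask = df["email"].str.contains(pattern)

# Boolean indexing returns the rows themselves rather than the mask
matched_rows = df[match_mask]
print(matched_rows)
```

With contains, the third address would slip through unless the pattern is anchored with ^, which is why match is the simpler choice here.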
Try this:
df.email.str.contains(r'^\d+\+\@')
Breaking down the regular expression:
- ^ ensures that we are starting at the beginning of the email string
- \d+ matches one or more digit (numeric) characters
- \+ escapes the plus sign so that it matches a literal '+'
- \@ matches a literal @ (the backslash is optional), so the plus sign must sit immediately before the @
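As a rough check of what the trailing @ does, the sketch below compares the pattern with and without it. The addresses are hypothetical, and the relaxed pattern is an assumption about what the question means by "followed by anything", not part of the answer above.

```python
import pandas as pd

# Hypothetical addresses, not taken from the question
emails = pd.Series([
    "123+@example.com",       # digits, plus, then @ immediately
    "123+promo@example.com",  # digits, plus, then more text before the @
    "abc123+@example.com",    # letters before the digits
])

# Pattern from the answer: the plus must be directly followed by @
strict = emails.str.contains(r"^\d+\+@")

# Without the @, anything may follow the plus sign
relaxed = emails.str.contains(r"^\d+\+")

print(pd.DataFrame({"email": emails, "strict": strict, "relaxed": relaxed}))
```

If addresses such as the second one should count as matches, the relaxed form is probably the one to use.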