I'm cleaning up spam accounts in my forum, and found a bunch of email addresses with the following format:
[email protected]
[email protected]
[email protected]
Gmail treats all of these as the same account, whereas the forum software treats them as distinct addresses, so spammers use this trick to reuse one email address again and again when creating spam forum accounts.
To identify them, I need to strip out all the periods before the @gmail.com. Then it's easy to identify all the duplicate accounts.
Fortunately, MariaDB 10 has a new REGEXP_REPLACE function designed for exactly these types of problems. Unfortunately, I can't figure out the correct regex.
My primary stumbling block is that the number of periods varies drastically, and I'm not sure how to write a regex when the number of items to match differs from one string to the next. I've found as many as 8 periods in one of these email addresses, and they can show up anywhere in the string.
It'd be easy if I could just strip out all periods, but I can't, because the @gmail.com needs to stay untouched. Additionally, this regex should only match @gmail.com addresses and ignore other email providers.
How do I do this?
There's another trick with Gmail addresses: any text after a + character is ignored, so e.g. [email protected] and [email protected] are effectively the same address.
You can use this pattern to remove all text after a + character, as well as all dots (shamelessly based on Raj's pattern, please don't hate me):
(?:\.|\+.*)(?=.*?@gmail\.com)
(replace with the empty string)
regex101 demo.
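To see what the pattern does without touching the database, here is a minimal sketch in Python. Python's `re` module is only a stand-in here; the assumption is that MariaDB's REGEXP_REPLACE handles the lookahead the same way, and backslashes will probably need doubling inside a SQL string literal. The function name and the sample addresses are made up for illustration.

```python
import re

# Pattern from the answer above: a dot, or a "+" and everything after it,
# but only where "@gmail.com" still follows later in the string.
PLUS_AND_DOTS = re.compile(r"(?:\.|\+.*)(?=.*?@gmail\.com)")


def normalize(address: str) -> str:
    """Strip dots and any +suffix that appear before @gmail.com (hypothetical helper)."""
    return PLUS_AND_DOTS.sub("", address)


# Hypothetical sample addresses, invented for this sketch.
print(normalize("j.o.h.n+forum@gmail.com"))
print(normalize("jane.doe@example.org"))
```

The dot inside gmail.com survives because nothing after it satisfies the lookahead, and the second address is left alone because it never contains @gmail.com.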
Use a positive lookahead assertion to match all the dots that appear before the @gmail.com:
\.(?=.*?@gmail\.com)
Then replace the matched dots with an empty string.
DEMO
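Once the dots are stripped, finding the duplicate accounts is a matter of grouping on the normalized address. The sketch below shows the idea in Python rather than SQL; the helper name and the sample addresses are hypothetical, and in MariaDB the same grouping would presumably be done on the output of REGEXP_REPLACE.

```python
import re
from collections import defaultdict

# Pattern from the answer above: any dot that is still followed by "@gmail.com".
DOTS_BEFORE_GMAIL = re.compile(r"\.(?=.*?@gmail\.com)")


def find_duplicates(addresses):
    """Group addresses by their dot-stripped form and keep groups with more than one entry."""
    groups = defaultdict(list)
    for address in addresses:
        key = DOTS_BEFORE_GMAIL.sub("", address)
        groups[key].append(address)
    return {key: group for key, group in groups.items() if len(group) > 1}


# Hypothetical sample data, invented for this sketch.
sample = ["s.pam.mer@gmail.com", "spam.mer@gmail.com", "someone.else@example.org"]
print(find_duplicates(sample))
```

Non-Gmail addresses pass through unchanged, so they only end up in a group if the exact same address was registered twice.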