I'm trying to figure out how to detect extra characters within a spam word like:
pha.rmacy
or vi*agra
Any ideas?
You could use a (dis)similarity metric, such as edit distance. For instance, the edit distance between vi.agra and viagra is 1.
You then treat a given word as a match for the spam word if the edit distance between them is below some threshold, say 2.
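As a rough illustration of this idea, here is a minimal sketch in Python using the classic Levenshtein edit distance (insertions, deletions, and substitutions each cost 1). The function names `edit_distance` and `looks_like` are made up for this example, and the threshold of 2 is just the value suggested above, not a tuned number.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a and b (row-by-row dynamic programming)."""
    # prev[j] holds the distance between the first i-1 chars of a and the first j chars of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or keep) the character
            ))
        prev = curr
    return prev[-1]


def looks_like(word: str, spam_word: str, threshold: int = 2) -> bool:
    """True if word is within (strictly below) the threshold distance of spam_word."""
    return edit_distance(word.lower(), spam_word.lower()) < threshold


print(edit_distance("vi.agra", "viagra"))
print(looks_like("pha.rmacy", "pharmacy"))
```

Note that a threshold this small will also accept legitimate words that happen to be one edit away from a spam word, so it may be worth scaling the threshold with word length.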
But if you really want to use a regex, you can use something like /[^a-zA-Z0-9-\s]/
to remove punctuation from the word. But then again, you would fail to identify something like viZagra as being the same word as viagra.
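For what it's worth, the punctuation-stripping approach could look roughly like this in Python. The helper name `strip_punctuation` is hypothetical, and the hyphen in the character class is escaped here only to make its literal meaning explicit.

```python
import re

# Anything that is not a letter, digit, hyphen, or whitespace.
NON_WORD = re.compile(r"[^a-zA-Z0-9\-\s]")


def strip_punctuation(word: str) -> str:
    """Remove every character matched by NON_WORD from the word."""
    return NON_WORD.sub("", word)


print(strip_punctuation("vi*agra"))
print(strip_punctuation("pha.rmacy"))
print(strip_punctuation("viZagra"))  # unchanged: the extra character is a letter
```

The last call shows the limitation mentioned above: an inserted letter survives the cleanup, so the word still does not compare equal to viagra.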
Regular expressions do not seem like the appropriate tool for this. But since the question is interesting, here is an attempt anyway. A simple way would be to do something like this:
/v.?i.?a.?g.?r.?a/
It allows zero or one arbitrary character between each pair of adjacent letters.
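A small sketch of how such a pattern could be generated for any spam word instead of written by hand, again in Python. The helper name `loose_pattern` is made up for this example, and case-insensitive matching is an added assumption, not part of the pattern above.

```python
import re


def loose_pattern(spam_word: str) -> re.Pattern:
    """Build a regex allowing at most one extra character between adjacent letters."""
    # re.escape guards against spam words that contain regex metacharacters.
    body = ".?".join(re.escape(ch) for ch in spam_word)
    return re.compile(body, re.IGNORECASE)


pattern = loose_pattern("viagra")
print(pattern.pattern)

for candidate in ["viagra", "vi*agra", "viZagra", "v.i.a.g.r.a", "vi**agra"]:
    print(candidate, bool(pattern.search(candidate)))
```

Because `.?` accepts any character, including a space, the pattern also matches inside innocent text such as "via grape", and it misses words with two or more extra characters in a row, such as vi**agra.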