
Regex: ignore extra characters

Tags:

regex

I'm trying to figure out how to detect extra characters within a spam word like:

pha.rmacy or vi*agra

Any ideas?

Fuxi asked Mar 24 '10 23:03


2 Answers

You could use a (dis)similarity metric, such as edit distance. For instance, the edit distance between vi.agra and viagra is 1, because removing the single "." turns one into the other.

You then treat a given word as the same as the spam word if the edit distance between them is below a certain threshold, say 2.
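As a rough sketch of that idea, here is a plain Levenshtein distance in Python with a threshold check on top. The function names (edit_distance, looks_like), the lowercasing, and the default threshold of 2 are assumptions for illustration, not part of any particular library.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, or substitutions turning a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or keep) the character
            ))
        prev = curr
    return prev[-1]


def looks_like(word: str, spam_word: str, threshold: int = 2) -> bool:
    """True if word is within (strictly below) threshold edits of spam_word.
    Lowercasing both sides is an assumption made for this sketch."""
    return edit_distance(word.lower(), spam_word.lower()) < threshold


print(edit_distance("vi.agra", "viagra"))
print(looks_like("pha.rmacy", "pharmacy"))
print(looks_like("niagara", "viagra"))
```

A threshold this low keeps unrelated words out, but short spam words may still collide with legitimate ones, so the threshold probably needs tuning per word.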

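But if you really want to use a regex, you can strip punctuation from the word by replacing every match of /[^a-zA-Z0-9-\s]/ (any character that is not a letter, a digit, a hyphen, or whitespace) with the empty string, and then compare the result to the spam word. Then again, this fails on inserted letters or digits: it would not identify something like viZagra as being the same word as viagra.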
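A minimal Python sketch of that stripping step follows. The helper name strip_punctuation is hypothetical, and the hyphen in the character class is escaped only to make its literal meaning explicit; the class is otherwise the one from the answer.

```python
import re

# Anything that is NOT a letter, digit, hyphen, or whitespace.
NON_WORD = re.compile(r"[^a-zA-Z0-9\-\s]")


def strip_punctuation(text: str) -> str:
    """Remove every character matched by NON_WORD."""
    return NON_WORD.sub("", text)


print(strip_punctuation("pha.rmacy"))  # punctuation removed
print(strip_punctuation("vi*agra"))    # punctuation removed
print(strip_punctuation("viZagra"))    # unchanged: Z is a letter
print(strip_punctuation("vi-agra"))    # unchanged: hyphens are kept
```

Because hyphens are deliberately excluded from the match, a hyphenated variant such as vi-agra passes through untouched and would need separate handling.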

João Silva answered Oct 20 '22 00:10


Regular expressions do not seem like the appropriate tool for this. But since the question is interesting, here is an attempt anyway. A simple way would be to do something like this:

/v.?i.?a.?g.?r.?a/

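It allows zero or one arbitrary character between each pair of adjacent letters, so it matches viagra, vi*agra, and viZagra, but not a variant with two extra characters in the same gap.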
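To avoid writing such a pattern by hand for every spam word, it could be generated. The sketch below assumes Python's re module; gapped_pattern is a hypothetical helper name, and re.escape is used so that spam words containing regex metacharacters are still matched literally.

```python
import re


def gapped_pattern(word: str) -> "re.Pattern[str]":
    """Build a regex allowing zero or one arbitrary character
    between each pair of adjacent letters of word."""
    return re.compile(".?".join(re.escape(ch) for ch in word))


viagra = gapped_pattern("viagra")
print(viagra.pattern)

for candidate in ["viagra", "vi*agra", "viZagra", "vi**agra"]:
    # search() finds the pattern anywhere in the candidate string.
    print(candidate, bool(viagra.search(candidate)))
```

Note that search() also finds the word inside longer text, and that "." does not match a newline by default, so both choices may need adjusting for real input.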

Mark Wilkins answered Oct 19 '22 22:10