Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Detecting spammers with MySQL

Tags:

php

mysql

spam

I see an ever increasing number of users signing up on my site to just send duplicate SPAM messages to other users. I've added some server side code to detect duplicate messages with the following mysql query:

  SELECT count(content) as msgs_sent 
    FROM messages 
   WHERE sender_id = '.$sender_id.' 
GROUP BY content having count(content) > 10

The query works well but now they're getting around this by changing a few charctersr in their messages. Is there a way to detect this with MySQL or do I need to look at each grouping returned from MySQL and then use PHP to determine the percentage of similarity?

Any thoughts or suggestions?

like image 367
Paul Avatar asked Oct 08 '22 16:10

Paul


1 Answers

Fulltext Match

You could look at implementing something similar to the MATCH example here:

mysql> SELECT id, body, MATCH (title,body) AGAINST
    -> ('Security implications of running MySQL as root') AS score
    -> FROM articles WHERE MATCH (title,body) AGAINST
    -> ('Security implications of running MySQL as root');
+----+-------------------------------------+-----------------+
| id | body                                | score           |
+----+-------------------------------------+-----------------+
|  4 | 1. Never run mysqld as root. 2. ... | 1.5219271183014 |
|  6 | When configured properly, MySQL ... | 1.3114095926285 |
+----+-------------------------------------+-----------------+
2 rows in set (0.00 sec)

So for your example, perhaps:

SELECT id, MATCH (content) AGAINST ('your string') AS score
FROM messages 
WHERE MATCH (content) AGAINST ('your string')
    AND score > 1;

Note that to use these functions your content column would need to be a FULLTEXT index.

What is score in this example?

It is a relevance value. It is computed through the process described below:

Every correct word in the collection and in the query is weighted according to its significance in the collection or query. Consequently, a word that is present in many documents has a lower weight (and may even have a zero weight), because it has lower semantic value in this particular collection. Conversely, if the word is rare, it receives a higher weight. The weights of the words are combined to compute the relevance of the row.

From the documentation page.

like image 113
Michael Robinson Avatar answered Oct 12 '22 12:10

Michael Robinson