
Algorithm for separating nonsense text from meaningful text

I added a feedback function to some of my programs. Unfortunately, I forgot to include any kind of spam protection, so users can send anything they want to my server, where every feedback message is stored in a huge database.

At first I checked the feedback periodically, keeping what was usable and deleting the garbage. The problem is that I get 900 feedback messages per day. Only 4-5 of them are really useful; the rest are mostly two types of gibberish:

  • nonsense: jfvgasdjkfahs kdlfjhasdf (people smashing their heads on the keyboard)
  • text in languages I don't understand

What I have done so far:

  1. I installed a filter that deletes any feedback containing "asdf", "qwer", etc. -> only 700 per day

  2. I installed a word filter that deletes anything containing bad language -> 600 per day (don't ask, but there are many strange people out there)

  3. I filter out any message containing letters that are not used in my language -> 400 per day
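
For reference, the three filters listed above could be sketched roughly as follows. This is only an illustration, not the code actually running on the server: the word lists, the allowed character set (plain ASCII here, since the question does not say which language is meant) and the function name `passes_filters` are all assumptions.

```python
import re

# Hypothetical lists: the question only names "asdf" and "qwer".
KEYBOARD_RUNS = ("asdf", "qwer")
BAD_WORDS = ("darn",)  # placeholder, replace with a real word list

# Assumed alphabet: ASCII letters, digits, whitespace and a little
# punctuation. Extend this for the language you actually expect.
ALLOWED_TEXT = re.compile(r"[A-Za-z0-9\s.,;:!?'()\-]*")


def passes_filters(message: str) -> bool:
    """Return True if the message survives all three filters."""
    lowered = message.lower()
    # Filter 1: typical keyboard-row runs.
    if any(run in lowered for run in KEYBOARD_RUNS):
        return False
    # Filter 2: bad language.
    if any(word in lowered for word in BAD_WORDS):
        return False
    # Filter 3: every character must belong to the expected alphabet.
    return ALLOWED_TEXT.fullmatch(message) is not None
```

Filters like these only catch patterns that were anticipated in advance, which fits the observation that 400 messages per day still get through.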

But 400 per day is still far too many. So I'm wondering whether anybody has dealt with this problem before and knows an algorithm for filtering out senseless messages.

Any help would really be appreciated!

Chris asked Feb 01 '09 22:02


2 Answers

How about using an existing implementation of a Bayesian spam filter instead of implementing your own? I have had good results with DSpam.
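
To give an idea of what a Bayesian filter does under the hood, here is a minimal naive Bayes sketch. It is not DSpam's API or implementation, only an illustration of the general technique; the class name, the labels "useful" and "junk", and the training sentences in the usage note are made up for the example.

```python
import math
import re
from collections import Counter

LABELS = ("useful", "junk")


class NaiveBayesFilter:
    """Tiny word-based naive Bayes classifier with add-one smoothing."""

    def __init__(self):
        self.token_counts = {label: Counter() for label in LABELS}
        self.message_counts = {label: 0 for label in LABELS}

    @staticmethod
    def tokenize(text):
        # Lowercase words made of ASCII letters and apostrophes.
        return re.findall(r"[a-z']+", text.lower())

    def train(self, text, label):
        self.token_counts[label].update(self.tokenize(text))
        self.message_counts[label] += 1

    def log_score(self, text, label):
        counts = self.token_counts[label]
        total_tokens = sum(counts.values())
        vocabulary = set()
        for label_counts in self.token_counts.values():
            vocabulary.update(label_counts)
        # Smoothed prior, so an untrained label never yields log(0).
        total_messages = sum(self.message_counts.values())
        prior = (self.message_counts[label] + 1) / (total_messages + len(LABELS))
        score = math.log(prior)
        for token in self.tokenize(text):
            # Add-one smoothing; the extra +1 reserves mass for unseen tokens.
            score += math.log((counts[token] + 1) / (total_tokens + len(vocabulary) + 1))
        return score

    def classify(self, text):
        return max(LABELS, key=lambda label: self.log_score(text, label))
```

Usage would be to call `train()` on messages that have already been sorted by hand (for example the ones kept or deleted so far) and then `classify()` on new ones. With word tokens, a gibberish string the filter has never seen contributes only smoothed counts, so in practice a mature filter or different features (character n-grams, say) would likely do better than this sketch.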

John Nilsson answered Sep 30 '22 20:09


A slightly different approach would be to set up a system that emails the feedback messages to an account and use standard spam filtering. You could send them through Gmail and let its filtering take a shot at it. Not perfect, but not much effort to implement either.
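
A rough sketch of the forwarding step, using only Python's standard library. The addresses, the subject format and both function names are hypothetical, and the SMTP host, port and credentials depend entirely on the mail provider and account setup, so they are left as parameters.

```python
import smtplib
from email.message import EmailMessage

# Hypothetical addresses, replace with real ones.
FEEDBACK_SENDER = "feedback-bot@example.com"
FEEDBACK_INBOX = "feedback-inbox@example.com"


def build_feedback_email(feedback_id: int, text: str) -> EmailMessage:
    """Wrap one stored feedback entry in a plain-text email."""
    msg = EmailMessage()
    msg["From"] = FEEDBACK_SENDER
    msg["To"] = FEEDBACK_INBOX
    msg["Subject"] = f"Feedback #{feedback_id}"
    msg.set_content(text)
    return msg


def send_feedback_email(msg: EmailMessage, host: str, port: int,
                        user: str, password: str) -> None:
    """Send the message over SMTP with implicit TLS.

    Assumes the provider accepts a username/password login on this port.
    """
    with smtplib.SMTP_SSL(host, port) as smtp:
        smtp.login(user, password)
        smtp.send_message(msg)
```

The receiving account's spam filter then does the sorting; whatever ends up in the inbox is what is worth reading.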

Rob Walker answered Sep 30 '22 19:09