Algorithm for separating nonsense text from meaningful text

Question

I added a feedback function to some of my programs. Unfortunately, I forgot to include any spam protection, so users can send anything they want to my server, where every message is stored in a huge database.

In the beginning I checked the feedback periodically, kept what was usable, and deleted the garbage. The problem is that I get 900 feedback messages per day. Only 4-5 are really useful; the rest are mostly two types of gibberish:

nonsense: jfvgasdjkfahs kdlfjhasdf (people smashing their heads on the keyboard)
languages I don't understand

What I did so far:

I installed a filter to delete any feedback containing "asdf", "qwer", etc. -> only 700 per day
I installed a word filter to delete anything containing bad language -> 600 per day (don't ask, but there are many strange people out there)
I filter out any message containing letters not used in my language -> 400 per day
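
For readers who want to reproduce the three filters listed above, here is a minimal Python sketch. The post does not give the full blocklists or name the author's language, so `KEYBOARD_MASH`, `BAD_WORDS`, and `ALLOWED_CHARS` are hypothetical placeholders, and plain ASCII text is assumed as the "allowed" alphabet.

```python
import re
import string

# Hypothetical blocklists; the original post only mentions "asdf", "qwer" etc.
KEYBOARD_MASH = ("asdf", "qwer")
BAD_WORDS = {"darn"}  # placeholder entry, not from the original post

# Assumption: the author's language can be written with ASCII characters.
ALLOWED_CHARS = set(string.ascii_letters + string.digits
                    + string.punctuation + " \t\r\n")


def passes_filters(message: str) -> bool:
    """Return True if the message survives all three filters."""
    text = message.lower()

    # Filter 1: keyboard-mash fragments anywhere in the text.
    if any(fragment in text for fragment in KEYBOARD_MASH):
        return False

    # Filter 2: whole words found in the bad-language list.
    words = re.findall(r"[a-z']+", text)
    if any(word in BAD_WORDS for word in words):
        return False

    # Filter 3: any character outside the allowed alphabet.
    if any(ch not in ALLOWED_CHARS for ch in message):
        return False

    return True
```

Fixed lists like these only catch what they were written for, which may be why 400 messages per day still get through.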

But 400 per day is still way too many. So I'm wondering whether anybody has dealt with this problem before and knows an algorithm for filtering out senseless messages.

Any help would really be appreciated!

John Nilsson · Accepted Answer

How about using an existing implementation of a Bayesian spam filter instead of implementing your own? I have had good results with DSpam.
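
To illustrate the idea behind a Bayesian filter, here is a small from-scratch naive Bayes sketch in Python. It is not DSpam's API or algorithm; the class `NaiveBayesFilter`, the labels `"useful"` and `"junk"`, and the tokenizer are all assumptions made for this example. It would be trained on feedback that has already been sorted by hand.

```python
import math
import re
from collections import Counter

LABELS = ("useful", "junk")  # hypothetical label names


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayesFilter:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = {label: Counter() for label in LABELS}
        self.doc_counts = {label: 0 for label in LABELS}

    def train(self, text, label):
        self.word_counts[label].update(tokenize(text))
        self.doc_counts[label] += 1

    def _log_score(self, tokens, label, vocab_size):
        counts = self.word_counts[label]
        total_words = sum(counts.values())
        # Log prior: share of training messages carrying this label.
        score = math.log(self.doc_counts[label] / sum(self.doc_counts.values()))
        # Log likelihood of each token, smoothed so unseen words are not zero.
        for token in tokens:
            score += math.log((counts[token] + 1) / (total_words + vocab_size))
        return score

    def classify(self, text):
        if any(self.doc_counts[label] == 0 for label in LABELS):
            raise ValueError("train at least one message per label first")
        vocab = set()
        for label in LABELS:
            vocab.update(self.word_counts[label])
        tokens = tokenize(text)
        return max(LABELS, key=lambda l: self._log_score(tokens, l, len(vocab)))
```

A possible usage, with made-up training messages:

```python
f = NaiveBayesFilter()
f.train("the export button crashes when saving large files", "useful")
f.train("please add a dark mode option to the settings", "useful")
f.train("jfvgasdjkfahs kdlfjhasdf", "junk")
f.train("lol lol lol hahaha", "junk")
label = f.classify("saving files crashes the app")
```

An established filter such as the one recommended above handles tokenization, storage, and retraining far more carefully than this sketch does.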

Rob Walker · Answer

A slightly different approach would be to set up a system that emails the feedback messages to an account and to use standard spam filtering on that account. You could send them through Gmail and let its filtering take a shot at it. Not perfect, but not too much effort to implement either.
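
As a rough sketch of the forwarding step, the following Python uses only the standard library. The addresses, the subject format, and the SMTP host and port are placeholders, not details from the answer; a real mail provider will typically also require authentication and TLS, which are omitted here.

```python
import smtplib
from email.message import EmailMessage

# Placeholder addresses; replace with a real inbox that has spam filtering.
FEEDBACK_INBOX = "feedback-inbox@example.com"
SENDER = "feedback-bot@example.com"


def build_feedback_email(feedback_id: int, body: str) -> EmailMessage:
    """Wrap one stored feedback entry in an email message."""
    msg = EmailMessage()
    msg["From"] = SENDER
    msg["To"] = FEEDBACK_INBOX
    msg["Subject"] = f"Feedback #{feedback_id}"
    msg.set_content(body)
    return msg


def send_feedback(msg: EmailMessage, host: str = "localhost", port: int = 25) -> None:
    """Hand the message to an SMTP server (host and port are assumptions)."""
    with smtplib.SMTP(host, port) as smtp:
        smtp.send_message(msg)
```

Whatever lands in the inbox rather than the spam folder is then the candidate set for manual review.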

Tags: algorithm, filter, nlp, word, spam

Chris
