
Naive Bayesian spam filtering effectiveness

How effective is naive Bayesian filtering for filtering spam?

I heard that spammers easily bypass them by stuffing extra non-spam-related words. What programming techniques can you use with Bayesian filters to prevent that?

Waleed Eissa asked Dec 12 '08 04:12




2 Answers

Paul Graham was the one who really introduced Bayesian spam filtering to the web at large with his original article, A Plan for Spam, in August 2002. His follow-up, Better Bayesian Filtering (January 2003), covered many of the problems that swiftly arose. Both are still great reading on the topic.

In the second article, Graham mentions using CRM114, which works on a much wider set of patterns than just space-delimited words. CRM114 is cool, but comes without much implementation help for a spam filtering system.

There are also open-source power tools for Bayesian spam filtering, such as Death2Spam and SpamProbe.

I find nothing works quite like filtering mail through a Gmail account. Happy hunting.

danieltalsky answered Nov 18 '22 19:11


To defeat the kind of spam attack you mention, I think the important thing is not the learning method but the features you train on. I use Fidelis Assis's OSBF-Lua, a very successful filter that keeps winning spam-filter contests. It uses Bayesian learning, but I think the real reason for its success is three principles:

  • It trains not on single words but on sparse bigrams: pairs of words separated by 0 to 4 "don't care" words. The spammers have to put their message in somewhere, and the sparse bigrams are very good at sussing it out. It even finds attachment spam!

  • It does extra training on message headers, because these are hard for spammers to disguise. Example: a message that originates on your network and never passes through an off-network relay host is probably not spam.

  • If the spam filter has low confidence about its classification, it requests input from a human. (In practice it adds a header field saying "please train me on this message"; the human can ignore the request.) This means that as the spammers evolve new techniques, your filter evolves to match.
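The "ask a human when unsure" idea in the last bullet can be sketched in a few lines of Python. This is only an illustration, not how OSBF-Lua does it: the threshold, the margin, and both header names (`X-Filter-Verdict`, `X-Filter-Train-Request`) are hypothetical, and the spam probability is assumed to come from whatever classifier you already have.

```python
from email.message import EmailMessage

# Hypothetical values: a real filter tunes these and picks its own header names.
SPAM_THRESHOLD = 0.5      # scores at or above this are labelled spam
UNCERTAIN_MARGIN = 0.2    # scores this close to the threshold count as low confidence
TRAIN_HEADER = "X-Filter-Train-Request"


def tag_message(msg, spam_probability):
    """Add a verdict header, plus a training request if the score is near the threshold."""
    verdict = "spam" if spam_probability >= SPAM_THRESHOLD else "ham"
    msg["X-Filter-Verdict"] = verdict
    if abs(spam_probability - SPAM_THRESHOLD) < UNCERTAIN_MARGIN:
        # Low confidence: ask the human, who is free to ignore the request.
        msg[TRAIN_HEADER] = "please train me on this message"
    return msg
```

A message scored 0.55 would get both headers, while one scored 0.95 would only get the verdict. The width of the uncertain band decides how often the user is asked to train.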

This combination of techniques is extremely effective.
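To make the first principle concrete, here is a minimal Python sketch of sparse-bigram feature extraction. It is not OSBF-Lua's actual code: the whitespace tokenizer, the lowercasing, and the choice to record the gap size inside each feature are my assumptions for illustration.

```python
import re


def sparse_bigrams(text, max_gap=4):
    """Return (first_word, gap, second_word) features.

    gap is the number of skipped "don't care" words between the pair,
    from 0 up to max_gap. Tokenizing on whitespace is an assumption.
    """
    tokens = re.findall(r"\S+", text.lower())
    features = []
    for i, first in enumerate(tokens):
        for gap in range(max_gap + 1):
            j = i + 1 + gap
            if j >= len(tokens):
                break  # no more words to pair with
            features.append((first, gap, tokens[j]))
    return features
```

With this sketch, every feature produced by "buy cheap pills" also appears in a stuffed message such as "nice weather today buy cheap pills kind regards". Padding the message with innocent words adds features but does not remove the ones contributed by the payload.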

Disclaimer: I have worked with Fidelis on refactoring some of the software so that it can be used for other purposes such as classifying regular mail into groups or possibly one day trying to detect spam in blog comments and other places.

Norman Ramsey answered Nov 18 '22 18:11