
Calculating the probability of a token being spam in a Bayesian spam filter

I recently wrote a Bayesian spam filter, using Paul Graham's article "A Plan for Spam" and a C# implementation of it that I found on CodeProject as references.

I just noticed that the CodeProject implementation uses the number of unique tokens when calculating the probability of a token being spam (e.g. if the ham corpus contains 10000 tokens in total but only 1500 unique tokens, 1500 is used as ngood). In my implementation I used the number of posts, as described in Paul Graham's article. This makes me wonder which of the following is the better choice for calculating the probability:

  1. Post count (as mentioned in Paul Graham's article)
  2. Total unique token count (as used in the implementation on CodeProject)
  3. Total token count
  4. Total included token count (i.e. those tokens with b + g >= 5)
  5. Total unique included token count
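
To make the difference between these options concrete, here is a minimal sketch of the per-token formula from Graham's article with ngood and nbad left as parameters, so that any of the candidate counts can be plugged in. The function name and all corpus figures below are made up for illustration (only the 10000 / 1500 ham figures echo the example in the question); this is not the CodeProject code.

```python
def graham_probability(good_count, bad_count, ngood, nbad):
    """Per-token spam probability in the style of "A Plan for Spam".

    ngood / nbad are whichever corpus sizes you decide to use
    (post count, total tokens, unique tokens, ...).
    """
    g = 2 * good_count  # Graham doubles ham counts to bias against false positives
    b = bad_count
    if g + b < 5:
        return None  # token is too rare to be scored
    ham_ratio = min(1.0, g / ngood)
    spam_ratio = min(1.0, b / nbad)
    p = spam_ratio / (ham_ratio + spam_ratio)
    return max(0.01, min(0.99, p))  # clamp away from 0 and 1


# Hypothetical corpus statistics, for illustration only.
denominators = {
    "posts": (1000, 1000),           # (ham posts, spam posts)
    "unique_tokens": (1500, 3000),   # (unique ham tokens, unique spam tokens)
    "total_tokens": (10000, 40000),  # (all ham tokens, all spam tokens)
}

# A token seen 10 times in ham and 30 times in spam.
results = {
    name: graham_probability(10, 30, ngood, nbad)
    for name, (ngood, nbad) in denominators.items()
}

for name, p in results.items():
    print(f"{name}: {p:.3f}")
```

With these made-up counts the same token scores roughly 0.60, 0.43 and 0.27 depending on the denominator, so the choice is not cosmetic whenever the ham and spam corpora differ in size or vocabulary.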
Waleed Eissa asked Apr 06 '09 01:04




1 Answer

This EACL paper by Karl-Michael Schneider (PDF) says you should use the multinomial model, which means using the total token count when calculating the probability. Please see the paper for the exact calculations.
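
As a rough illustration of what "total token count" means in practice, the following is a minimal sketch of the textbook multinomial naive Bayes estimate, where each token's likelihood is its count in a class divided by the total number of tokens in that class. The add-one (Laplace) smoothing, the function names, and the toy documents are assumptions made here for the example and are not taken from the paper, whose exact calculations may differ.

```python
import math
from collections import Counter


def train(docs):
    """Return (per-token counts, total token count) for one class."""
    counts = Counter()
    for tokens in docs:
        counts.update(tokens)
    return counts, sum(counts.values())


def token_likelihood(token, counts, total_tokens, vocab_size):
    # Multinomial estimate: the denominator is the TOTAL token count of the
    # class, plus the vocabulary size for add-one smoothing (an assumption).
    return (counts[token] + 1) / (total_tokens + vocab_size)


def spam_log_odds(tokens, spam, ham, vocab_size, prior_spam=0.5):
    """log P(spam | tokens) - log P(ham | tokens); positive means spam."""
    spam_counts, spam_total = spam
    ham_counts, ham_total = ham
    score = math.log(prior_spam) - math.log(1 - prior_spam)
    for t in tokens:
        score += math.log(token_likelihood(t, spam_counts, spam_total, vocab_size))
        score -= math.log(token_likelihood(t, ham_counts, ham_total, vocab_size))
    return score


# Toy training data, made up for illustration.
spam = train([["cheap", "pills", "cheap"], ["buy", "pills"]])
ham = train([["meeting", "notes"], ["buy", "lunch", "meeting"]])
vocab_size = len(set(spam[0]) | set(ham[0]))

spam_score = spam_log_odds(["cheap", "pills"], spam, ham, vocab_size)
ham_score = spam_log_odds(["meeting", "notes"], spam, ham, vocab_size)
print(spam_score, ham_score)
```

Working in log space avoids underflow when many small likelihoods are multiplied together; a positive score favours spam and a negative one favours ham.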

Yuval F answered Oct 03 '22 16:10
