I recently wrote a Bayesian spam filter, using Paul Graham's article "A Plan for Spam" and a C# implementation of it that I found on CodeProject as references.
I just noticed that the CodeProject implementation uses the total number of unique tokens when calculating the probability of a token being spam (e.g. if the ham corpus contains 10,000 tokens in total but only 1,500 unique tokens, the 1,500 is used as ngood), whereas my implementation uses the number of messages, as described in Paul Graham's article. Which of these is the better choice for calculating the probability?
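To make the difference concrete, here is a small Python sketch of Graham's per-token probability formula from "A Plan for Spam", run with both denominator conventions. All counts are made-up illustration values, and the function name is my own:

```python
def token_spam_prob(b, g, nbad, ngood):
    """Graham's per-token spam probability.
    b/g: occurrences of the token in the spam/ham corpus.
    nbad/ngood: the denominators under comparison (messages vs unique tokens)."""
    g = 2 * g  # Graham doubles ham counts to bias against false positives
    bad_freq = min(1.0, b / nbad)
    good_freq = min(1.0, g / ngood)
    # Clamp to (0.01, 0.99) as in the article
    return max(0.01, min(0.99, bad_freq / (bad_freq + good_freq)))

# Same token counts, two denominator conventions:
# Graham's article: number of messages in each corpus
p_messages = token_spam_prob(b=20, g=5, nbad=400, ngood=600)   # -> 0.75
# CodeProject variant: number of unique tokens in each corpus
p_unique = token_spam_prob(b=20, g=5, nbad=3000, ngood=1500)   # -> 0.5
```

As the example shows, the two conventions can produce noticeably different probabilities for the same token counts, which is why the choice matters.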
This computes the prior probability of any one email being spam by dividing the total number of spam emails by the total number of emails.
Because the spam filter uses a Bayesian approach, we can combine these by multiplying the spam probabilities of every word together, then dividing that product by the sum of itself and the product of each word's probability of not indicating spam.
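That combining step is Graham's rule P = Πp / (Πp + Π(1−p)). A minimal Python sketch (the per-token probabilities here are arbitrary example values):

```python
from math import prod

def combined_spam_prob(token_probs):
    """Graham's combining rule: P = prod(p_i) / (prod(p_i) + prod(1 - p_i))."""
    s = prod(token_probs)              # combined probability of being spam
    h = prod(1.0 - p for p in token_probs)  # combined probability of not being spam
    return s / (s + h)

score = combined_spam_prob([0.9, 0.8, 0.3])  # -> roughly 0.939
```

Note that one strongly "hammy" token (a probability near 0) can pull the whole score down sharply, which is the behaviour Graham relies on to keep false positives low.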
A Bayesian filter works by comparing your incoming email with a database of emails, which are categorised into 'spam' and 'not spam'. Bayes' theorem is used to learn from these prior messages. The filter can then calculate a spam probability score for each new message entering your inbox.
Naive Bayes classifiers work by correlating the use of tokens (typically words, sometimes other features) with spam and non-spam emails, then using Bayes' theorem to calculate the probability that an email is or is not spam.
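That correlate-then-score loop can be sketched end to end. The toy corpus, whitespace tokenisation, and simplified per-token estimate below are my own illustrative assumptions, not the CodeProject code:

```python
from collections import Counter
from math import prod

# Toy training corpora (made up)
spam = ["buy cheap pills now", "cheap pills cheap"]
ham = ["meeting notes for monday", "notes on the project plan"]

spam_counts = Counter(t for msg in spam for t in msg.split())
ham_counts = Counter(t for msg in ham for t in msg.split())

def token_prob(t):
    # Per-message token frequency in each corpus, clamped away from 0 and 1
    b = spam_counts[t] / len(spam)
    g = ham_counts[t] / len(ham)
    if b + g == 0:
        return 0.4  # Graham's default for never-seen tokens
    return max(0.01, min(0.99, b / (b + g)))

def spam_score(message):
    probs = [token_prob(t) for t in message.split()]
    s, h = prod(probs), prod(1 - p for p in probs)
    return s / (s + h)
```

With this sketch, `spam_score("cheap pills")` lands near 1 and `spam_score("meeting notes")` near 0, because each token was only ever seen in one corpus.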
This EACL paper by Karl-Michael Schneider (PDF) says you should use the multinomial model, i.e. the total token count, when calculating the probability. See the paper for the exact calculations.
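A minimal sketch of multinomial scoring, assuming standard Laplace smoothing; the function and variable names are mine, and the counts are made up. The key point is that the per-token estimate divides by the class's *total* token count:

```python
from math import log

def multinomial_log_score(doc_tokens, class_token_counts, class_total, vocab_size, prior):
    """Log-probability of a document under the multinomial model.
    Per-token estimates use the class's total token count (class_total),
    not its unique-token count or message count, with Laplace smoothing."""
    score = log(prior)
    for t in doc_tokens:
        count = class_token_counts.get(t, 0)
        score += log((count + 1) / (class_total + vocab_size))
    return score

# Score one message against both classes; classify as the higher log-score.
doc = ["cheap", "pills"]
score_spam = multinomial_log_score(doc, {"cheap": 30, "pills": 20}, 100, 50, 0.5)
score_ham = multinomial_log_score(doc, {"meeting": 40}, 100, 50, 0.5)
```

Working in log space avoids the numerical underflow you get from multiplying many small probabilities directly.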