Combining individual probabilities in Naive Bayesian spam filtering

I'm currently trying to generate a spam filter by analyzing a corpus I've amassed.

I'm using the wikipedia entry http://en.wikipedia.org/wiki/Bayesian_spam_filtering to develop my classification code.

I've implemented code to calculate the probability that a message is spam, given that it contains a specific word, using the following formula from the wiki:

pr(S|W) = (pr(W|S)*pr(S))/(pr(W|S)*pr(S) + pr(W|H)*pr(H))

My PHP code:

public function pSpaminess($word)
{
    // pr(S|W) = pr(W|S)*pr(S) / (pr(W|S)*pr(S) + pr(W|H)*pr(H))
    $ps = $this->pContentIsSpam();
    $ph = $this->pContentIsHam();
    $pws = $this->pWordInSpam($word);
    $pwh = $this->pWordInHam($word);
    $psw = ($pws * $ps) / ($pws * $ps + $pwh * $ph);
    return $psw;
}

In accordance with the Combining individual probabilities section, I've implemented code to combine the probabilities of all the unique words in a test message to determine spaminess.

From the wiki formula:

p = (p1*p2*...*pn) / ((p1*p2*...*pn) + (1-p1)*(1-p2)*...*(1-pn))

My PHP code:

public function predict($content)
{
    $words = $this->tokenize($content);
    // Running products over all words: $pProducts accumulates p_i,
    // while $pSums (despite its name) accumulates (1 - p_i).
    $pProducts = 1;
    $pSums = 1;
    foreach($words as $word)
    {
        $p = $this->pSpaminess($word);
        echo "$word: $p\n";
        $pProducts *= $p;
        $pSums *= (1 - $p);
    }
    return $pProducts / ($pProducts + $pSums);
}

On a test string "This isn't very bad at all.", the following output is produced:

C:\projects\bayes>php test.php
this: 0.19907407407407
isn't: 0.23
very: 0.2
bad: 0.2906976744186
at: 0.17427385892116
all: 0.16098484848485
probability message is spam: float(0.00030795502523944)

Here's my question: am I combining the individual probabilities correctly? Assuming I'm generating valid individual word probabilities, is the combination method correct?

My concern is the extremely small probability the calculation produces. I've tested it on a larger message and ended up with a probability in scientific notation with more than 10 leading zeroes. I was expecting values in the tenths or hundredths place.

I'm hoping the problem lies in my PHP implementation, but when I examine the combining formula from Wikipedia, the numerator is a product of fractions; I don't see how a combination of many probabilities could ever end up above even 0.1%.

If it is the case that the longer the message, the lower the probability score, how do I adjust the spaminess threshold so it correctly predicts spam/ham for both short and long test cases?
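
One direction I'm considering to keep the product from vanishing is to accumulate logarithms instead of raw products. The hypothetical predictLogSpace() below is only a sketch, reusing the tokenize() and pSpaminess() methods shown above; it computes the same value as predict(), just rearranged so long messages don't underflow to zero:

public function predictLogSpace($content)
{
    $words = $this->tokenize($content);
    $eta = 0.0;
    foreach($words as $word)
    {
        $p = $this->pSpaminess($word);
        // Clamp so a word seen only in spam or only in ham can't produce log(0)
        $p = min(max($p, 1e-10), 1 - 1e-10);
        // Accumulate ln(1 - p_i) - ln(p_i) instead of multiplying raw probabilities
        $eta += log(1 - $p) - log($p);
    }
    // Algebraically identical to pProducts / (pProducts + pSums)
    return 1 / (1 + exp($eta));
}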


Additional Info

My corpus is actually a collection of about 40k reddit comments, and I'm applying my "spam filter" against these comments. I rate an individual comment as spam or ham based on the number of down votes versus up votes: if up votes are less than down votes it is considered ham, otherwise spam (see the sketch just below).
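
In code, that labelling rule is nothing more than the following (a hypothetical helper, shown only to make the rule explicit; it is not part of the classifier itself):

function labelComment($upVotes, $downVotes)
{
    // As described above: fewer up votes than down votes => ham, otherwise spam
    return ($upVotes < $downVotes) ? 'ham' : 'spam';
}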

Now, because of the corpus type, it turns out there are actually few words that are used in spam more than in ham. I.e., here is a top-ten list of words that appear in spam more often than in ham.

+-----------+------------+-----------+
| word      | spam_count | ham_count |
+-----------+------------+-----------+
| krugman   |         30 |        27 |
| fetus     |       12.5 |       7.5 |
| boehner   |         12 |        10 |
| hatred    |       11.5 |       5.5 |
| scum      |         11 |        10 |
| reserve   |         11 |        10 |
| incapable |        8.5 |       6.5 |
| socalled  |        8.5 |       5.5 |
| jones     |        8.5 |       7.5 |
| orgasms   |        8.5 |       7.5 |
+-----------+------------+-----------+

On the contrary, most words are used in far greater abundance in ham than in spam. Take, for instance, my top-ten list of words with the highest spam counts.

+------+------------+-----------+
| word | spam_count | ham_count |
+------+------------+-----------+
| the  |       4884 |     17982 |
| to   |     4006.5 |   14658.5 |
| a    |     3770.5 |   14057.5 |
| of   |     3250.5 |   12102.5 |
| and  |       3130 |     11709 |
| is   |     3102.5 |   11032.5 |
| i    |     2987.5 |   10565.5 |
| that |     2953.5 |   10725.5 |
| it   |       2633 |      9639 |
| in   |     2593.5 |    9780.5 |
+------+------------+-----------+

As you can see, the frequency of spam usage is significantly lower than ham usage. In my corpus of 40k comments, 2,100 comments are considered spam.
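
For reference, assuming pContentIsSpam() and pContentIsHam() return the raw corpus-level class priors, those priors work out to:

pr(S) = 2100 / 40000 ≈ 0.0525
pr(H) = 37900 / 40000 ≈ 0.9475

With priors that lopsided, a word would need to be roughly 18 times more likely to appear in a spam comment than in a ham comment (0.9475 / 0.0525 ≈ 18) just for pSpaminess() to reach 0.5, which is consistent with nearly every word above scoring in the 0.2 range.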

As suggested below, a test phrase on a post considered spam rates as follows:

Phrase

Cops are losers in general. That's why they're cops.

Analysis:

C:\projects\bayes>php test.php
cops: 0.15833333333333
are: 0.2218958611482
losers: 0.44444444444444
in: 0.20959269435914
general: 0.19565217391304
that's: 0.22080730418068
why: 0.24539170506912
they're: 0.19264544456641
float(6.0865969793861E-5)

According to this, there is an extremely low probability that this is spam. However, if I were to now analyze a ham comment:

Phrase

Bill and TED's excellent venture?

Analysis

C:\projects\bayes>php test.php
bill: 0.19534050179211
and: 0.21093065570456
ted's: 1
excellent: 0.16091954022989
venture: 0.30434782608696
float(1)

Okay, this is interesting. I'm doing these examples as I'm composing this update, so this is the first time I've seen the result for this specific test case. I think my prediction is inverted: it's actually picking out the probability of ham instead of spam. This deserves validation.

New test on known ham.

Phrase

Complain about $174,000 salary being too little for self.  Complain about $50,000 a year too much for teachers.
Scumbag congressman.

Analysis

C:\projects\bayes>php test.php
complain: 0.19736842105263
about: 0.21896031561847
174: 0.044117647058824
000: 0.19665809768638
salary: 0.20786516853933
being: 0.22011494252874
too: 0.21003236245955
little: 0.21134020618557
for: 0.20980452359022
self: 0.21052631578947
50: 0.19245283018868
a: 0.21149315683195
year: 0.21035386631717
much: 0.20139771283355
teachers: 0.21969696969697
scumbag: 0.22727272727273
congressman: 0.27678571428571
float(3.9604152477223E-11)

Unfortunately, no. It turns out that was a coincidental result. I'm starting to wonder if comments can't be so easily quantified. Perhaps the nature of a bad comment is too vastly different from the nature of a spam message.

Perhaps spam filtering only works when you have a specific word class of spam messages?


Final Update

As pointed out in the replies, the weird results were due to the nature of the corpus. Using a comment corpus where there is no explicit definition of spam, Bayesian classification does not perform well. Since it is possible (and likely) that any one comment may receive both spam and ham ratings from various users, it is not possible to generate a hard classification for spam comments.

Ultimately, I wanted to generate a comment classifier that could determine whether a comment post would garner karma, based on a Bayesian classification tuned to comment content. I may still investigate tuning the classifier on email spam messages and see whether such a classifier can guess at the karma response for comment systems. But for now, the question is answered. Thank you all for your input.

Asked Jun 24 '11 by Jeremy Giberson


1 Answer

Verifying with only a calculator, it seems OK for the non-spam phrase you posted. In that case you have $pProducts a couple of orders of magnitude smaller than $pSums.

Try running some real spam from your spam folder, where you'd see probabilities like 0.8. And guess why spammers sometimes try to send a piece of newspaper text in a hidden frame along with the message :)
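
To make that concrete (rough numbers only, taking eight words that each score around 0.2, as in the "cops" phrase):

p = 0.2^8 / (0.2^8 + 0.8^8) ≈ 2.56e-6 / (2.56e-6 + 0.168) ≈ 1.5e-5

whereas eight words each scoring around 0.8 flip the same formula to roughly 0.99998. The combined score only climbs once individual words score above 0.5.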

Answered Oct 08 '22 by meteor