Combining individual probabilities in Naive Bayesian spam filtering

I'm currently trying to generate a spam filter by analyzing a corpus I've amassed.

I'm using the wikipedia entry http://en.wikipedia.org/wiki/Bayesian_spam_filtering to develop my classification code.

I've implemented code to calculate the probability that a message is spam, given that it contains a specific word, using the following formula from the wiki:

pr(S|W) = (pr(W|S)*pr(S))/(pr(W|S)*pr(S) + pr(W|H)*pr(H))

My PHP code:

public function pSpaminess($word)
{
    // pr(S|W) = pr(W|S)*pr(S) / (pr(W|S)*pr(S) + pr(W|H)*pr(H))
    $ps = $this->pContentIsSpam();
    $ph = $this->pContentIsHam();
    $pws = $this->pWordInSpam($word);
    $pwh = $this->pWordInHam($word);
    $psw = ($pws * $ps) / ($pws * $ps + $pwh * $ph);
    return $psw;
}

In accordance with the Combining individual probabilities section, I've implemented code to combine the probabilities of all the unique words in a test message to determine spaminess.

From the wiki formula:

p = (p1*p2*...*pn) / ((p1*p2*...*pn) + (1-p1)*(1-p2)*...*(1-pn))

My PHP code:

public function predict($content)
{
    $words = $this->tokenize($content);
    // Running products over all words: $pProducts accumulates p_i,
    // while $pSums (despite its name) accumulates (1 - p_i).
    $pProducts = 1;
    $pSums = 1;
    foreach($words as $word)
    {
        $p = $this->pSpaminess($word);
        echo "$word: $p\n";
        $pProducts *= $p;
        $pSums *= (1 - $p);
    }
    return $pProducts / ($pProducts + $pSums);
}

On a test string "This isn't very bad at all.", the following output is produced:

C:\projects\bayes>php test.php
this: 0.19907407407407
isn't: 0.23
very: 0.2
bad: 0.2906976744186
at: 0.17427385892116
all: 0.16098484848485
probability message is spam: float(0.00030795502523944)

Here's my question: am I combining the individual probabilities correctly? Assuming I'm generating valid individual word probabilities, is the combination method correct?

My concern is the extremely small probability the calculation produces. I've tested it on a larger message and ended up with a probability in scientific notation with more than 10 leading zeroes. I was expecting values in the tenths or hundredths place.

I'm hoping the problem lies in my PHP implementation, but when I examine the combining formula from Wikipedia, the numerator is a product of fractions; I don't see how a combination of many probabilities could ever end up above even 0.1%.

If it is the case that the longer the message, the lower the probability score, how do I adjust the spaminess threshold so it correctly predicts spam/ham for both short and long test cases?
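
One direction I'm considering to keep the product from vanishing is to accumulate logarithms instead of raw products. The hypothetical predictLogSpace() below is only a sketch, reusing the tokenize() and pSpaminess() methods shown above; it computes the same value as predict(), just rearranged so long messages don't underflow to zero:

public function predictLogSpace($content)
{
    $words = $this->tokenize($content);
    $eta = 0.0;
    foreach($words as $word)
    {
        $p = $this->pSpaminess($word);
        // Clamp so a word seen only in spam or only in ham can't produce log(0)
        $p = min(max($p, 1e-10), 1 - 1e-10);
        // Accumulate ln(1 - p_i) - ln(p_i) instead of multiplying raw probabilities
        $eta += log(1 - $p) - log($p);
    }
    // Algebraically identical to pProducts / (pProducts + pSums)
    return 1 / (1 + exp($eta));
}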


Additional Info

My corpus is actually a collection of about 40k reddit comments, and I'm applying my "spam filter" against these comments. I rate an individual comment as spam or ham based on the number of down votes versus up votes: if up votes are less than down votes it is considered ham, otherwise spam (see the sketch just below).
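
In code, that labelling rule is nothing more than the following (a hypothetical helper, shown only to make the rule explicit; it is not part of the classifier itself):

function labelComment($upVotes, $downVotes)
{
    // As described above: fewer up votes than down votes => ham, otherwise spam
    return ($upVotes < $downVotes) ? 'ham' : 'spam';
}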

Now, because of the corpus type, it turns out there are actually few words that are used in spam more than in ham. I.e., here is a top-ten list of words that appear in spam more often than in ham.

+-----------+------------+-----------+
| word      | spam_count | ham_count |
+-----------+------------+-----------+
| krugman   |         30 |        27 |
| fetus     |       12.5 |       7.5 |
| boehner   |         12 |        10 |
| hatred    |       11.5 |       5.5 |
| scum      |         11 |        10 |
| reserve   |         11 |        10 |
| incapable |        8.5 |       6.5 |
| socalled  |        8.5 |       5.5 |
| jones     |        8.5 |       7.5 |
| orgasms   |        8.5 |       7.5 |
+-----------+------------+-----------+

On the contrary, most words are used in far greater abundance in ham than in spam. Take, for instance, my top-ten list of words with the highest spam counts.

+------+------------+-----------+
| word | spam_count | ham_count |
+------+------------+-----------+
| the  |       4884 |     17982 |
| to   |     4006.5 |   14658.5 |
| a    |     3770.5 |   14057.5 |
| of   |     3250.5 |   12102.5 |
| and  |       3130 |     11709 |
| is   |     3102.5 |   11032.5 |
| i    |     2987.5 |   10565.5 |
| that |     2953.5 |   10725.5 |
| it   |       2633 |      9639 |
| in   |     2593.5 |    9780.5 |
+------+------------+-----------+

As you can see, the frequency of spam usage is significantly lower than ham usage. In my corpus of 40k comments, 2,100 comments are considered spam.
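
For reference, assuming pContentIsSpam() and pContentIsHam() return the raw corpus-level class priors, those priors work out to:

pr(S) = 2100 / 40000 ≈ 0.0525
pr(H) = 37900 / 40000 ≈ 0.9475

With priors that lopsided, a word would need to be roughly 18 times more likely to appear in a spam comment than in a ham comment (0.9475 / 0.0525 ≈ 18) just for pSpaminess() to reach 0.5, which is consistent with nearly every word above scoring in the 0.2 range.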

As suggested below, a test phrase on a post considered spam rates as follows:

Phrase

Cops are losers in general. That's why they're cops.

Analysis:

C:\projects\bayes>php test.php
cops: 0.15833333333333
are: 0.2218958611482
losers: 0.44444444444444
in: 0.20959269435914
general: 0.19565217391304
that's: 0.22080730418068
why: 0.24539170506912
they're: 0.19264544456641
float(6.0865969793861E-5)

According to this, there is an extremely low probability that this is spam. However, if I were to now analyze a ham comment:

Phrase

Bill and TED's excellent venture?

Analysis

C:\projects\bayes>php test.php
bill: 0.19534050179211
and: 0.21093065570456
ted's: 1
excellent: 0.16091954022989
venture: 0.30434782608696
float(1)

Okay, this is interesting. I'm doing these examples as I'm composing this update, so this is the first time I've seen the result for this specific test case. I think my prediction is inverted: it's actually picking out the probability of ham instead of spam. This deserves validation.

New test on known ham.

Phrase

Complain about $174,000 salary being too little for self.  Complain about $50,000 a year too much for teachers.
Scumbag congressman.

Analysis

C:\projects\bayes>php test.php
complain: 0.19736842105263
about: 0.21896031561847
174: 0.044117647058824
000: 0.19665809768638
salary: 0.20786516853933
being: 0.22011494252874
too: 0.21003236245955
little: 0.21134020618557
for: 0.20980452359022
self: 0.21052631578947
50: 0.19245283018868
a: 0.21149315683195
year: 0.21035386631717
much: 0.20139771283355
teachers: 0.21969696969697
scumbag: 0.22727272727273
congressman: 0.27678571428571
float(3.9604152477223E-11)

Unfortunately, no. It turns out that was a coincidental result. I'm starting to wonder if comments can't be so easily quantified. Perhaps the nature of a bad comment is too vastly different from the nature of a spam message.

Perhaps spam filtering only works when you have a specific word class of spam messages?


Final Update

As pointed out in the replies, the weird results were due to the nature of the corpus. Using a comment corpus where there is no explicit definition of spam, Bayesian classification does not perform well. Since it is possible (and likely) that any one comment may receive both spam and ham ratings from various users, it is not possible to generate a hard classification for spam comments.

Ultimately, I wanted to generate a comment classifier that could determine whether a comment post would garner karma, based on a Bayesian classification tuned to comment content. I may still investigate tuning the classifier on email spam messages and see whether such a classifier can guess at the karma response for comment systems. But for now, the question is answered. Thank you all for your input.

Asked Jun 24 '11 by Jeremy Giberson


1 Answer

Verifying with only a calculator, it seems OK for the non-spam phrase you posted. In that case you have $pProducts a couple of orders of magnitude smaller than $pSums.

Try running some real spam from your spam folder, where you'd see probabilities like 0.8. And guess why spammers sometimes try to send a piece of newspaper text in a hidden frame along with the message :)
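
To make that concrete (rough numbers only, taking eight words that each score around 0.2, as in the "cops" phrase):

p = 0.2^8 / (0.2^8 + 0.8^8) ≈ 2.56e-6 / (2.56e-6 + 0.168) ≈ 1.5e-5

whereas eight words each scoring around 0.8 flip the same formula to roughly 0.99998. The combined score only climbs once individual words score above 0.5.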

Answered Oct 08 '22 by meteor