I recently wrote a Bayesian spam filter, using Paul Graham's article "A Plan for Spam" and a C# implementation of it that I found on CodeProject as references.
I just noticed that the CodeProject implementation uses the total number of unique tokens when calculating the probability of a token being spam (e.g. if the ham corpus contains 10,000 tokens in total but only 1,500 unique tokens, the 1,500 is used as ngood), whereas my implementation uses the number of messages, as described in Paul Graham's article. Which of these is the better choice for calculating the probability?
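To make the difference concrete, here is a small Python sketch of Graham's per-token probability formula from "A Plan for Spam", run with both denominator conventions. All counts are made-up illustration values, and the function name is my own:

```python
def token_spam_prob(b, g, nbad, ngood):
    """Graham's per-token spam probability.
    b/g: occurrences of the token in the spam/ham corpus.
    nbad/ngood: the denominators under comparison (messages vs unique tokens)."""
    g = 2 * g  # Graham doubles ham counts to bias against false positives
    bad_freq = min(1.0, b / nbad)
    good_freq = min(1.0, g / ngood)
    # Clamp to (0.01, 0.99) as in the article
    return max(0.01, min(0.99, bad_freq / (bad_freq + good_freq)))

# Same token counts, two denominator conventions:
# Graham's article: number of messages in each corpus
p_messages = token_spam_prob(b=20, g=5, nbad=400, ngood=600)   # -> 0.75
# CodeProject variant: number of unique tokens in each corpus
p_unique = token_spam_prob(b=20, g=5, nbad=3000, ngood=1500)   # -> 0.5
```

As the example shows, the two conventions can produce noticeably different probabilities for the same token counts, which is why the choice matters.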
This computes the prior probability of any one email being spam by dividing the total number of spam emails by the total number of emails.
Because the spam filter uses a Bayesian approach, we can combine these by multiplying the spam probabilities of every word together, then dividing that product by the sum of itself and the product of each word's probability of not indicating spam.
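That combining step is Graham's rule P = Πp / (Πp + Π(1−p)). A minimal Python sketch (the per-token probabilities here are arbitrary example values):

```python
from math import prod

def combined_spam_prob(token_probs):
    """Graham's combining rule: P = prod(p_i) / (prod(p_i) + prod(1 - p_i))."""
    s = prod(token_probs)              # combined probability of being spam
    h = prod(1.0 - p for p in token_probs)  # combined probability of not being spam
    return s / (s + h)

score = combined_spam_prob([0.9, 0.8, 0.3])  # -> roughly 0.939
```

Note that one strongly "hammy" token (a probability near 0) can pull the whole score down sharply, which is the behaviour Graham relies on to keep false positives low.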
A Bayesian filter works by comparing your incoming email with a database of emails, which are categorised into 'spam' and 'not spam'. Bayes' theorem is used to learn from these prior messages. The filter can then calculate a spam probability score for each new message entering your inbox.
Naive Bayes classifiers work by correlating the use of tokens (typically words, sometimes other features) with spam and non-spam emails, then using Bayes' theorem to calculate the probability that an email is or is not spam.
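That correlate-then-score loop can be sketched end to end. The toy corpus, whitespace tokenisation, and simplified per-token estimate below are my own illustrative assumptions, not the CodeProject code:

```python
from collections import Counter
from math import prod

# Toy training corpora (made up)
spam = ["buy cheap pills now", "cheap pills cheap"]
ham = ["meeting notes for monday", "notes on the project plan"]

spam_counts = Counter(t for msg in spam for t in msg.split())
ham_counts = Counter(t for msg in ham for t in msg.split())

def token_prob(t):
    # Per-message token frequency in each corpus, clamped away from 0 and 1
    b = spam_counts[t] / len(spam)
    g = ham_counts[t] / len(ham)
    if b + g == 0:
        return 0.4  # Graham's default for never-seen tokens
    return max(0.01, min(0.99, b / (b + g)))

def spam_score(message):
    probs = [token_prob(t) for t in message.split()]
    s, h = prod(probs), prod(1 - p for p in probs)
    return s / (s + h)
```

With this sketch, `spam_score("cheap pills")` lands near 1 and `spam_score("meeting notes")` near 0, because each token was only ever seen in one corpus.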
This EACL paper by Karl-Michael Schneider (PDF) says you should use the multinomial model, i.e. the total token count, when calculating the probability. See the paper for the exact calculations.
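A minimal sketch of multinomial scoring, assuming standard Laplace smoothing; the function and variable names are mine, and the counts are made up. The key point is that the per-token estimate divides by the class's *total* token count:

```python
from math import log

def multinomial_log_score(doc_tokens, class_token_counts, class_total, vocab_size, prior):
    """Log-probability of a document under the multinomial model.
    Per-token estimates use the class's total token count (class_total),
    not its unique-token count or message count, with Laplace smoothing."""
    score = log(prior)
    for t in doc_tokens:
        count = class_token_counts.get(t, 0)
        score += log((count + 1) / (class_total + vocab_size))
    return score

# Score one message against both classes; classify as the higher log-score.
doc = ["cheap", "pills"]
score_spam = multinomial_log_score(doc, {"cheap": 30, "pills": 20}, 100, 50, 0.5)
score_ham = multinomial_log_score(doc, {"meeting": 40}, 100, 50, 0.5)
```

Working in log space avoids the numerical underflow you get from multiplying many small probabilities directly.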