
Publicly Available Spam Filter Training Set [closed]

I'm new to machine learning, and for my first project I'd like to write a naive Bayes spam filter. I was wondering if there are any publicly available training sets of labeled spam/not spam emails, preferably in plain text and not a dump of a relational database (unless they pretty-print those?).

I know such a publicly available database exists for other kinds of text classification, specifically news article text. I just haven't been able to find the same sort of thing for emails.

JeremyKun asked Jan 20 '11 06:01



1 Answer

Here is what I was looking for: http://untroubled.org/spam/

This archive holds roughly a gigabyte of compressed spam messages collected from 1998 to 2011. Now I just need non-spam email, so I'll pull it from my own Gmail account using the getmail program and the tutorial at mattcutts.com.
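For anyone else starting from the same place, here is a minimal sketch of how labeled messages from the two sources could feed a naive Bayes filter. It is an illustration, not a tuned implementation. It assumes each message is available as a raw RFC 822 string (how you unpack the archive or your mailbox into strings is up to you), that the labels are the strings "spam" and "ham", and that the training data contains at least one message of each. The function names `extract_text`, `tokenize`, `train`, and `classify` are my own, not from any library.

```python
import math
import re
from collections import Counter
from email import message_from_string


def extract_text(raw_message):
    """Return the subject plus any text/plain parts of a raw RFC 822 message."""
    msg = message_from_string(raw_message)
    parts = [msg.get("Subject", "") or ""]
    for part in msg.walk():
        if part.get_content_type() == "text/plain":
            payload = part.get_payload(decode=True)  # bytes, or None for containers
            if payload is not None:
                parts.append(payload.decode("utf-8", errors="replace"))
    return "\n".join(parts)


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def train(labeled_texts):
    """Count tokens and documents per label from (label, text) pairs."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    doc_counts = Counter()
    for label, text in labeled_texts:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    return word_counts, doc_counts


def classify(text, word_counts, doc_counts):
    """Return the label with the higher log-probability (add-one smoothing)."""
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    total_docs = sum(doc_counts.values())
    tokens = [t for t in tokenize(text) if t in vocab]  # skip unseen words
    scores = {}
    for label in ("spam", "ham"):
        total_words = sum(word_counts[label].values())
        # Log prior; assumes each label has at least one training document.
        score = math.log(doc_counts[label] / total_docs)
        for token in tokens:
            count = word_counts[label][token]
            score += math.log((count + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)
```

To use it, build a list of `(label, extract_text(raw))` pairs from the spam archive and from your own inbox, pass it to `train`, and then call `classify` on the text of a new message. Working in log space avoids underflow on long messages, and add-one smoothing keeps a word seen in only one class from driving the other class's probability to zero.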

JeremyKun answered Nov 23 '22 02:11