
Which spam corpus can I use in NLTK?

My question is closely related to this one, but I decided to open a separate thread. I hope that is fine.

I am also building a spam filter in Python using NLTK, but I have only just started.

Which spam corpus can I use, and how do I import it? I have not found any spam corpora built into NLTK (here).

Thank you in advance.

Lain asked Mar 26 '12 17:03



2 Answers

This presentation uses the Enron-Spam dataset (200,000+ emails):

The training and testing sets come from a dataset of 200,000+ Enron emails which contain both “spam” and “ham” emails

Franck Dernoncourt answered Sep 20 '22 20:09



Spam is not hard to obtain. Reasonably fresh spam in large quantities is not necessarily a big challenge, either; the big conundrum is how to obtain ham. If you are only building your own spam filter, of course, you can use your own ham.

The SpamAssassin Public Corpus is getting very old, but it is available: http://spamassassin.apache.org/publiccorpus/

There are also the corpora from the TREC spam track, which are somewhat larger, but not much newer or less biased: http://plg.uwaterloo.ca/~gvcormac/treccorpus/
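
Since the question also asks how to import such a corpus, here is a minimal sketch using only the Python standard library. It assumes you have already downloaded and unpacked an archive so that each label has its own subdirectory (for example `ham/` and `spam/`) with one raw email per file. That layout, the function name `load_labeled_messages`, and the toy messages in `demo` are my own assumptions for illustration; the real archives may unpack differently.

```python
import email
import tempfile
from email import policy
from pathlib import Path


def load_labeled_messages(root):
    """Yield (label, subject, body_text) for each file under root/<label>/.

    Assumes a hypothetical layout with one subdirectory per label,
    e.g. root/ham/ and root/spam/, holding one raw email per file.
    """
    for label_dir in sorted(Path(root).iterdir()):
        if not label_dir.is_dir():
            continue
        for path in sorted(label_dir.iterdir()):
            if not path.is_file():
                continue
            # Parse from bytes, because old corpora mix character encodings.
            msg = email.message_from_bytes(path.read_bytes(), policy=policy.default)
            body = msg.get_body(preferencelist=("plain", "html"))
            try:
                text = body.get_content() if body is not None else ""
            except (LookupError, UnicodeError):
                # Unknown or broken charset declaration: skip the body text.
                text = ""
            yield label_dir.name, str(msg["subject"] or ""), text


def demo():
    """Build a tiny throwaway corpus and load it back (toy data only)."""
    samples = [
        ("ham", "Lunch", "See you at noon."),
        ("spam", "WIN", "Claim your free prize now!"),
    ]
    with tempfile.TemporaryDirectory() as root:
        for label, subject, body in samples:
            label_dir = Path(root) / label
            label_dir.mkdir()
            raw = f"Subject: {subject}\n\n{body}\n".encode("ascii")
            (label_dir / "0001.txt").write_bytes(raw)
        return list(load_labeled_messages(root))


if __name__ == "__main__":
    for label, subject, text in demo():
        print(label, subject, text.strip())
```

To use it on a real download, point `load_labeled_messages` at the directory that contains the label folders instead of calling `demo`.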

Various enthusiasts continue to publish their spam on the web, but most fail to include full headers etc. If you are only interested in "bag of words" filtering, maybe that's good enough.
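
For the "bag of words" case, a rough sketch of turning message text into word-presence features might look like the following. The tokenizer regex, the function name `bag_of_words`, and the two toy messages are assumptions for illustration, not part of any corpus.

```python
import re

# Hypothetical tokenizer: lowercase runs of letters, digits and apostrophes.
TOKEN_RE = re.compile(r"[a-z0-9']+")


def bag_of_words(text):
    """Map each distinct token in the text to True (word order is discarded)."""
    return {token: True for token in TOKEN_RE.findall(text.lower())}


# Toy labeled messages standing in for a real corpus.
labeled = [
    ("ham", "See you at noon."),
    ("spam", "Claim your FREE prize now!"),
]

# A list of (feature dict, label) pairs.
featuresets = [(bag_of_words(text), label) for label, text in labeled]
```

A list of (feature dict, label) pairs like `featuresets` is the input shape that `nltk.NaiveBayesClassifier.train` accepts, so this is one way to connect a downloaded corpus to NLTK.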

tripleee answered Sep 19 '22 20:09
