Which spam corpus can I use in NLTK?

Question

My question is closely related to this one, but I decided to open a separate thread. I hope that is fine.

I am also building a spam filter in Python using NLTK, but I have only just started.

Which spam corpus can I use, and how do I import it? I have not found any spam corpora built into NLTK (here).

Thank you in advance.

Franck Dernoncourt · Accepted Answer

This presentation uses the Enron-Spam dataset (200,000+ emails):

“The training and testing sets come from a dataset of 200,000+ Enron emails which contain both ‘spam’ and ‘ham’ emails.”
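As for how to import it: a minimal sketch, assuming the dataset has been downloaded and unpacked into a folder containing `ham/` and `spam/` subdirectories with one message per text file. That layout, the function name `load_labeled_emails`, and the `latin-1` decoding are assumptions for illustration, not something the presentation specifies.

```python
from pathlib import Path


def load_labeled_emails(root):
    """Return a list of (text, label) pairs read from root/ham and root/spam."""
    docs = []
    for label in ("ham", "spam"):
        folder = Path(root) / label
        if not folder.is_dir():
            continue  # tolerate a missing class folder
        for path in sorted(folder.iterdir()):
            if path.is_file():
                # Mail corpora mix encodings; latin-1 decodes any byte sequence.
                docs.append((path.read_text(encoding="latin-1"), label))
    return docs
```

NLTK's `CategorizedPlaintextCorpusReader` can probably read the same layout directly by deriving the category from the subdirectory name with its `cat_pattern` argument; the plain-Python loader above only keeps the example free of dependencies.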

tripleee · Answer

Spam is not hard to obtain. Reasonably fresh spam in large quantities is not necessarily a big challenge, either; the big conundrum is how to obtain ham. If you are only building your own spam filter, of course, you can use your own ham.

The SpamAssassin Public Corpus is getting very old, but it is available: http://spamassassin.apache.org/publiccorpus/

There are also the corpora from the TREC spam track, which are somewhat larger, but not much newer or less biased: http://plg.uwaterloo.ca/~gvcormac/treccorpus/
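If the messages in these corpora are stored as raw mail with full headers (an assumption about the download format), Python's standard `email` package can separate headers from the body before any text processing. The helper name `split_headers_and_body` is hypothetical, and real spam is often malformed, so treat this as a sketch rather than a robust parser.

```python
from email import message_from_string
from email.policy import default


def split_headers_and_body(raw_message):
    """Parse one raw message and return (headers dict, body text)."""
    msg = message_from_string(raw_message, policy=default)
    # Prefer the plain-text part and fall back to HTML if that is all there is.
    part = msg.get_body(preferencelist=("plain", "html"))
    # get_content() can raise on broken charsets; callers may want to catch that.
    body = part.get_content() if part is not None else ""
    # Repeated headers (e.g. Received) collapse to their last value in the dict.
    return dict(msg.items()), body
```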

Various enthusiasts continue to publish their spam on the web, but most fail to include full headers etc. If you are only interested in "bag of words" filtering, maybe that's good enough.
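For the "bag of words" case, a minimal sketch of turning message text into the feature dictionaries that `nltk.NaiveBayesClassifier.train` accepts. The tokenizer regex and the helper names `bag_of_words` and `train_spam_classifier` are illustrative assumptions; NLTK is imported lazily so the feature extraction runs without it.

```python
import re

# Assumption: lowercase runs of letters, digits, and apostrophes count as words.
TOKEN_RE = re.compile(r"[a-z0-9']+")


def bag_of_words(text):
    """Map each distinct lowercase token in the text to True."""
    return {token: True for token in TOKEN_RE.findall(text.lower())}


def train_spam_classifier(labeled_texts):
    """Train NLTK's Naive Bayes classifier on (text, label) pairs."""
    import nltk  # third-party dependency, only needed for training

    train_set = [(bag_of_words(text), label) for text, label in labeled_texts]
    return nltk.NaiveBayesClassifier.train(train_set)
```

A trained classifier can then label a new message with `classifier.classify(bag_of_words(text))`.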

Tags: python, nltk, spam-prevention, corpus

Lain
