Let's say you have access to an email account with the history of received emails from the last years (~10k emails) classified into 2 groups
How would you approach the task of creating a neural network solution that could be used for spam detection - basically classifying any email either as spam or not spam?
Let's assume that the email fetching is already in place and we need to focus on classification part only.
The main points which I would hope to get answered would be:
Also any resource recommendations, or existing implementations (preferably in C#) are more than welcome
Thank you
EDIT
SVM algorithms are very potent for the identification of patterns and classifying them into a specific class or group. They can be easily trained and according to some researchers, they outperform many of the popular email spam classification methods [130,131].
Several machine learning algorithms have been used in spam e-mail filtering, but Naıve Bayes algorithm is particularly popular in commercial and open-source spam filters [2]. This is because of its simplicity, which make them easy to implement and just need short training time or fast evaluation to filter email spam.
An email server detects spam by using spam filter software which evaluates incoming emails on a number of criteria. (Yes, you can run an email server without having spam filter software enabled - you'd just see any and all spam email.)
Heuristic or Rule-Based Spam Filtering Technique Algorithms use pre-defined rules in the form of a regular expression to give a score to the messages present in the e-mails. Based on the scores generated, they segregate emails into spam and non-spam categories.
If you insist on NNs... I would calculate some features for every email
Both Character-Based, Word-based, and Vocabulary features (About 97 as I count these):
You could also add some more features based on the formatting: colors, fonts, sizes, ... used.
Most of these measures can be found online, in papers, or even Wikipedia (they're all simple calculations, probably based on the other features).
So with about 100 features, you need 100 inputs, some number of nodes in a hidden layer, and one output node.
The inputs would need to be normalized according to your current pre-classified corpus.
I'd split it into two groups, use one as a training group, and the other as a testing group, never mixing them. Maybe at a 50/50 ratio of train/test groups with similar spam/nonspam ratios.
Are you set on doing it with a Neural Network? It sounds like you're set up pretty well to use Bayesian classification, which is outlined well in a couple of essays by Paul Graham:
The classified history you have access to would make very strong corpora to feed to a Bayesian algorithm, you'd probably end up with quite an effective result.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With