Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Neural networks for email spam detection

Let's say you have access to an email account with the history of received emails from the last years (~10k emails) classified into 2 groups

  • genuine email
  • spam

How would you approach the task of creating a neural network solution that could be used for spam detection - basically classifying any email either as spam or not spam?

Let's assume that the email fetching is already in place and we need to focus on classification part only.

The main points which I would hope to get answered would be:

  1. Which parameters to choose as the input for the NN, and why?
  2. What structure of the NN would most likely work best for such task?

Also any resource recommendations, or existing implementations (preferably in C#) are more than welcome

Thank you

EDIT

  • I am set on using neural networks as the main aspect on the project is to test how the NN approach would work for spam detection
  • Also it is a "toy problem" simply to explore subject on neural networks and spam
like image 566
kristof Avatar asked Apr 20 '09 21:04

kristof


People also ask

Which algorithm is used for email spam detection?

SVM algorithms are very potent for the identification of patterns and classifying them into a specific class or group. They can be easily trained and according to some researchers, they outperform many of the popular email spam classification methods [130,131].

What is the best algorithm for spam filtering?

Several machine learning algorithms have been used in spam e-mail filtering, but Naıve Bayes algorithm is particularly popular in commercial and open-source spam filters [2]. This is because of its simplicity, which make them easy to implement and just need short training time or fast evaluation to filter email spam.

How is spam email detected?

An email server detects spam by using spam filter software which evaluates incoming emails on a number of criteria. (Yes, you can run an email server without having spam filter software enabled - you'd just see any and all spam email.)

What kind of algorithm would you use to distinguish between a spam email vs a non spam one?

Heuristic or Rule-Based Spam Filtering Technique Algorithms use pre-defined rules in the form of a regular expression to give a score to the messages present in the e-mails. Based on the scores generated, they segregate emails into spam and non-spam categories.


2 Answers

If you insist on NNs... I would calculate some features for every email

Both Character-Based, Word-based, and Vocabulary features (About 97 as I count these):

  1. Total no of characters (C)
  2. Total no of alpha chars / C Ratio of alpha chars
  3. Total no of digit chars / C
  4. Total no of whitespace chars/C
  5. Frequency of each letter / C (36 letters of the keyboard – A-Z, 0-9)
  6. Frequency of special chars (10 chars: *, _ ,+,=,%,$,@,ـ , \,/ )
  7. Total no of words (M)
  8. Total no of short words/M Two letters or less
  9. Total no of chars in words/C
  10. Average word length
  11. Avg. sentence length in chars
  12. Avg. sentence length in words
  13. Word length freq. distribution/M Ratio of words of length n, n between 1 and 15
  14. Type Token Ratio No. Of unique Words/ M
  15. Hapax Legomena Freq. of once-occurring words
  16. Hapax Dislegomena Freq. of twice-occurring words
  17. Yule’s K measure
  18. Simpson’s D measure
  19. Sichel’s S measure
  20. Brunet’s W measure
  21. Honore’s R measure
  22. Frequency of punctuation 18 punctuation chars: . ، ; ? ! : ( ) – “ « » < > [ ] { }

You could also add some more features based on the formatting: colors, fonts, sizes, ... used.

Most of these measures can be found online, in papers, or even Wikipedia (they're all simple calculations, probably based on the other features).

So with about 100 features, you need 100 inputs, some number of nodes in a hidden layer, and one output node.

The inputs would need to be normalized according to your current pre-classified corpus.

I'd split it into two groups, use one as a training group, and the other as a testing group, never mixing them. Maybe at a 50/50 ratio of train/test groups with similar spam/nonspam ratios.

like image 200
Osama Al-Maadeed Avatar answered Oct 03 '22 17:10

Osama Al-Maadeed


Are you set on doing it with a Neural Network? It sounds like you're set up pretty well to use Bayesian classification, which is outlined well in a couple of essays by Paul Graham:

  • A Plan for Spam
  • Better Bayesian Filtering

The classified history you have access to would make very strong corpora to feed to a Bayesian algorithm, you'd probably end up with quite an effective result.

like image 27
Chad Birch Avatar answered Oct 03 '22 17:10

Chad Birch