Let's say you have access to an email account with the history of received emails from the last few years (~10k emails), classified into 2 groups: <ul> <li>genuine email</li> <li>spam</li> </ul> How would you approach the task of creating a neural network solution for spam detection, that is, classifying any email as either spam or not spam? Let's assume that the email fetching is already in place and we need to focus on the classification part only. The main points I would hope to get answered are: <ol> <li>Which parameters to choose as the input for the NN, and why?</li> <li>What structure of the NN would most likely work best for such a task?</li> </ol> Also, any resource recommendations or existing implementations (preferably in C#) are more than welcome. Thank you. EDIT: <ul> <li>I am set on using neural networks, as the main aspect of the project is to test how the NN approach would work for spam detection.</li> <li>Also, it is a "toy problem", simply to explore the subject of neural networks and spam.</li> </ul>



Neural networks for email spam detection

2 Answers

If you insist on NNs, I would calculate a set of features for every email.

Character-based, word-based, and vocabulary features (about 97, as I count them):

1. Total number of characters (C)
2. Number of alphabetic characters / C (ratio of alphabetic characters)
3. Number of digit characters / C
4. Number of whitespace characters / C
5. Frequency of each letter / C (36 keyboard characters: A-Z, 0-9)
6. Frequency of special characters (10 characters: *, _, +, =, %, $, @, ـ, \, /)
7. Total number of words (M)
8. Number of short words / M (two letters or less)
9. Number of characters in words / C
10. Average word length
11. Average sentence length in characters
12. Average sentence length in words
13. Word length frequency distribution / M (ratio of words of length n, for n between 1 and 15)
14. Type-token ratio (number of unique words / M)
15. Hapax legomena (frequency of once-occurring words)
16. Hapax dislegomena (frequency of twice-occurring words)
17. Yule’s K measure
18. Simpson’s D measure
19. Sichel’s S measure
20. Brunet’s W measure
21. Honoré’s R measure
22. Frequency of punctuation (18 punctuation characters: . ، ; ? ! : ( ) – “ « » < > [ ] { })
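As a rough illustration of how a few of the character- and word-based features in this list could be computed, here is a minimal Python sketch. The function name `basic_features`, the tokenizing regex, and the sentence-splitting rule are all assumptions made for the example, not something the list prescribes; a real email corpus would need more careful tokenization.

```python
import re
import string

def basic_features(text):
    """Return a dict with a handful of character- and word-based features."""
    c = len(text)                                   # total number of characters (C)
    words = re.findall(r"[A-Za-z0-9']+", text)      # assumed definition of a "word"
    m = len(words)                                  # total number of words (M)
    # Assumed sentence rule: split on runs of '.', '!' or '?'.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]

    def ratio(count, total):
        # Guard against empty emails so we never divide by zero.
        return count / total if total else 0.0

    features = {
        "chars": c,
        "alpha_ratio": ratio(sum(ch.isalpha() for ch in text), c),
        "digit_ratio": ratio(sum(ch.isdigit() for ch in text), c),
        "space_ratio": ratio(sum(ch.isspace() for ch in text), c),
        "words": m,
        "short_word_ratio": ratio(sum(len(w) <= 2 for w in words), m),
        "avg_word_len": ratio(sum(len(w) for w in words), m),
        "avg_sentence_len_words": ratio(m, len(sentences)),
    }
    # Frequency of each letter and digit, case-insensitive (36 features).
    lowered = text.lower()
    for ch in string.ascii_lowercase + string.digits:
        features["freq_" + ch] = ratio(lowered.count(ch), c)
    return features
```

Calling `basic_features(email_body)` on each message gives one row of the input matrix; the remaining features in the list can be added to the same dict in the same way.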

You could also add more features based on the formatting used: colors, fonts, sizes, and so on.

Most of these measures can be found online, in papers, or even Wikipedia (they're all simple calculations, probably based on the other features).
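To show how simple these calculations are, here is a hedged Python sketch of some of the vocabulary measures. The formulas used are the commonly cited textbook definitions (Yule's K as 10^4 · (Σ i²·V_i − N) / N², Simpson's D as Σ V_i · (i/N) · ((i−1)/(N−1)), Sichel's S as V_2 / V), where V_i is the number of distinct words occurring exactly i times; check them against whichever paper you follow. Normalizing the hapax counts by the number of tokens is an assumption, and the function name `vocabulary_features` is made up for the example.

```python
from collections import Counter

def vocabulary_features(words):
    """Vocabulary-richness measures for a list of word tokens."""
    n = len(words)                                  # number of tokens (M in the list above)
    counts = Counter(w.lower() for w in words)
    v = len(counts)                                 # number of distinct words
    # spectrum[i] = number of distinct words that occur exactly i times
    spectrum = Counter(counts.values())
    if n < 2:
        # Too little text for these measures to be meaningful.
        return {"type_token_ratio": 0.0, "hapax_legomena": 0.0,
                "hapax_dislegomena": 0.0, "yule_k": 0.0,
                "simpson_d": 0.0, "sichel_s": 0.0}
    v1 = spectrum.get(1, 0)
    v2 = spectrum.get(2, 0)
    yule_k = 1e4 * (sum(i * i * vi for i, vi in spectrum.items()) - n) / (n * n)
    simpson_d = sum(vi * (i / n) * ((i - 1) / (n - 1)) for i, vi in spectrum.items())
    return {
        "type_token_ratio": v / n,
        "hapax_legomena": v1 / n,       # assumption: normalized by token count
        "hapax_dislegomena": v2 / n,    # assumption: normalized by token count
        "yule_k": yule_k,
        "simpson_d": simpson_d,
        "sichel_s": v2 / v,
    }
```

Brunet's W and Honoré's R follow the same pattern from N, V, and V_1, so they are left out here to keep the sketch short.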

So with about 100 features, you need 100 inputs, some number of nodes in a hidden layer, and one output node.
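The structure described here (one input per feature, a single hidden layer, one output node) can be sketched in plain Python as follows. This is only an illustration under assumptions of my own: the class name `TinyMLP`, the tanh hidden activation, the sigmoid output, the cross-entropy gradient, and plain stochastic gradient descent are choices made for the example, and the hidden-layer size is something you would have to tune. In practice you would use a library rather than hand-written loops.

```python
import math
import random

class TinyMLP:
    """n_inputs -> n_hidden (tanh) -> 1 output (sigmoid), trained with SGD."""

    def __init__(self, n_inputs, n_hidden, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs)]
                   for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
        self.b2 = 0.0

    def _forward(self, x):
        hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
                  for row, b in zip(self.w1, self.b1)]
        z = sum(w * h for w, h in zip(self.w2, hidden)) + self.b2
        z = max(-60.0, min(60.0, z))        # clamp to keep exp() well-behaved
        return hidden, 1.0 / (1.0 + math.exp(-z))

    def predict_proba(self, x):
        """Estimated probability that feature vector x is spam."""
        return self._forward(x)[1]

    def train(self, samples, labels, epochs=500, lr=0.1):
        for _ in range(epochs):
            for x, y in zip(samples, labels):
                hidden, p = self._forward(x)
                delta_out = p - y           # gradient of cross-entropy w.r.t. output pre-activation
                for j, h in enumerate(hidden):
                    # Backpropagate through tanh before touching w2[j].
                    delta_h = delta_out * self.w2[j] * (1.0 - h * h)
                    self.w2[j] -= lr * delta_out * h
                    self.b1[j] -= lr * delta_h
                    for i, xi in enumerate(x):
                        self.w1[j][i] -= lr * delta_h * xi
                self.b2 -= lr * delta_out
```

With about 100 features you would construct it as `TinyMLP(100, n_hidden)` and treat an output above 0.5 as spam; the threshold can be moved if false positives are more costly than false negatives.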

The inputs would need to be normalized according to your current pre-classified corpus.
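One common way to do this normalization is z-scoring: compute the mean and standard deviation of each feature on the pre-classified corpus, then reuse those same statistics for every new email. The sketch below assumes that approach; the helper names `fit_scaler` and `apply_scaler` are hypothetical, and min-max scaling would work along the same lines.

```python
import math

def fit_scaler(rows):
    """Compute per-feature mean and standard deviation from the corpus rows."""
    n = len(rows)
    columns = list(zip(*rows))
    means = [sum(col) / n for col in columns]
    stds = []
    for col, mean in zip(columns, means):
        var = sum((v - mean) ** 2 for v in col) / n
        # A constant feature has zero spread; use 1.0 to avoid dividing by zero.
        stds.append(math.sqrt(var) or 1.0)
    return means, stds

def apply_scaler(rows, means, stds):
    """Scale rows with statistics computed earlier by fit_scaler."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]
```

The statistics should come from the training portion only and then be applied unchanged to the test portion and to new mail, so that no information leaks from the test set into training.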

I'd split it into two groups, use one as a training group, and the other as a testing group, never mixing them. Maybe at a 50/50 ratio of train/test groups with similar spam/nonspam ratios.
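A split that keeps similar spam/non-spam ratios in both groups is usually called a stratified split. A minimal sketch, assuming the labels are already available as a list parallel to the samples (the function name `stratified_split` and its parameters are made up for the example):

```python
import random

def stratified_split(samples, labels, train_fraction=0.5, seed=0):
    """Split into train/test so each label keeps roughly the same share in both."""
    rng = random.Random(seed)
    by_label = {}
    for idx, label in enumerate(labels):
        by_label.setdefault(label, []).append(idx)

    train_idx, test_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)                        # shuffle within each class
        cut = round(len(indices) * train_fraction)
        train_idx.extend(indices[:cut])
        test_idx.extend(indices[cut:])

    train = [(samples[i], labels[i]) for i in train_idx]
    test = [(samples[i], labels[i]) for i in test_idx]
    return train, test
```

Every sample ends up in exactly one of the two groups, which is what keeps the test results honest.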


answered Oct 03 '22 17:10

Osama Al-Maadeed

Are you set on doing it with a Neural Network? It sounds like you're set up pretty well to use Bayesian classification, which is outlined well in a couple of essays by Paul Graham:

A Plan for Spam
Better Bayesian Filtering

The classified history you have access to would make a very strong corpus to feed to a Bayesian algorithm; you'd probably end up with quite an effective result.
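For comparison with the NN approach, here is a minimal sketch of a Bayesian word-based filter. Note that this is a generic multinomial naive Bayes classifier with add-one smoothing, not the exact token-scoring scheme from Paul Graham's essays; the class name `NaiveBayesSpamFilter`, the `"spam"`/`"ham"` labels, and the tokenizer are assumptions made for the example.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Assumed tokenizer: lowercase runs of letters, digits, '$' and apostrophes.
    return re.findall(r"[a-z0-9$']+", text.lower())

class NaiveBayesSpamFilter:
    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        """Add one pre-classified email; label is 'spam' or 'ham'."""
        self.word_counts[label].update(tokenize(text))
        self.doc_counts[label] += 1

    def spam_probability(self, text):
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_docs = sum(self.doc_counts.values())
        tokens = tokenize(text)
        log_score = {}
        for label in ("spam", "ham"):
            counts = self.word_counts[label]
            total_words = sum(counts.values())
            # Smoothed class prior, then smoothed per-token likelihoods.
            score = math.log((self.doc_counts[label] + 1) / (total_docs + 2))
            for token in tokens:
                # The extra +1 in the denominator reserves mass for unseen words.
                score += math.log((counts[token] + 1) / (total_words + len(vocab) + 1))
            log_score[label] = score
        # Convert the two log scores back into P(spam | text).
        diff = min(log_score["ham"] - log_score["spam"], 700.0)
        return 1.0 / (1.0 + math.exp(diff))
```

You would call `train` once per email in the classified history and then flag new mail whose `spam_probability` exceeds a threshold of your choosing.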

answered Oct 03 '22 17:10

Chad Birch


Tags:

machine-learning

neural-network

classification

spam-prevention

kristof
