Anomaly detection using Python [closed]

I work for a web host and my job is to find and clean up hacked accounts. I find a good 90% of shells/malware/injections by looking for files that are "out of place." For example, eval(base64_decode(.......)), where "......." is a long run of base64-encoded text, is almost never legitimate. Odd-looking files jump out at me as I grep through files for key strings.

If these files jump out at me as a human, I'm sure I can build some kind of profiler in Python that looks for things that are statistically "out of place" and flags them for manual review. To start off, I thought I could compare the length of lines in PHP files containing key strings (eval, base64_decode, exec, gunzip, gzinflate, fwrite, preg_replace, etc.) and look for lines that deviate from the average by more than 2 standard deviations.
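A minimal sketch of the line-length idea, using only the standard library. The keyword tuple, the function name `suspicious_line_lengths`, and the default threshold of 2 are assumptions made for illustration; plain substring matching is also a simplification, since it will match "exec" inside "curl_exec".

```python
import statistics

# Hypothetical keyword list, taken from the strings mentioned above.
KEYWORDS = ("eval", "base64_decode", "exec", "gzinflate", "fwrite", "preg_replace")


def suspicious_line_lengths(lines, threshold=2.0):
    """Return (line_number, length, z_score) for keyword lines whose length
    lies more than `threshold` standard deviations from the mean."""
    # Keep only lines that contain at least one key string.
    hits = [(n, len(line)) for n, line in enumerate(lines, start=1)
            if any(k in line for k in KEYWORDS)]
    lengths = [length for _, length in hits]
    if len(lengths) < 2:
        return []  # a standard deviation needs at least two samples
    mean = statistics.mean(lengths)
    stdev = statistics.stdev(lengths)
    if stdev == 0:
        return []  # every keyword line has the same length
    flagged = []
    for n, length in hits:
        z = (length - mean) / stdev
        if abs(z) > threshold:
            flagged.append((n, length, z))
    return flagged
```

With few samples a single outlier inflates the standard deviation it is measured against, so the statistics are probably better computed over many files than over one file at a time.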

Line length varies widely, though, and I'm not sure it would be a good statistic to use. Another approach would be to assign weighted rules to certain things (line length over or under a threshold = X points, contains the word upload = Y points), but I'm not sure what I can actually do with the scores or how to score each attribute. My statistics is a little rusty.
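For concreteness, the weighted-rule idea might look something like the sketch below. Every rule, threshold, and point value here is a placeholder invented for illustration, not a tuned or recommended weight.

```python
# Hypothetical rules: (description, predicate, points). The weights are
# placeholders chosen for illustration, not tuned values.
RULES = [
    ("very long line", lambda line: len(line) > 500, 3),
    ("contains eval(", lambda line: "eval(" in line, 2),
    ("contains base64_decode", lambda line: "base64_decode" in line, 2),
    ("mentions upload", lambda line: "upload" in line.lower(), 1),
]


def score_line(line):
    """Sum the points of every rule the line triggers."""
    return sum(points for _, predicate, points in RULES if predicate(line))


def score_file(lines):
    """Score a file by its single worst line, so one injected line is enough."""
    return max((score_line(line) for line in lines), default=0)
```

Files whose score exceeds some cutoff would go to manual review; one way to pick the cutoff is to compare the scores of files already known to be clean with those already known to be hacked.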

Could anyone point me in the right direction (guides, tutorials, libraries) for statistical profiling?

asked Jul 31 '11 21:07

Josh M

1 Answer

Here's a simple machine learning approach; it's what I'd do to get started on this problem and develop a baseline classifier:

Build up a corpus of scripts and attach a label to each: 'good' (label = 0) or 'bad' (label = 1). The more scripts the better. Try to ensure that the 'bad' scripts are a reasonable fraction of the total corpus; a 50-50 good/bad split is ideal.

Develop binary features that indicate suspicious or bad scripts, for example the presence of 'eval' or the presence of 'base64_decode'. Be as comprehensive as you can, and don't be afraid of including a feature that might capture some 'good' scripts too. One way to find candidates is to calculate the frequency counts of words in the two classes of script and select as features the words that appear prominently in 'bad' but less prominently in 'good'.
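The frequency-count idea could be sketched as follows. The tokenizer regex, the function names, and the `min_gap` cutoff are assumptions for illustration; this version compares the fraction of scripts containing each token rather than raw counts, so one huge file cannot dominate.

```python
import re
from collections import Counter

# Identifier-like tokens; a deliberately crude stand-in for a PHP tokenizer.
TOKEN = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def document_frequency(scripts):
    """Fraction of scripts in which each token appears at least once."""
    counts = Counter()
    for text in scripts:
        counts.update(set(TOKEN.findall(text)))
    total = len(scripts)
    return {tok: c / total for tok, c in counts.items()}


def select_features(good_scripts, bad_scripts, min_gap=0.3):
    """Tokens whose document frequency in 'bad' exceeds 'good' by min_gap."""
    good = document_frequency(good_scripts)
    bad = document_frequency(bad_scripts)
    gaps = {tok: freq - good.get(tok, 0.0) for tok, freq in bad.items()}
    # Largest gap first; ties broken alphabetically for a stable order.
    return sorted((tok for tok, gap in gaps.items() if gap >= min_gap),
                  key=lambda tok: (-gaps[tok], tok))
```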

Run the feature generator over the corpus and build up a binary matrix of features with labels.
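As a rough illustration of this step, the feature generator can be a function that maps each script to a row of 0/1 flags. The `FEATURES` list and the function names below are hypothetical; a real list would come from the feature development step.

```python
# Hypothetical feature list, used here only to make the example concrete.
FEATURES = ["eval", "base64_decode", "gzinflate", "preg_replace"]


def featurize(text, features=FEATURES):
    """Return one row of 0/1 flags: 1 if the feature string occurs in text."""
    return [1 if feature in text else 0 for feature in features]


def build_matrix(labelled_scripts, features=FEATURES):
    """Turn [(text, label), ...] into a feature matrix X and a label list y."""
    X = [featurize(text, features) for text, _ in labelled_scripts]
    y = [label for _, label in labelled_scripts]
    return X, y
```

Plain nested lists are enough here, since scikit-learn estimators accept them as input.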

Split the corpus into a training set (80% of examples) and a test set (20%). Using the scikit-learn library, train a few different classification algorithms (random forests, support vector machines, naive Bayes, etc.) on the training set and test their performance on the unseen test set.
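A sketch of the split-and-compare step with scikit-learn. The toy `X` and `y` below are made-up placeholder data, not real results, and the particular estimators and settings (a linear-kernel SVM, Bernoulli naive Bayes for binary features, a stratified split) are one reasonable choice among many.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.svm import SVC


def compare_classifiers(X, y, seed=0):
    """Train several classifiers on 80% of the data and return test accuracy."""
    # stratify=y keeps the good/bad ratio the same in both splits.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed)
    models = {
        "random forest": RandomForestClassifier(n_estimators=100, random_state=seed),
        "SVM": SVC(kernel="linear"),
        "naive Bayes": BernoulliNB(),
    }
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        scores[name] = model.score(X_test, y_test)  # mean accuracy on held-out data
    return scores


# Toy data standing in for a real binary feature matrix (placeholder only).
X = [[1, 1, 0], [1, 0, 1], [1, 1, 1], [0, 1, 1], [1, 1, 0],
     [0, 0, 0], [0, 0, 1], [0, 0, 0], [0, 1, 0], [0, 0, 0]] * 4
y = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0] * 4
scores = compare_classifiers(X, y)
```

Accuracy alone can mislead when 'bad' scripts are rare in production; since flagged files go to manual review anyway, it may be worth also looking at how many bad scripts each model misses.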

Hopefully this gives a reasonable classification accuracy to benchmark against. I'd then look at improving the features, trying some unsupervised methods (which need no labels), and using more specialised algorithms to get better performance.

For resources, Andrew Ng's Coursera course on Machine Learning (which includes a spam classification example, I believe) is a good start.

answered Oct 02 '22 10:10

Mike

Tags: python, machine-learning, statistics, intrusion-detection
