
Machine Learning on server log data

I recently got access to a huge amount of server log data (at my new job). I have some machine learning experience from college. The data includes server logs, database access logs, and so on. I was wondering what kind of learning can be done on such data.

One small thing I tried was predicting the number of requests in a given hour of the day from the past week's data. It seemed to work OK, but it is kind of trivial. So:

  • What kind of learning can be done on such data?
    • Maybe predicting the probability that an IP is doing spam clicks on ads (yes, the company is into that), based on the usage patterns of previous spammers?
    • Maybe predicting when the traffic may shoot up?
  • Are there any existing tools/projects that specifically leverage such data?
  • Any interesting resources/papers that discuss similar problems?
  • Also, there is data on process activity over time on the servers. Can this be useful for learning?
Swair asked Aug 26 '12 10:08



1 Answer

Have a look at Wei Xu et al. (2010), "Experience on Mining Google's Production Console Logs", and the work they cite. In short, they:

  1. Extract logging templates (e.g. "Writing to file %s") from the source code and use them to pull identifiers out of the logs (whatever appears in the log in place of %s is an identifier). They use certain heuristics to distinguish identifiers from non-identifiers (e.g. timestamps).
  2. Use ratios between values instead of raw numbers (e.g. the ratio of failed commits to all commits).
  3. Use Principal Component Analysis to discover anomalies in vectors of such features.
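Steps 2 and 3 can be sketched roughly as follows. This is only an illustration under stated assumptions, not the method from the paper: the per-window counts, the two ratio features, and all function names are made up, only the first principal component is kept, and it is approximated by power iteration using the standard library alone. The anomaly score is the squared distance of each window from that principal axis.

```python
import math
import random

def center(rows):
    """Subtract the per-column mean from every row."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]

def top_component(rows, iters=200, seed=0):
    """Approximate the first principal component of centered rows
    by power iteration on X^T X."""
    rng = random.Random(seed)
    d = len(rows[0])
    v = [rng.random() + 0.1 for _ in range(d)]
    for _ in range(iters):
        proj = [sum(x * vi for x, vi in zip(row, v)) for row in rows]
        w = [sum(p * row[j] for p, row in zip(proj, rows)) for j in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:  # degenerate data: all rows identical
            break
        v = [x / norm for x in w]
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

def residual_scores(rows):
    """Squared distance of each centered row from the first principal axis."""
    centered = center(rows)
    v = top_component(centered)
    scores = []
    for row in centered:
        p = sum(x * vi for x, vi in zip(row, v))
        scores.append(sum((x - p * vi) ** 2 for x, vi in zip(row, v)))
    return scores

# Hypothetical counts per time window:
# (failed_commits, all_commits, refused_conns, all_conns)
windows = [
    (10, 100, 11, 100), (20, 100, 19, 100), (30, 100, 31, 100),
    (40, 100, 39, 100), (50, 100, 51, 100), (60, 100, 59, 100),
    (70, 100, 71, 100), (80, 100, 79, 100),
    (45, 100, 80, 100),  # the two ratios no longer move together
]

# Step 2: ratios instead of raw numbers.
rows = [(f / a, r / c) for f, a, r, c in windows]

# Step 3: windows far from the principal axis are candidate anomalies.
scores = residual_scores(rows)
for i, s in enumerate(scores):
    print(i, round(s, 4))
```

With this toy data the last window should receive by far the largest score. On real logs you would likely keep more than one component and pick the threshold from the data.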

You probably cannot do step 1, but maybe you can extract the variables by writing your own "parser".

Also, there was a DARPA challenge to discover attacks in such data, but that was nearly 15 years ago.

There are some tools like Splunk, but apart from a nice interface they do not offer much beyond simple searching and filtering. UPDATE: There is an anomaly detection plugin by Prelert.

I am not aware of much more. Please let me know if you find anything else.

So what I would do:

  1. Extract features/variables from the logs

You probably do not have access to the source code that generated the messages, as Xu had, but I assume that a large portion of the logs can be covered by a small number of patterns (e.g. all the firewall logs will follow the same pattern). You can write regex parsers that extract features from those logs (e.g. "a connection was refused at a certain time").
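A minimal sketch of such a regex parser, assuming a made-up firewall log format (the line layout, the field names, and the helper functions are all hypothetical, so adapt the pattern to whatever your logs actually look like). It turns matching lines into records and counts refused connections per hour as one possible feature.

```python
import re
from collections import Counter

# Hypothetical firewall log line, assumed only for this sketch:
#   2012-08-26 10:01:13 REFUSED src=10.0.0.5 dport=22
LINE_RE = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) (?P<hour>\d{2}):\d{2}:\d{2} "
    r"(?P<action>ACCEPT|REFUSED) src=(?P<src>\d{1,3}(?:\.\d{1,3}){3}) "
    r"dport=(?P<dport>\d+)$"
)

def parse_line(line):
    """Return a dict of fields, or None if the line does not match the pattern."""
    m = LINE_RE.match(line.strip())
    return m.groupdict() if m else None

def refused_per_hour(lines):
    """Count refused connections per (date, hour); unmatched lines are skipped."""
    counts = Counter()
    for line in lines:
        rec = parse_line(line)
        if rec and rec["action"] == "REFUSED":
            counts[(rec["date"], rec["hour"])] += 1
    return counts

sample = [
    "2012-08-26 10:01:13 REFUSED src=10.0.0.5 dport=22",
    "2012-08-26 10:07:45 ACCEPT src=10.0.0.7 dport=80",
    "2012-08-26 10:59:02 REFUSED src=10.0.0.5 dport=22",
    "2012-08-26 11:00:10 REFUSED src=10.0.0.9 dport=3306",
    "some line from another subsystem",
]

for key, n in sorted(refused_per_hour(sample).items()):
    print(key, n)
```

It is worth tracking how many lines fail to match: a high miss rate suggests the handful of patterns does not yet cover the logs.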

  2. Try anomaly detection (PCA, or just deviation from the mean on each feature individually) and prediction on them.
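The simpler option, deviation from the mean on a single feature, could look like this. It is a sketch with hypothetical hourly counts; the threshold of 2.5 sample standard deviations is an arbitrary choice for illustration, not a recommendation.

```python
from statistics import mean, stdev

def zscore_anomalies(values, threshold=2.5):
    """Return indices whose value lies more than `threshold` sample
    standard deviations away from the mean."""
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:  # constant series: nothing deviates
        return []
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]

# Hypothetical feature: refused connections per hour.
refused = [4, 5, 6, 5, 4, 5, 6, 5, 4, 5, 6, 40]
print(zscore_anomalies(refused))
```

Note that a single large outlier inflates both the mean and the standard deviation, so on short series a robust variant (median and median absolute deviation) may work better.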
Jirka answered Oct 12 '22 20:10
