I recently got access to a huge amount of server log data (at my new job). I have some experience in machine learning from college. The data include server logs, database access logs, etc. I was wondering what kind of learning can be done on such data.
One small thing I tried was predicting the number of requests in a given hour of the day based on data from the past week. That seemed to work, but it is fairly trivial. So,
Have a look at Wei Xu et al. (2010), "Experience on Mining Google's Production Console Logs", and the work they cite. In short they:
You probably cannot do 1, but maybe you can extract the variables by writing your own "parser".
There was also a DARPA challenge to discover an attack in such data, but that was nearly 15 years ago.
There are some tools like Splunk, but apart from a nice interface they do not offer much beyond simple searching and filtering. UPDATE: There is an anomaly detection plugin by Prelert.
I am not aware of much more. Please let me know if you find anything else.
So what I would do:
You probably do not have access to the source code that generated the messages, as Xu had, but I assume that a large portion of the logs could be covered by a small number of patterns (e.g. all the firewall logs will have the same pattern). You can write regex parsers that extract features from those logs (e.g. a connection was refused at a certain time).
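To make the regex idea concrete, here is a minimal Python sketch. The log line format, the host name `fw01`, and the helper names `parse_line` and `refused_per_hour` are all made up for illustration; your real firewall logs will almost certainly look different, so treat the pattern as a placeholder to adapt.

```python
import re
from collections import Counter

# Hypothetical firewall-style line format (an assumption, not a real product's format):
#   "2014-03-01 10:15:02 fw01 connection refused from 10.0.0.7:51234"
REFUSED = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) "
    r"connection refused from (?P<src>\d{1,3}(?:\.\d{1,3}){3}):(?P<port>\d+)$"
)


def parse_line(line):
    """Return a dict of extracted fields, or None if the line does not match."""
    match = REFUSED.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    fields["port"] = int(fields["port"])
    return fields


def refused_per_hour(lines):
    """Count refused connections per hour, keyed by 'YYYY-MM-DD HH'."""
    counts = Counter()
    for line in lines:
        fields = parse_line(line)
        if fields is not None:
            # The first 13 characters of the timestamp are the date and hour.
            counts[fields["ts"][:13]] += 1
    return counts


# Made-up sample lines, only to show the shape of the input.
sample = [
    "2014-03-01 10:15:02 fw01 connection refused from 10.0.0.7:51234",
    "2014-03-01 10:47:39 fw01 connection refused from 10.0.0.9:40022",
    "2014-03-01 11:02:11 fw01 connection accepted from 10.0.0.7:51235",
]

if __name__ == "__main__":
    print(refused_per_hour(sample))
```

With one such pattern per message type, each log line turns into a small record of named fields, and the per-hour counts can then serve as features for whatever model you try next.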