I'm looking at pybrain for taking server monitor alarms and determining the root cause of a problem. I'm happy with training it using supervised learning and curating the training data sets. The data is structured something like this:
* Server Type **A** #1
    * Alarm type 1
    * Alarm type 2
* Server Type **A** #2
    * Alarm type 1
    * Alarm type 2
* Server Type **B** #1
    * Alarm type **99**
    * Alarm type 2
So there are `n` servers, with `x` alarms that can be `UP` or `DOWN`. Both `n` and `x` are variable.
If Server A1 has alarms 1 & 2 as `DOWN`, then we can say that service a is down on that server and is the cause of the problem.
If alarm 1 is down on all servers, then we can say that service a is the cause.
There can potentially be multiple options for the cause, so straight classification doesn't seem appropriate.
I would also like to tie other sources of data into the net later, such as scripts that ping some external service.
All the appropriate alarms may not be triggered at once, due to serial service checks, so it can start with one server down and then another server down 5 minutes later.
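To make that concrete, here is a rough sketch of a snapshot of the state and the two rules above; the server and alarm names are illustrative, and the real sets vary over time:

```python
# Illustrative snapshot of the monitoring state described above.
state = {
    "server_A1": {"alarm_1": "DOWN", "alarm_2": "DOWN"},
    "server_A2": {"alarm_1": "DOWN", "alarm_2": "UP"},
    "server_B1": {"alarm_99": "UP", "alarm_2": "UP"},
}

def service_a_down_on(server):
    """Rule 1: alarms 1 and 2 both DOWN on one server -> service a is down there."""
    alarms = state[server]
    return alarms.get("alarm_1") == "DOWN" and alarms.get("alarm_2") == "DOWN"

def alarm_down_everywhere(alarm):
    """Rule 2: an alarm DOWN on every server that has it -> service-wide cause."""
    return all(
        alarms[alarm] == "DOWN"
        for alarms in state.values()
        if alarm in alarms
    )
```

The hard part, as described below, is that the keys of these dicts change whenever a server or alarm is added.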
I'm trying to do some basic stuff at first:
```python
from pybrain.tools.shortcuts import buildNetwork
from pybrain.datasets import SupervisedDataSet
from pybrain.supervised.trainers import BackpropTrainer

INPUTS = 2
OUTPUTS = 1

# Build network: 2 inputs, 3 hidden, 1 output neurons
net = buildNetwork(INPUTS, 3, OUTPUTS)

# Build dataset with 2 inputs and 1 output
ds = SupervisedDataSet(INPUTS, OUTPUTS)

# Add one sample: an iterable of inputs and an iterable of outputs
ds.addSample((0, 0), (0,))

# Train the network on the dataset
trainer = BackpropTrainer(net, ds)

# Train 10 epochs
for x in xrange(10):
    trainer.train()

# Or train until the error converges
trainer.trainUntilConvergence()

# Run an input through the network
result = net.activate([2, 1])
```
But I'm having a hard time mapping variable numbers of alarms to a static number of inputs. For example, if we add an alarm to a server, or add a server, the whole net needs to be rebuilt. If that is something that needs to be done, I can do it, but I want to know if there's a better way.
Another option I'm considering is a different net for each type of server, but I don't see how I can draw an environment-wide conclusion that way, since each net would only evaluate a single host instead of all hosts at once.
Which type of algorithm should I use and how do I map the dataset to draw environment-wide conclusions as a whole with variable inputs?
I'm very open to any algorithm that will work. Go would be even better than Python.
This is actually a challenging problem.
It's difficult to represent your target labels for learning. As you pointed out,
> If Server A1 has alarm 1 & 2 as DOWN, then we can say that service a is down on that server and is the cause of the problem.
>
> If alarm 1 is down on all servers, then we can say that service a is the cause.
>
> There can potentially be multiple options for the cause ...
I guess you need to list all possible options; otherwise we cannot expect an ML algorithm to generalize. To keep it simple, let's say you have only two possible causes of the problem:
1. Service problem
2. Server problem
Suppose in your first ML model these are the only two causes. Then you are building a site-wide binary classifier. Logistic regression is probably a good way to get started, since it is easily interpretable.
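This first step can be sketched with scikit-learn's `LogisticRegression` (an assumption on my part; any logistic regression implementation would do, and the features and numbers here are made up):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row is a site-wide snapshot:
# [fraction of servers with alarm 1 on, fraction of servers with alarm 2 on]
X = np.array([
    [1.0, 0.9],   # alarms on almost everywhere
    [0.9, 1.0],
    [0.1, 0.2],   # alarms isolated to a few servers
    [0.2, 0.1],
])
# Labels: 0 = service problem, 1 = server problem
y = np.array([0, 0, 1, 1])

clf = LogisticRegression().fit(X, y)

# The learned coefficients are directly interpretable: the sign and size of
# each weight show which way that feature pushes the decision.
print(clf.coef_)

pred = clf.predict(np.array([[0.95, 0.9]]))  # widespread alarms
```

With real data you would have one feature per summary statistic and far more snapshots than this toy example.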
Finding out which server or which service is the problem can be your second step. For that step, based on your example, I assume all those alarms are the best source of features. Some summary statistics over the alarms (for example, the fraction of servers with each alarm on) could help more for the site-wide classifier.
For the server-side classifier, you should explicitly use all alarm signals as features, but at training time take data from all of the servers. The labels are just "has-problem" or "has-no-problem". The training data will look like:
| alarm A on | alarm B on | alarm C on | ... | alarm Z on | has-problem |
|------------|------------|------------|-----|------------|-------------|
| YES        | YES        | NO         | ... | YES        | YES         |
| NO         | YES        | NO         | ... | NO         | NO          |
| ?          | NO         | YES        | ... | NO         | NO          |
Note that I used "?" to indicate alarms for which you may have missing data (an unknown state), which describes the situation you mentioned:

> All the appropriate alarms may not be triggered at once, due to serial service checks, so it can start with one server down and then another server down 5 minutes later.
This problem is related to a few topics, e.g., alarm correlation, event correlation, fault diagnosis.
There are a number of options for variable inputs, but two relatively simple ones are:
1. Inputs which are not present are coded as 0.5, while inputs that are present are coded as either 0 or 1.
2. In addition, you could split each input into two: one for "present" vs. "not present", the other for "active" vs. "silent". The network then has to use the interaction between the two to learn that the second input only matters when the first one is 1, and not when it is 0. With enough training cases it can probably do this.
The methods can be combined, of course.
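Both options can be written as small encoders; this is a minimal sketch, and the helper names are mine:

```python
def encode_single(s):
    """Option 1: one input per alarm; unknown -> 0.5, otherwise 0 or 1."""
    return {"YES": 1.0, "NO": 0.0, "?": 0.5}[s]

def encode_pair(s):
    """Option 2: two inputs per alarm, (present, active)."""
    if s == "?":
        return (0.0, 0.0)  # not present; the second value is a don't-care
    return (1.0, 1.0 if s == "YES" else 0.0)

row = ["YES", "NO", "?"]
single = [encode_single(s) for s in row]
# -> [1.0, 0.0, 0.5]
pairs = [v for s in row for v in encode_pair(s)]
# -> [1.0, 1.0, 1.0, 0.0, 0.0, 0.0]
```

Either encoding gives every alarm a fixed slot, so the network's input size stays constant as long as you size it for the full alarm vocabulary up front.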