
Machine learning for monitoring servers

I'm looking at pybrain for taking server monitor alarms and determining the root cause of a problem. I'm happy with training it using supervised learning and curating the training data sets. The data is structured something like this:

 * Server Type **A** #1
  * Alarm type 1
  * Alarm type 2
 * Server Type **A** #2
  * Alarm type 1
  * Alarm type 2
 * Server Type **B** #1
  * Alarm type **99**
  * Alarm type 2

So there are n servers, with x alarms that can be UP or DOWN. Both n and x are variable.

If Server A1 has alarm 1 & 2 as DOWN, then we can say that service a is down on that server and is the cause of the problem.

If alarm 1 is down on all servers, then we can say that service a is the cause.

There can potentially be multiple options for the cause, so straight classification doesn't seem appropriate.

I would also like to tie other data sources into the net later, such as scripts that ping some external service.

All the appropriate alarms may not be triggered at once, due to serial service checks, so it can start with one server down and then another server down 5 minutes later.

I'm trying to do some basic stuff at first:

from pybrain.tools.shortcuts import buildNetwork
from pybrain.datasets import SupervisedDataSet
from pybrain.supervised.trainers import BackpropTrainer


INPUTS = 2
OUTPUTS = 1

# Build network

# 2 inputs, 3 hidden, 1 output neurons
net = buildNetwork(INPUTS, 3, OUTPUTS)


# Build dataset

# Dataset with 2 inputs and 1 output
ds = SupervisedDataSet(INPUTS, OUTPUTS)


# Add one sample, iterable of inputs and iterable of outputs
ds.addSample((0, 0), (0,))



# Train the network with the dataset
trainer = BackpropTrainer(net, ds)

# Train for 10 epochs
for _ in range(10):
    trainer.train()

# Keep training until the validation error stops improving
trainer.trainUntilConvergence()


# Run an input over the network
result = net.activate([2, 1])

But I'm having a hard time mapping a variable number of alarms to a static number of inputs. For example, if we add an alarm to a server, or add a server, the whole net needs to be rebuilt. If that has to be done, I can do it, but I want to know if there's a better way.

Another option I'm considering is to have a different net for each type of server, but I don't see how I could draw an environment-wide conclusion, since each net would only evaluate a single host instead of all hosts at once.

Which type of algorithm should I use and how do I map the dataset to draw environment-wide conclusions as a whole with variable inputs?

I'm very open to any algorithm that will work. Go would be even better than Python.

Matt Williamson asked Sep 10 '14 16:09





2 Answers

This is a challenging problem actually.

Representation of labels

It's difficult to represent your target labels for learning. As you pointed out,

If Server A1 has alarm 1 & 2 as DOWN, then we can say that service a is down on that server and is the cause of the problem.
If alarm 1 is down on all servers, then we can say that service a is the cause.
There can potentially be multiple options for the cause ...

I think you need to list all possible causes up front; otherwise we cannot expect an ML algorithm to generalize. To keep it simple, let's say you have only two possible causes of the problem:

1. Service problem 
2. Server problem  

Site-wise binary classifier

Suppose in your first ML model, the above are the only two causes. Then you are working on a site-wise binary classifier now. Probably logistic regression is better to get you started since it is easily interpretable.
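As a minimal sketch of what such a site-wise logistic regression could look like, here is a plain-Python version trained by batch gradient descent. The single feature (the fraction of servers sharing the same DOWN alarm), the toy rows, the label coding (1 = service problem, 0 = server problem) and the function names are all assumptions made up for illustration; real training data would come from your curated incidents.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(rows, labels, lr=0.5, epochs=2000):
    """Fit weights and bias with batch gradient descent on the log loss."""
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, labels):
            error = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for i, xi in enumerate(x):
                grad_w[i] += error * xi
            grad_b += error
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(rows)
    return weights, bias

def predict(weights, bias, x):
    """Estimated probability that the incident is a service problem."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

# Hypothetical incidents: [fraction of servers sharing the same DOWN alarm]
rows = [[0.9], [1.0], [0.8], [0.1], [0.2], [0.25]]
labels = [1, 1, 1, 0, 0, 0]  # assumed coding: 1 = service problem, 0 = server problem

weights, bias = train_logistic(rows, labels)
print(predict(weights, bias, [0.95]), predict(weights, bias, [0.15]))
```

The learned weights stay readable: a large positive weight on the feature says that widely shared alarms point toward a service problem, which is the interpretability argument for starting with logistic regression.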

Finding out which server or which service is the problem can be your second step. For that second step, based on your example:

  • if it is a service problem, I think some decision rules can be derived manually so that the service name can be pinpointed. The idea is that you should see a significant number of servers triggering the same alarm, right? Also see the advanced readings at the end for more options.
  • if it is a server problem, you can construct a second binary classifier (an individual server-side classifier), which runs on each server using only features coming from that server and answers the question: "do I have a problem?"
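A hand-written decision rule for the service case might look roughly like the sketch below. The data layout (server name mapped to a dict of alarm name and "UP"/"DOWN" state), the function name `widespread_alarms` and the 0.8 threshold are illustrative assumptions, not anything taken from the question.

```python
def widespread_alarms(alarms_by_server, threshold=0.8):
    """Return alarms that are DOWN on at least `threshold` of the servers that report them."""
    down = {}
    total = {}
    for alarms in alarms_by_server.values():
        for name, state in alarms.items():
            total[name] = total.get(name, 0) + 1
            if state == "DOWN":
                down[name] = down.get(name, 0) + 1
    return sorted(
        name for name, count in total.items()
        if down.get(name, 0) / count >= threshold
    )

# Hypothetical snapshot of the environment
example = {
    "A1": {"alarm1": "DOWN", "alarm2": "UP"},
    "A2": {"alarm1": "DOWN", "alarm2": "UP"},
    "B1": {"alarm99": "UP", "alarm2": "DOWN"},
}
print(widespread_alarms(example))
```

Because the ratio is computed only over servers that actually report a given alarm, adding a server or an alarm type does not require changing the rule.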

Features for the site-wise binary classifier

I assume those alarms are your best source of features. For the site-wise classifier, summary statistics computed across servers will probably help more than the raw alarm states. For example,

  • the percentage of servers that are receiving alarm A as DOWN
  • the average length of time that alarm B has been DOWN, across all servers where it is DOWN
  • among all servers whose alarm B is DOWN, the percentage that also have alarm A DOWN. ...
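The three example features above could be computed along these lines. This is only a sketch: it assumes each alarm is stored as the number of seconds it has been DOWN, or None when it is UP, and the names `site_features` and `snapshot` are made up for the example.

```python
def site_features(snapshot, a="A", b="B"):
    """Summarize one environment-wide snapshot into a fixed-length feature dict.

    snapshot: {server: {alarm: seconds_down, or None if the alarm is UP}}
    """
    n = len(snapshot)
    a_down = [s for s, alarms in snapshot.items() if alarms.get(a) is not None]
    b_down = [s for s, alarms in snapshot.items() if alarms.get(b) is not None]
    both = [s for s in b_down if snapshot[s].get(a) is not None]
    return {
        "pct_a_down": len(a_down) / n if n else 0.0,
        "avg_b_down_seconds": (
            sum(snapshot[s][b] for s in b_down) / len(b_down) if b_down else 0.0
        ),
        "pct_a_down_given_b_down": len(both) / len(b_down) if b_down else 0.0,
    }

# Hypothetical snapshot: four servers, two alarm types
snapshot = {
    "web1": {"A": 120, "B": 300},
    "web2": {"A": None, "B": 100},
    "db1": {"A": 60, "B": None},
    "db2": {"A": None, "B": None},
}
features = site_features(snapshot)
print(features)
```

The feature vector has the same length no matter how many servers are in the snapshot, which is what lets a single site-wise model cope with a variable number of hosts.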

Features for the server-side binary classifier

You should explicitly use all alarm signals as the features for the server-side classifier. However, at training time, you should take all data from all of the servers. The labels are just "has-problem" or "has-no-problem". The training data will look like:

  alarm A on, alarm B on, alarm C on, ..., alarm Z on, has-problem
  YES,        YES,        NO,         ..., YES,        YES
  NO,         YES,        NO,         ..., NO,         NO
  ?,          NO,         YES,        ..., NO,         NO

Note that I used "?" to mark alarms whose state is missing (unknown) at the time of the sample. This can describe the situation below:

All the appropriate alarms may not be triggered at once, 
due to serial service checks,  so it can start with one server down and 
then another server down 5 minutes later.  

Some advanced readings

This problem is related to a few topics, e.g., alarm correlation, event correlation, fault diagnosis.

greeness answered Sep 27 '22 16:09



There are a number of options for variable inputs, but two relatively simple ones are:

1) Inputs which are not present are coded as 0.5, while inputs that are present are coded as either 0 or 1.
2) In addition, you could split the input into two, one for "present" vs. "not present", the other for "active" vs. "silent". Then the network will have to use the interaction between the two to learn that the second column is only important if the first one is 1, and not if the first one is 0. But with enough training cases it can probably do this.

The methods can be combined, of course.
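Both encodings can be sketched in a few lines. The fixed list `ALARMS`, the dict layout (alarm name mapped to True for active and False for silent, with missing keys meaning "not present") and the function names are assumptions for illustration. Setting the second column to 0 for absent alarms is also a choice made here, not something the answer prescribes.

```python
# Fixed, ordered list of every alarm the net knows about (hypothetical names)
ALARMS = ["alarm1", "alarm2", "alarm99"]

def encode_single(alarm_names, observed):
    """One input per alarm: 0.5 if not present, otherwise 1.0 (active) or 0.0 (silent)."""
    return [
        0.5 if name not in observed else float(observed[name])
        for name in alarm_names
    ]

def encode_pairs(alarm_names, observed):
    """Two inputs per alarm: a presence flag followed by an activity flag."""
    vector = []
    for name in alarm_names:
        present = name in observed
        vector.append(1.0 if present else 0.0)
        vector.append(1.0 if present and observed[name] else 0.0)
    return vector

# A server that has alarm1 active, alarm2 silent, and no alarm99 at all
observed = {"alarm1": True, "alarm2": False}
print(encode_single(ALARMS, observed))
print(encode_pairs(ALARMS, observed))
```

Either way the input width depends only on the length of `ALARMS`, so a server lacking some alarm still maps onto the same static number of inputs.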

rossdavidh answered Sep 27 '22 17:09
