 

Production architecture for big data real time machine learning application?

I'm starting to learn about big data, with a strong focus on predictive analysis, and to that end I have a case study I would like to implement:

I have a dataset of server health information that is polled every 5 seconds. I want to show the data that is retrieved but, more importantly, I want to run a previously built machine learning model and show the results (alerts about servers that are going to crash).

The machine learning model will be built by a machine learning specialist so that's completely out of scope. My job would be to integrate the machine learning model in a platform that runs the model and shows the results in a nice dashboard.

My problem is the "big picture" architecture of this system: I see that all the pieces already exist (Cloudera + Mahout) but I'm missing a simple integrated solution for all my needs, and I don't believe the state of the art is writing custom software for this...

So, can anyone shed some light on production systems like this (showing data with predictive analysis)? Reference architecture for this? Tutorials/documentation?


Notes:

  1. I've investigated some related technologies: cloudera/hadoop, pentaho, mahout and weka. I know that Pentaho for example is able to store big data and run ad-hoc Weka analysis on that data. Using cloudera and Impala a data specialist can also run ad-hoc queries and analyse the data but that's not my goal. I want my system to run the ML model and show the results in a nice dashboard alongside the retrieved data. And I'm looking for a platform that already allows this usage instead of custom building.

  2. I'm focusing on Pentaho as it seems to have a nice integration of Machine Learning but every tutorial I read was more about "ad-hoc" ML analysis than real-time. Any tutorial on that subject will be welcomed.

  3. I don't mind opensource or commercial solutions (with a trial)

  4. Depending on the specifics, maybe this isn't big data: more "traditional" solutions are also welcome.

  5. Also, "real time" here is a broad term: if the ML model performs well, running it every 5 seconds is good enough.

  6. The ML model is static (it isn't updated in real time and doesn't change its behavior).

  7. I'm not looking for a customized application for my example as my focus is on the big picture: big data with predictive analysis generic platforms.

Asked Dec 06 '12 by AlfaTeK


1 Answer

(I'm an author of Mahout, and am commercializing a productization of some of the ML in Mahout, with a focus on both real-time and scale: Myrrix. I don't know that it's exactly what you are looking for, but seems to address some of the issues you pose here. It might be useful as another reference point.)

You have highlighted the tension between real-time and large-scale. These aren't the same thing. Hadoop, as a computation environment, scales well but can do nothing in real time. Part of Mahout is built on Hadoop, and so is ML of that form. Weka, and the other parts of Mahout, are disposed to be more or less real-time, but then are challenged to scale.

An ML system that does both well necessarily has two layers: scalable offline model-building, with real-time online serving and updates. This is how it should look, IMHO, for recommenders for example: http://myrrix.com/design/

But you don't have any issue with model building, right? Someone else is going to build a static model? If so, that makes it much easier. Updating your model in real time is useful, but complicates things. If you don't have to, you're just generating predictions from a static model, which is usually fast.

I don't think Pentaho is relevant if you are interested in ML, or, running something based on your own ML model.

1 query every 5 seconds is not challenging -- is this 1 query per 5 seconds per machine or something?

My advice is to simply create a server that can answer queries against the model. Just reuse any old HTTP server container like Tomcat. It can load the latest model as it is published from some backing store like HDFS or a NoSQL DB. You can create N instances of the server effortlessly as they don't seem to need to communicate.
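A rough sketch of that "load the latest model as it is published" pattern, using only the JDK. Everything specific here is a placeholder assumption: the publish directory, the 60-second poll interval, and treating the serialized model as an opaque byte array (in practice it would be deserialized into whatever object the ML specialist's model exposes):

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

// The offline layer drops model files into MODEL_DIR; the online layer
// polls for them and hot-swaps the in-memory model between requests.
public class ModelReloader {
    static final Path MODEL_DIR = Paths.get("/models/current"); // assumed publish location
    static final AtomicReference<byte[]> model = new AtomicReference<>();

    // Swap in a new model only if one was actually loaded; in-flight
    // requests keep whatever reference they already read.
    static boolean swapIfLoaded(AtomicReference<byte[]> ref, byte[] candidate) {
        if (candidate == null) return false;
        ref.set(candidate);
        return true;
    }

    public static void main(String[] args) {
        Executors.newSingleThreadScheduledExecutor().scheduleAtFixedRate(() -> {
            try {
                Path latest = MODEL_DIR.resolve("model.bin");
                byte[] bytes = Files.exists(latest) ? Files.readAllBytes(latest) : null;
                swapIfLoaded(model, bytes);
            } catch (Exception e) {
                // On any failure, keep serving the previously loaded model.
            }
        }, 0, 60, TimeUnit.SECONDS);
    }
}
```

The same pattern works with HDFS or a NoSQL store in place of the local filesystem; only the fetch call changes.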

The only custom code there is whatever you need to wrap your ML model. This is quite a simple problem if you truly don't need to build your own models or update them dynamically. If you do -- harder question but still possible to architect for.
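To make that concrete, here is a minimal sketch of such a wrapper, using the JDK's built-in com.sun.net.httpserver rather than Tomcat just to keep it self-contained. CrashModel and its cpu/mem thresholds are hypothetical stand-ins for whatever the specialist's model actually looks like:

```java
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;

// Hypothetical stand-in for the specialist's model: flags a server as
// at-risk when CPU load and memory pressure are both high.
class CrashModel {
    boolean predictCrash(double cpu, double mem) {
        return cpu > 0.9 && mem > 0.8;
    }
}

public class ModelServer {
    public static void main(String[] args) throws IOException {
        CrashModel model = new CrashModel();
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/predict", (HttpExchange ex) -> {
            // Expect a query string like ?cpu=0.95&mem=0.85 (parsing kept minimal)
            double cpu = 0, mem = 0;
            String q = ex.getRequestURI().getQuery();
            if (q != null) {
                for (String kv : q.split("&")) {
                    String[] p = kv.split("=");
                    if (p.length == 2 && p[0].equals("cpu")) cpu = Double.parseDouble(p[1]);
                    if (p.length == 2 && p[0].equals("mem")) mem = Double.parseDouble(p[1]);
                }
            }
            String body = model.predictCrash(cpu, mem) ? "ALERT" : "OK";
            ex.sendResponseHeaders(200, body.length());
            try (OutputStream os = ex.getResponseBody()) {
                os.write(body.getBytes());
            }
        });
        server.start();
    }
}
```

A dashboard can then hit GET /predict every 5 seconds and render the ALERT/OK responses, and because each instance is stateless you can run N of them behind a load balancer.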

Answered Oct 19 '22 by Sean Owen