
Difference between Pentaho application server clustering and Carte clustering

Tags:

pentaho

I am new to Pentaho. Currently we need to cluster our Pentaho CE to load-balance the transformations we have, but honestly we are confused about how to do it.

The Pentaho documentation page Cluster the Application Server - Pentaho Documentation describes how to cluster the PDI application server. On the other hand, there is documentation for clustering the Carte engine instead. To achieve a cluster that can distribute scheduled jobs, which application should I cluster: Carte or the PDI application server? What is the difference between clustering the PDI application server and clustering the Carte engine? Thanks.

rama3i asked Dec 20 '18


1 Answer

This is not a new question; a lot of folks confuse the Pentaho application server and Pentaho Data Integration (Kettle). Thank you, marketing, for renaming all products acquired by Pentaho as Pentaho. Maybe in the future Pentaho will be renamed Vantara, so everything gets mixed together even further in advance.

If you want to cluster transformation execution, you are dealing with the Pentaho Data Integration product, which is not directly related to the Pentaho BA server. Pentaho Data Integration, previously known as Kettle, can live without the Pentaho BA server (or Pentaho application server) entirely. There was even a time when what is now called PDI was not part of Pentaho at all and was named differently, and the Carte server was already in place as part of Kettle. Now all of it is called Pentaho, Pentaho, Pentaho, and from my point of view that is the root of why it is hard to tell the difference between Carte, the Pentaho BA server, and everything else in the ecosystem.

In short: the Carte server is used to execute PDI (Kettle) jobs and transformations. The Pentaho server is a web application that acts as a repository for reports and hosts the report execution engine. These are completely different projects, even though they work together to deliver the full data analysis stack.

Why Carte?

Where did the Carte server come from? It was born from Kettle. Kettle itself was born as a tool to execute ETL transformations. It was called K-et(t)le because the person who invented it was a KDE fan (Hi, Matt!), and he added K + ETL because all KDE fans like to put K as the first letter of their products. Notice the file extensions .ktr and .kjb - the first letter is 'K'. So the tool was called Kettle. Its UI for designing transformations and jobs was called Spoon, as a joke on the kitchen theme, and the command-line tools for running XML jobs and transformations without a UI were called Kitchen (for jobs) and Pan (for transformations). Then people built the Carte server - a remote server, or a cluster of servers, to run ETL jobs and transformations - and, staying on theme, it was called Carte, like a wine carte. So much for the naming.

If you are familiar with PDI jobs and transformations, you know they are just plain XML metadata files which describe what to do and how and where to extract information. They need an engine to execute them. They can be run in place in the UI designer (Spoon), they can be run without a UI and scheduled (this is Kitchen/Pan execution), or they can be executed on 1...n remote servers - which is Carte execution.
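To make that concrete, here is a minimal sketch of running a transformation with the embedded Kettle engine from Java - the same engine that Spoon and Pan drive. It assumes the PDI libraries (kettle-core, kettle-engine and their dependencies) are on the classpath; the file path is hypothetical.

```java
import org.pentaho.di.core.KettleEnvironment;
import org.pentaho.di.trans.Trans;
import org.pentaho.di.trans.TransMeta;

public class RunLocalTrans {
    public static void main(String[] args) throws Exception {
        // Bootstrap the Kettle engine (loads plugins, step definitions, etc.)
        KettleEnvironment.init();

        // Parse the transformation's XML metadata (path is a made-up example)
        TransMeta meta = new TransMeta("/etc/pdi/sample.ktr");

        // Execute it inside this JVM - no BA server, no Carte involved
        Trans trans = new Trans(meta);
        trans.execute(null);          // null = no command-line arguments
        trans.waitUntilFinished();

        if (trans.getErrors() > 0) {
            throw new RuntimeException("Transformation finished with errors");
        }
    }
}
```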

Carte itself is just a Jetty web server which starts up and listens for incoming XML. Remember, PDI jobs and transformations are just XML. It can receive a whole XML document - meaning the whole transformation runs on that Carte instance - or only a part of a transformation (remote steps or remote transformations). Either way, it is a Java process waiting for XML metadata describing how to extract, transform and load.
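Because Carte is just HTTP plus XML, you can poke it with any HTTP client. The sketch below polls its /kettle/status endpoint; the host, port, and the default cluster/cluster credentials are assumptions matching a stock install, so adjust them to your carte-config.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Base64;

public class CarteStatusProbe {
    public static void main(String[] args) throws Exception {
        // Assumed stock defaults: change host/port to match your carte-config
        URL url = new URL("http://localhost:8081/kettle/status/?xml=Y");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();

        // Carte uses HTTP basic auth; a fresh install ships with cluster/cluster
        String auth = Base64.getEncoder().encodeToString("cluster:cluster".getBytes());
        conn.setRequestProperty("Authorization", "Basic " + auth);

        try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);  // XML listing of running/finished trans and jobs
            }
        }
    }
}
```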

When we talk about a cluster of Carte servers, we are talking about one or many Jetty servers started together. One of them can be the master. If you post your job/transformation to the master, it will start a process according to the kjb/ktr XML, and if it finds that the job/transformation is designed to run on a cluster of Carte servers, it will send the metadata (and in some cases data) to the slaves; the slaves execute their part of the job and return data back to the master. There are a lot of details in how to run a job/transformation on a Carte cluster - just think of it as one or many Jetty servers able to execute Kettle jobs/transformations, as in the sketch below.
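For illustration, here is a hedged sketch of shipping a transformation to a remote Carte server through the PDI client API. The server coordinates and file path are assumptions, and it passes null for the repository and metastore on the assumption of a purely file-based setup; it is a sketch of the hand-off, not a definitive recipe.

```java
import org.pentaho.di.cluster.SlaveServer;
import org.pentaho.di.core.KettleEnvironment;
import org.pentaho.di.trans.Trans;
import org.pentaho.di.trans.TransExecutionConfiguration;
import org.pentaho.di.trans.TransMeta;

public class RunOnCarte {
    public static void main(String[] args) throws Exception {
        KettleEnvironment.init();

        // The Carte server we POST the transformation XML to (values are assumptions)
        SlaveServer master = new SlaveServer("master1", "localhost", "8081", "cluster", "cluster");

        TransMeta meta = new TransMeta("/etc/pdi/sample.ktr");  // hypothetical path

        // Tell the engine to ship the metadata to the remote server instead of running here
        TransExecutionConfiguration config = new TransExecutionConfiguration();
        config.setRemoteServer(master);

        // Serializes the ktr XML, POSTs it to Carte, and returns Carte's object id
        // (null repository/metastore: assuming a file-based transformation)
        String carteObjectId = Trans.sendToSlaveServer(meta, config, null, null);
        System.out.println("Running on Carte as " + carteObjectId);
    }
}
```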

Why Pentaho BA server?

Now about how the Pentaho BA server was born.

...While the beginnings of Kettle described above were happening, at the same time and completely independently there was the Pentaho company, which was engaged in building BA servers. They acquired a reporting engine (now called Pentaho Reporting) and the Mondrian engine to run MDX queries, and were a pretty successful company. They even invented x-actions, XML documents describing a bunch of commands for their BA server to run. But they lacked a powerful data extraction engine. And then they found the Kettle mentioned above. This was a great match: a good reporting engine met a good data extraction tool. So they acquired Kettle, renamed it Pentaho Data Integration (all their products are called Pentaho), and it became part of the Pentaho BA server.

Pentaho BA and Carte?

How does it all work together? When you run a report on the Pentaho BA server, it tries to extract information from somewhere. The Pentaho BA server includes the reporting engine, which is responsible for retrieving the data used to generate the report. If you have configured your report to read data from PDI (previously known as Kettle), it points to a job (.kjb) or transformation (.ktr). When you execute the report, the Pentaho BA server calls the reporting engine, which finds that the report requires a ktr/kjb execution, so it calls the PDI engine to execute the job or transformation and extract the data. The job or transformation can be configured to run on one or many Carte servers, in which case execution sends a request to a Carte server to run it. Look at the chain: we asked the Pentaho BA server (a Tomcat web application) to execute a report; the BA server builds the report but needs the PDI engine; the PDI engine discovers that we are executing a clustered job or transformation, so it calls the Carte servers - which are entirely independent servers.

There are cases where the Pentaho BA server executes a report, calls the PDI engine, and the PDI engine does not require any Carte clustering to run the job or transformation. In that case the PDI engine can execute inside the Pentaho BA server's Java process itself (from Pentaho version 7 this can be completely asynchronous).

There are also cases where you run a job or transformation without a Pentaho BA server at all, using Spoon to run the kjb/ktr - and if you have configured a Carte cluster, you don't need the Pentaho BA server at all.

And remember there is also the Pentaho Big Data Plugin, which is part of PDI/Pentaho/Kettle but has its own history and considerations. If you dig into the Pentaho world you will meet it one day, so don't be surprised.

And thank you, marketing, for calling all the products Pentaho (I expect soon everything will be called Vantara, mixing things up once again). I remember that from the very beginning it was very hard to understand what came from where and why, and for years there has been a lack of good documentation on all of it.

This is the documentation on the Carte server you may be looking for. It covers version 5+, but I don't expect things have changed much since.

Dzmitry Prakapenka answered Oct 25 '22