Newly inserted or updated row count in pentaho data integration

Tags:

etl

pentaho

pdi

I am new to Pentaho Data Integration. I need to move data from one database to another as an ETL job. I want to count the number of rows inserted/updated during the job and write that count to another table. Can anyone help me with this?

asked Oct 20 '15 by Sreejith


1 Answer

I don't think PDI has, to date, a built-in way of returning the number of rows affected by an Insert/Update step.

Nevertheless, most databases can report the number of rows affected by a given operation.

In PostgreSQL, for instance, it would look like this:

/* Count affected rows from INSERT */
WITH inserted_rows AS (
    INSERT INTO ...
    VALUES
        ...
    RETURNING 1
)
SELECT count(*) FROM inserted_rows;

/* Count affected rows from UPDATE */
WITH updated_rows AS (
    UPDATE ...
    SET ...
    WHERE ...
    RETURNING 1
)
SELECT count(*) FROM updated_rows;
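The same pattern, counting affected rows from a client script, can be sketched in Python. This is a minimal illustration using the standard-library sqlite3 module as a stand-in for the target database (the table and data are hypothetical); with PostgreSQL you would use the RETURNING queries above, but most DB-API drivers also expose the affected-row count via cursor.rowcount:

```python
import sqlite3

# In-memory database as a stand-in for the target DB (hypothetical schema).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")

# INSERT: executemany reports the total number of rows inserted via rowcount.
rows = [(1, "Alice"), (2, "Bob"), (3, "Carol")]
cur.executemany("INSERT INTO customers (id, name) VALUES (?, ?)", rows)
inserted = cur.rowcount

# UPDATE: rowcount reflects the rows actually matched by the WHERE clause.
cur.execute("UPDATE customers SET name = upper(name) WHERE id <= 2")
updated = cur.rowcount

conn.commit()
print(inserted, updated)  # 3 2
```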

However, you're aiming to do this from within a PDI job, so I suggest you work towards a setup in which you control the SQL script yourself.

Suggestion: Save the source data to a file on the target DB server, then use it (perhaps with a bulk-loading feature) to perform the insert/update, and save the number of affected rows into a PDI variable. Note that you may need to use the SQL script step at the Job level.

EDIT: the implementation is a matter of design choice, so the suggested solution is one of many. On a very high level, you could do something like the following.

  • Transformation I - extract data from source
    • Get the data from the source, be it a database or anything else
    • Prepare it for output in a way that it fits the target DB's structure
    • Save a CSV file using the text file output step on the file system
  • Parent Job
    • If the PDI server is the same as the target DB server:
      • Use the Execute SQL Script step to:
        • Read data from the file and perform the INSERT/UPDATE
        • Write the number of affected rows into a table (ideally, this table could also contain the time-stamp of the operation so you could keep track of things)
    • If the PDI server is NOT the same as the target DB server:
      • Upload the source data file to the server, e.g. with the FTP/SFTP file upload steps
      • Use the Execute SQL Script step to:
        • Read data from the file and perform the INSERT/UPDATE
        • Write the number of affected rows into a table

EDIT 2: another suggested solution

As suggested by @user3123116, you can use the Compare Fields step (if not part of your environment, check the marketplace for it).

The only shortcoming I see is that you have to query the target database before inserting/updating, which is, of course, less performant.

It could eventually look like this (note that this is just the comparison and counting part):

[screenshot: field compare]

Also note that you can split the input of the source data stream (copy, not distribute) and do your insert/update on one branch, but that branch must wait for the field-comparison branch to finish querying the target database; otherwise you might end up with wrong statistics.
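The comparison-and-counting logic described above can be sketched in plain Python. This is a minimal illustration of the idea behind the Compare Fields approach, not PDI's actual step: the target-table snapshot and incoming rows are hypothetical, and each incoming row is classified as new, changed, or identical before any write happens:

```python
# Hypothetical snapshot of the target table, keyed by primary key.
target = {1: {"name": "Alice"}, 2: {"name": "Bob"}}

# Incoming source stream.
source = [
    {"id": 1, "name": "Alice"},   # identical -> no-op
    {"id": 2, "name": "Bobby"},   # changed   -> update
    {"id": 3, "name": "Carol"},   # new       -> insert
]

inserts = updates = identical = 0
for row in source:
    existing = target.get(row["id"])
    if existing is None:
        inserts += 1          # key not in target: would be inserted
    elif existing["name"] != row["name"]:
        updates += 1          # key present but fields differ: would be updated
    else:
        identical += 1        # nothing to do

print(inserts, updates, identical)  # 1 1 1
```

The counts can then be written to your statistics table, while the copied stream performs the actual insert/update.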

answered Dec 22 '22 by Yuval Herziger