 

How to create multiple output files in Talend based on a column from an SQL Input

Tags: etl, talend

I need to create multiple output files based on the value of a column from an SQL input in Talend Open Studio.

My tMSSQLInput returns about 50,000 rows, where one of the columns is building_name:

Building A, Building B, Building C, ...

So all rows with the value "Building A" should go into an Excel file named "buildingA.xls", all rows with "Building B" into an Excel file named "buildingB.xls", and so on.

I have been trying to use tLoop or tForEach along with tIterateToFlow, but I am not sure how to implement it.

Thanks in advance.

asked Sep 05 '14 by Viriato

2 Answers

Gabriele's answer looks pretty good to me.

However, if you find yourself with a huge amount of data across a lot of buildings, to the point where any single building's rows fit in memory but the whole data set does not, then I'd be inclined to use a slightly different approach.

In this example job I'm using MySQL database components, just because I happen to have a local MySQL database, but everything about this job holds true for Oracle or MS SQL Server as well:

[Image: job layout]

At the very start we open a connection to the database, using a tMySqlConnection component in this case. The remaining two database components (the tMySqlInput and the tMySqlRow) then use the shared connection details.

We start by grabbing a list of the buildings in the database, using the following query in the tMySqlInput:

"SELECT DISTINCT building
FROM filesplittest"

This returns every distinct building.

We then iterate through every building, which allows us to keep just the records for that particular building in memory for the rest of the job.

We then use a tMySqlRow component with a prepared statement to pull the data for the building in the current iteration. The example query I'm using looks like:

"SELECT building, foo, bar
FROM FileSplitTest
WHERE building = ?"

And then we configure the prepared statement in the advanced settings:

[Image: tMySqlRow advanced settings for the prepared statement]

Here I've said that the first parameter (Parameter Index = 1) is the building value we retrieved earlier. The tFlowToIterate helpfully pushed it to the globalMap for us, so we retrieve it from there using ((String)globalMap.get("row6.building")) in this case (it's the "building" column that was in the row6 flow).
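In table form, the prepared statement parameters end up looking something like this (a sketch; "row6" is just the name the flow happened to get in this job, so substitute your own flow name):

    Parameter Index: 1
    Parameter Type:  String
    Parameter Value: ((String)globalMap.get("row6.building"))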

When using a prepared statement you need to retrieve the data as a record set object, so you'll want to set the schema of the tMySqlRow like so:

[Image: tMySqlRow schema]

And then we parse it using the tParseRecordSet component:

[Image: tParseRecordSet component]

With a schema to fit this example:

[Image: tParseRecordSet schema]

Then we need to iterate across this set of data, appending it to an appropriately named CSV. To do this we use another tFlowToIterate component and take a slightly annoying detour via a tFixedFlowInput component to read each record's data out of the globalMap before passing it to the tFileOutputDelimited:

[Image: tFixedFlowInput configuration reading data from the globalMap]
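The Values table of the tFixedFlowInput then ends up with one entry per column, each read back from the globalMap. A sketch, assuming the second iterate flow happens to be called "row2" and the columns match the earlier example query:

    building: (String)globalMap.get("row2.building")
    foo:      (String)globalMap.get("row2.foo")
    bar:      (String)globalMap.get("row2.bar")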

And then finally we append it to a CSV named after the building:

[Image: tFileOutputDelimited with append and a dynamic file name from the globalMap]

Notice that the Append check box is ticked; otherwise each iteration of the job would overwrite the previous one. We also name the file after the value in the building column.
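If it helps to see the same logic outside of Talend, here is a minimal plain-JDBC sketch of what this job boils down to. The connection details are placeholders, and the table and column names are the ones from the example queries above:

    import java.io.FileWriter;
    import java.io.PrintWriter;
    import java.sql.*;

    public class SplitByBuilding {
        public static void main(String[] args) throws Exception {
            // One shared connection, as with the tMySqlConnection component.
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:mysql://localhost:3306/test", "user", "password")) {

                // One query for the distinct buildings (the tMySqlInput).
                try (Statement st = conn.createStatement();
                     ResultSet buildings = st.executeQuery(
                             "SELECT DISTINCT building FROM filesplittest")) {

                    // Iterate per building (the tFlowToIterate).
                    while (buildings.next()) {
                        String building = buildings.getString("building");

                        // Prepared statement per building (the tMySqlRow), so
                        // only one building's rows are in memory at a time.
                        try (PreparedStatement ps = conn.prepareStatement(
                                "SELECT building, foo, bar FROM filesplittest WHERE building = ?")) {
                            ps.setString(1, building);
                            try (ResultSet rows = ps.executeQuery();
                                 // Append mode, like the ticked Append check box.
                                 PrintWriter out = new PrintWriter(
                                         new FileWriter(building + ".csv", true))) {
                                while (rows.next()) {
                                    out.println(rows.getString("building") + ";"
                                            + rows.getString("foo") + ";"
                                            + rows.getString("bar"));
                                }
                            }
                        }
                    }
                }
            }
        }
    }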


As Gabriele mentioned, if your data fits happily in memory at all times, you can simplify the job by instead just reading your data into a tHashOutput component and then filtering the data in the hash:

[Image: simplified job layout keeping everything in memory with tHash components]

We start by reading all of the data into a tHashOutput component, which then keeps the data in memory throughout the job. Talend sometimes hides these components for some odd reason, but you can re-enable them under Project Properties -> Designer -> Palette settings:

[Image: how to re-enable the tHash components]

Next we read the data back in from the hash using a tHashInput component (linked to the previous tHashOutput component; don't forget to give the tHashInput the same schema) and then use a tAggregateRow component, grouping by "building", to effectively take a distinct of the building values:

[Image: tAggregateRow settings]
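Grouping on "building" with no aggregation functions behaves like a distinct, i.e. roughly the same as the earlier query:

"SELECT DISTINCT building
FROM filesplittest"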

We then iterate through the distinct values for "building" using the tFlowToIterate, and filter the hash (read in a second time) by the building value currently being iterated:

[Image: tFilterRow configuration]
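In advanced mode, the tFilterRow condition is a plain Java boolean expression. A sketch, assuming the iterate flow out of the tFlowToIterate happens to be called "row4" (use whatever yours is named):

    input_row.building.equals((String)globalMap.get("row4.building"))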

And finally, we once again make sure to append to a file named after the value in the building column:

[Image: tFileOutputDelimited settings]
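And again for comparison, a minimal in-memory sketch of this simplified approach in plain Java: a single grouping map plays the role of the tAggregateRow/tFilterRow pair, and each group is written to its own file. The row layout {building, foo, bar} is assumed from the earlier example:

    import java.io.PrintWriter;
    import java.nio.file.*;
    import java.util.*;

    public class SplitInMemory {
        // rows: everything already held in memory, like the tHashOutput contents.
        static void split(List<String[]> rows) throws Exception {
            // Group the rows by building (column 0).
            Map<String, List<String[]>> byBuilding = new LinkedHashMap<>();
            for (String[] r : rows) {
                byBuilding.computeIfAbsent(r[0], k -> new ArrayList<>()).add(r);
            }
            // One file per building, named after the building value.
            for (Map.Entry<String, List<String[]>> e : byBuilding.entrySet()) {
                try (PrintWriter out = new PrintWriter(
                        Files.newBufferedWriter(Paths.get(e.getKey() + ".csv")))) {
                    for (String[] r : e.getValue()) {
                        out.println(String.join(";", r));
                    }
                }
            }
        }
    }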

answered Sep 30 '22 by ydaetskcoR


I think it's better to do it explicitly, in a two-step job:

  • First you get the list of files to be built
  • Then you route each row to the file it belongs to

I would design a job like this:

    tMSSQLInput_1 ------> tCacheOut_1
          |
      OnSubjobOk
          |
          v
    tCacheIn_1 ------> tAggregateRow ------> tFlowToIterate
                                                   |
                                               (iterate)
                                                   |
          +----------------------------------------+
          |
          v
    tCacheIn_1 ------> tFilterRow ------> tFileOutputDelimited

Let me explain what's going on:

  • In the first subjob you dump the table into a memory buffer (tCacheOut, available on Talend Exchange, is a good component, but the out-of-the-box tHashInput/tHashOutput will do the job too). This is so you query the DB just once; if performance is not a requirement, you can trigger multiple queries and skip the memory buffer.
  • Then you read the dump a first time to take a distinct of your buildings (using tAggregateRow against the building column).
  • Then you switch to an iterate flow, saving the current building value in a global variable; let's call it "my_building".
  • Then you read your dump a second time and keep only the rows for the current building. In fact, you can use globalMap.get("my_building") inside your filtering condition.
  • Finally, you save these rows to the appropriate file, again using globalMap.get("my_building") to parametrize your filename (see the expressions sketched below).
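For example, the two expressions might look like this (a sketch: "my_building" is the global variable name chosen above, the column is assumed to be called "building", and the output directory is hypothetical):

    // tFilterRow advanced-mode condition:
    input_row.building.equals((String)globalMap.get("my_building"))

    // tFileOutputDelimited "File Name":
    "/output/" + ((String)globalMap.get("my_building")) + ".csv"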
answered Sep 30 '22 by Gabriele B