I need to create multiple output files based on a column value from an SQL input in Talend Open Studio.
My tMSSQLInput returns about 50,000 rows, where one of the columns is building_name, with values like:
Building A, Building B, Building C, ...
So all rows that have the value "Building A" should go in an Excel file named "buildingA.xls", all rows that have "Building B" in an Excel file named "buildingB.xls", and so on.
I have been trying to use tLoop or tForEach along with tIterateToFlow, but I am not sure how to implement it.
Thanks in advance.
Gabriele's answer looks pretty good to me.
However, if you find yourself with a huge amount of data across a lot of buildings, to the point where any single building's rows fit in memory but the full dataset does not, then I'd be inclined to use a slightly different approach.
In this example job I'm using MySQL database components just because I happen to have a local MySQL database, but everything about this job holds true for Oracle or MS SQL Server as well:
At the very start we open a connection to the database, using a tMySqlConnection component in this case. The remaining two database components (the tMySqlInput and the tMySqlRow) then use the shared connection details.
We start with grabbing a list of the buildings in the database using the following query in the tMySqlInput:
"SELECT DISTINCT building
FROM filesplittest"
This returns every distinct building.
We then iterate through every building, which allows us to keep just the records for that particular building in memory for the rest of the job.
We then use a tMySqlRow component to pull the data for the building in the current iteration, using a prepared statement. The example query I'm using looks like:
"SELECT building, foo, bar
FROM FileSplitTest
WHERE building = ?"
And then we configure the prepared statement in the advanced settings:
Here I've said that the first parameter (Parameter Index = 1) is the building value we retrieved earlier. The tFlowToIterate helpfully pushed it to the globalMap for us, so we retrieve it from there using ((String)globalMap.get("row6.building")) in this case (it's the "building" column that was in the row6 flow).
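Spelled out, the prepared statement parameter table in the advanced settings would hold something like this (the row6 name depends on what your incoming flow is actually called, and String is the assumed parameter type):

Parameter Index: 1
Parameter Type:  String
Parameter Value: ((String)globalMap.get("row6.building"))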
When using a prepared statement you need to retrieve the data as a record set object so you'll want to set the schema of the tMySqlRow like so:
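In practice that means giving the tMySqlRow a single Object-typed column to carry the whole record set (the column name is up to you; resultSet here is just an example):

Column: resultSet    Type: Object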
And then we parse it using the tParseRecordSet component:
With a schema to fit this example:
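Given the example query, that schema just mirrors the selected columns (the String types are an assumption; match them to your actual table):

building    String
foo         String
bar         String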
Then we need to iterate across this set of data, appending it to an appropriately named CSV. To do this we use another tFlowToIterate component and take a slightly annoying detour via a tFixedFlowInput component to read each record's data out of the globalMap before passing it to the tFileOutputDelimited:
And then finally we append it to a CSV named after the building:
Notice the Append check box is ticked; otherwise each iteration of the job would overwrite the previous one. We also name the file using the value in the building column.
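As a sketch, the File Name setting of the tFileOutputDelimited could be an expression along these lines, where the output folder and the row8 flow name are placeholders for whatever your own job uses:

"C:/output/" + ((String)globalMap.get("row8.building")) + ".csv"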
As Gabriele mentioned, if your data fits happily in memory at all times, you can simplify the job by instead just reading your data into a tHashOutput component and then filtering the data in the hash:
We start by reading all of the data into a tHashOutput component that then keeps the data in memory throughout the job. Talend sometimes hides these components for some odd reason, but you can re-enable them by adding them back under Project Properties -> Designer -> Palette settings:
Next we read the data back in from the hash using a tHashInput component (linked to the previous tHashOutput component; don't forget to add the same schema to the tHashInput component). We then use a tAggregateRow component grouped by "building" to effectively take a distinct of the building values:
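The tAggregateRow configuration for this is minimal; grouping alone is enough to produce the distinct values (nothing is needed in the Operations table):

Group by:   building (output) <- building (input)
Operations: (none)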
We then iterate through the distinct values of "building" using the tFlowToIterate and filter the hash (read in a second time) by the building value currently being iterated:
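For instance, a tFilterRow in advanced mode could use a condition like the following, where row2 is a placeholder for the name of the flow that fed the tFlowToIterate:

input_row.building.equals((String)globalMap.get("row2.building"))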
And finally, we once again make sure to append to a file named after the value in the building column:
I think it's better to do it verbosely in a two-step job.
I would design a job like this:
tMSSQLInput_1 ------> tCacheOut_1
     |
 OnSubjobOk
     |
     v
tCacheIn_1 ------> tAggregateRow ------> tFlowToIterate
                                               |
                                           (iterate)
                                               |
     +-----------------------------------------+
     |
     v
tCacheIn_2 ------> tFilterRow ------> tFileOutputDelimited
Let me explain what's going on:

1. Save the whole resultset in a memory buffer (tCacheOut, available on Talend Exchange, is a good component, but the out-of-the-box tHashInput/tHashOutput will do the job, too). This is for querying the DB just once; if performance is not a requisite, you can trigger multiple queries and avoid the use of the memory buffer.
2. Extract the distinct values with the tAggregateRow (grouping against the building column).
3. Iterate over each of them with the tFlowToIterate, storing the current value in a globalMap variable, e.g. my_building.
4. Read the buffer back in and put globalMap.get("my_building") inside your filtering conditions (tFilterRow).
5. Use globalMap.get("my_building") to parametrize your filename (tFileOutputDelimited).
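As a concrete sketch of the last two steps (assuming the iterated value was stored under the key "my_building", the tFilterRow runs in advanced mode, and the output folder is made up for illustration):

tFilterRow condition:
input_row.building.equals((String)globalMap.get("my_building"))

tFileOutputDelimited File Name:
"C:/output/" + ((String)globalMap.get("my_building")) + ".csv"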