
How to join two CSVs with Apache NiFi

I'm looking into ETL tools (like Talend) and investigating whether Apache NiFi could be used instead. Could NiFi be used to perform the following:

  1. Pick up two CSV files that are placed on local disk
  2. Join the CSVs on a common column
  3. Write the joined CSV to disk

I've tried setting up a flow in NiFi, but couldn't see how to join the two separate CSV files. Is this task possible in Apache NiFi?

It looks like the QueryDNS processor could be used to enrich one CSV file using the other, but that seems over-complicated for this use case.

Here's an example of the input CSVs, which need to be joined on state_id:

Input files

customers.csv

id | name | address      | state_id
---|------|--------------|---------
1  | John | 10 Blue Lane | 100
2  | Bob  | 15 Green St. | 200

states.csv

state_id | state
---------|---------
100      | Alabama
200      | New York

Output file

output.csv

id | name | address      | state
---|------|--------------|---------
1  | John | 10 Blue Lane | Alabama
2  | Bob  | 15 Green St. | New York
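
For comparison, here's a minimal Python sketch of the join I'm trying to reproduce in NiFi (using inline sample data rather than the files on disk):

```python
import csv
import io

# Sample data matching the question; in practice these would be
# read from customers.csv and states.csv on local disk.
customers_csv = """id,name,address,state_id
1,John,10 Blue Lane,100
2,Bob,15 Green St.,200
"""

states_csv = """state_id,state
100,Alabama
200,New York
"""

def join_on_state_id(customers_text, states_text):
    """Join customer rows to state names on the state_id column."""
    # Build a lookup table from the smaller reference set.
    states = {row["state_id"]: row["state"]
              for row in csv.DictReader(io.StringIO(states_text))}
    joined = []
    for row in csv.DictReader(io.StringIO(customers_text)):
        joined.append({
            "id": row["id"],
            "name": row["name"],
            "address": row["address"],
            "state": states[row["state_id"]],  # inner join on state_id
        })
    return joined

rows = join_on_state_id(customers_csv, states_csv)
```

Running this produces the two joined rows shown in output.csv above.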
Andy Longwill, asked Mar 20 '17




2 Answers

Apache NiFi is more of a dataflow tool and not really made to perform arbitrary joins of streaming data. Those types of operations are typically better suited to stream processing systems like Storm, Flink, Apex, etc., or to ETL tools.

The types of joins that NiFi can do well are enrichment lookups, where there is a fixed-size lookup dataset and, for each record in the incoming data, you use the lookup dataset to retrieve some value. For example, in your case there could be a processor called LookUpState with a property "State Data" that points to a file containing all the states; customers.csv would then be the input to this processor.
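
To make the enrichment-lookup idea concrete, here is a hedged sketch of what such a hypothetical LookUpState processor would do per record: the reference data is loaded once, and each incoming record is enriched independently (LookUpState and its "State Data" property are illustrative, not a real NiFi processor):

```python
# Fixed lookup dataset, loaded once at startup (e.g. from the
# hypothetical "State Data" property pointing at states.csv).
STATE_LOOKUP = {"100": "Alabama", "200": "New York"}

def enrich_record(record, lookup=STATE_LOOKUP):
    """Enrich one incoming record using the fixed lookup dataset."""
    enriched = dict(record)
    state_id = enriched.pop("state_id")       # drop the join key
    enriched["state"] = lookup.get(state_id)  # add the looked-up value
    return enriched
```

Each record streams through on its own, so no buffering or windowing of the customer data is needed, which is why this pattern fits NiFi better than an arbitrary join.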

A community member started a project to make a generic lookup service for NiFi: https://github.com/jfrazee/nifi-lookup-service

Bryan Bende, answered Dec 09 '22


The typical pattern one follows for this is to load the reference set, in this case the states.csv data, into a map cache controller service in NiFi. The live feed of customer data then comes in and is enriched with this reference data using something like ReplaceText, or you could even write a custom processor in Groovy. There are a lot of ways to slice this, and there is also a JIRA/PR coming that will make it even easier. Some elements of live stream joins are best done in processing systems like Apache Storm, Spark, and Flink, but for the case you mention it can be done well in NiFi.
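
As a rough illustration of that cache pattern (the MapCache class below is a plain-Python stand-in for a NiFi map cache controller service, not its actual API): the reference set is put into the cache once, and each live customer record is enriched from it.

```python
class MapCache:
    """Illustrative stand-in for a map cache controller service."""
    def __init__(self):
        self._entries = {}

    def put(self, key, value):
        self._entries[key] = value

    def get(self, key):
        return self._entries.get(key)

# Load the reference set (states.csv) into the cache once.
cache = MapCache()
for state_id, state in [("100", "Alabama"), ("200", "New York")]:
    cache.put(state_id, state)

def enrich(record, cache):
    """Enrich one live customer record from the cached reference set."""
    out = dict(record)
    out["state"] = cache.get(out.pop("state_id"))
    return out
```

In NiFi the enrichment step would be a processor reading from the cache service rather than a function call, but the flow of data is the same.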

Joe Witt, answered Dec 09 '22