
In NiFi, what is the difference between FirstInFirstOutPrioritizer and OldestFlowFileFirstPrioritizer?

The user guide (https://nifi.apache.org/docs/nifi-docs/html/user-guide.html) has the details below on prioritizers. Could you please help me understand how these two differ, and provide a real-world example?

FirstInFirstOutPrioritizer: Given two FlowFiles, the one that reached the connection first will be processed first.

OldestFlowFileFirstPrioritizer: Given two FlowFiles, the one that is oldest in the dataflow will be processed first. 'This is the default scheme that is used if no prioritizers are selected.'

venkata asked Apr 05 '18 14:04




2 Answers

Imagine two processors A and B that are both connected to a funnel, and then the funnel connects to processor C.

Scenario 1 - The connection between the funnel and processor C uses the first-in-first-out prioritizer.

In this case, the flow files in the queue between the funnel and processor C will be processed strictly in the order they reached the queue.

Scenario 2 - The connection between the funnel and processor C uses the oldest-flow-file-first prioritizer.

In this case, there could already be flow files in the queue between the funnel and processor C, but if one of the processors transfers a flow file to that queue that is older than all the flow files already in it, that flow file will jump to the front.

You could imagine that some flow files come from a different portion of the flow that takes way longer to process than other flow files, but they both end up funneled into the same queue, so these flow files from the longer processing part are considered older.
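The two scenarios can be sketched as a small simulation. This is only an illustration and not NiFi code: the `FlowFile` class, its `lineage_start` and `queue_entry` fields, and the timestamps are all hypothetical stand-ins for "when the data entered the flow" and "when it reached this connection".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FlowFile:
    name: str
    lineage_start: int  # hypothetical: second at which the data entered the flow
    queue_entry: int    # hypothetical: second at which it reached this connection


def first_in_first_out(queue):
    # Scenario 1: order by arrival time at this particular queue.
    return sorted(queue, key=lambda ff: ff.queue_entry)


def oldest_flow_file_first(queue):
    # Scenario 2: order by age in the overall dataflow.
    return sorted(queue, key=lambda ff: ff.lineage_start)


# Processor A is fast; processor B sits behind a slow branch, so its flow file
# entered the flow earliest but reached the funnel's queue last.
queue = [
    FlowFile("from_A_1", lineage_start=10, queue_entry=11),
    FlowFile("from_A_2", lineage_start=12, queue_entry=13),
    FlowFile("from_B_slow", lineage_start=5, queue_entry=15),
]

fifo_order = [ff.name for ff in first_in_first_out(queue)]
oldest_order = [ff.name for ff in oldest_flow_file_first(queue)]

print(fifo_order)    # from_B_slow comes last
print(oldest_order)  # from_B_slow jumps to the front
```

With first-in-first-out, the flow file from B waits behind everything already queued; with oldest-first, it is handed to processor C before the two flow files from A.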

Bryan Bende answered Oct 10 '22 09:10



Apache NiFi handles data from many disparate sources and can route it through a number of different processors. Let's use the following example (ignore the processor types, just focus on the titles):

[Image: NiFi flow demonstrating prioritization scenarios]

First, the relative rate of incoming data can be different depending on the source/ingestion point. In this case, the database poll is being done once per minute, while the HTTP poll is every 5 seconds, and the file tailing is every second. So even if a database record is 59 seconds "older" than another, if they are captured in the same execution of the processor, they will enter NiFi at the same time and the flowfile(s) (depending on splitting) will have the same origin time.

If some data coming into the system "is dirty", it gets routed to a processor which "cleans" it. This processor takes 3 seconds to execute.

If both the clean relationship and the success relationship from "Clean Data" went directly to "Process Data", you wouldn't be able to control the order in which those flowfiles were processed. However, because there is a funnel that merges those queues, you can choose a prioritizer on the queued queue and control that order. Do you want the first flowfile to enter that queue to be processed first, or do you want flowfiles that entered NiFi earlier to be processed first, even if they entered this specific queue after a newer flowfile?

This is a contrived example, but you can apply this to disaster recovery situations where some data was missed for a time window and is now being recovered, or a flow that processes time-sensitive data and the insights aren't valid after a certain period of time has passed. If using backpressure or getting data in large (slow) batches, you can see how in some cases, oldest first is less valuable and vice versa.
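The clean/dirty case above can be sketched roughly as follows. This is a toy model rather than NiFi's implementation: the flowfile names and entry times are made up, only the 3-second cleaning time comes from the example, and it assumes all three flowfiles are already sitting in the merged queue (for instance because "Process Data" is slow) before any of them is pulled.

```python
import heapq

CLEAN_SECONDS = 3  # cleaning time taken from the example above

# (name, second at which the flowfile entered NiFi, whether it is dirty)
# These values are invented for illustration.
arrivals = [("dirty_0", 0, True), ("clean_1", 1, False), ("clean_2", 2, False)]


def drain(priority_key):
    """Return flowfile names in the order a queue sorted by priority_key releases them."""
    heap = []
    for name, entered_nifi, dirty in arrivals:
        # Dirty data takes a detour through "Clean Data" before reaching the queue.
        reached_queue = entered_nifi + (CLEAN_SECONDS if dirty else 0)
        ff = {"entered_nifi": entered_nifi, "reached_queue": reached_queue}
        heapq.heappush(heap, (ff[priority_key], name))
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]


first_in_order = drain("reached_queue")  # first to enter this queue goes first
oldest_order = drain("entered_nifi")     # earliest to enter NiFi goes first

print(first_in_order)
print(oldest_order)
```

Here the dirty flowfile entered NiFi first but reached the merged queue last, so the two orderings put it at opposite ends of the queue.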

Andy answered Oct 10 '22 09:10
