Can someone explain in detail how NiFi processors like GetFile or QueryDatabaseTable store rows when the next processor is not available to receive or process any data? Does the data get piled up in memory and then swapped to disk when the size exceeds some threshold? Is there a risk of running out of memory or losing data?
NiFi stores your data in repositories while it traverses its way through your system. There are three repositories: the "FlowFile Repository," the "Provenance Repository," and the "Content Repository." A FlowFile's attributes and state are written to the FlowFile Repository, while its content bytes are streamed to the Content Repository.
The core concepts of NiFi: a FlowFile represents each object moving through the system, and for each one NiFi keeps a map of key/value attribute strings along with its associated content of zero or more bytes. Processors are what actually perform the work.
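To make the FlowFile concept concrete, here is a minimal Python sketch. The class and field names are illustrative, not NiFi's actual API: the point is that a FlowFile carries small attribute strings, plus a pointer to where its content bytes live on disk, rather than the content itself.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ContentClaim:
    """Illustrative pointer to where content bytes live on disk
    (in NiFi, the content repository): a container, offset, and length."""
    container: str
    offset: int
    length: int

@dataclass
class FlowFile:
    # Key/value attribute strings kept with the FlowFile.
    attributes: dict = field(default_factory=dict)
    # Reference to the content; None models "zero bytes of content".
    claim: Optional[ContentClaim] = None

# A FlowFile for a 2 KB CSV file: only the attributes and the claim
# are held in memory, not the 2 KB of content.
ff = FlowFile(
    attributes={"filename": "data.csv", "mime.type": "text/csv"},
    claim=ContentClaim(container="content_repo/partition-1", offset=0, length=2048),
)
```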
The NiFi documentation is up to date and explains nicely how clusters work. To answer your question in short: if a node fails, the data that was on that node when it failed will require manual intervention to recover, and if you lose the storage on the failed node, you lose the data on that node.
When data is transferred to a clustered instance of NiFi via a Remote Process Group (RPG), the RPG will first connect to the remote instance whose URL is configured to determine which nodes are in the cluster and how busy each node is. This information is then used to load balance the data that is pushed to each node.
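The load-balancing idea can be sketched as follows: given each node's reported load (e.g. queued FlowFile count), send proportionally more data to less-busy nodes. The weighting scheme below is illustrative, not NiFi's exact algorithm.

```python
def distribute(batch_size, node_load):
    """Split batch_size flowfiles across nodes, favoring less-busy nodes.

    node_load: {node_name: queued_flowfile_count} as reported by each node.
    Returns {node_name: flowfiles_to_send}.
    """
    # Weight each node by the inverse of (load + 1), so idle nodes get more.
    weights = {n: 1.0 / (load + 1) for n, load in node_load.items()}
    total = sum(weights.values())
    shares = {n: int(batch_size * w / total) for n, w in weights.items()}
    # Hand any rounding remainder to the least-loaded node.
    remainder = batch_size - sum(shares.values())
    least_loaded = min(node_load, key=node_load.get)
    shares[least_loaded] += remainder
    return shares

# An idle node receives more of the batch than a busy one.
shares = distribute(100, {"node1": 0, "node2": 1000})
```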
I would recommend reading the Apache NiFi documentation, specifically the "Apache NiFi in Depth" document to understand how data is stored and passed through NiFi:
https://nifi.apache.org/docs/nifi-docs/html/nifi-in-depth.html
The short answer is that data is always written to disk in NiFi's internal repositories. A flow file has attributes that are persisted to the flow file repository and content that is persisted to the content repository. The content is not held in memory unless a processor chooses to read the entire content into memory to perform some processing.
When flow files are in a queue, none of the content is held in memory, just flow file objects that know where the content lives on disk. When the queue reaches a certain size, these flow file objects are themselves swapped to disk, which allows a queue to hold millions of flow files without keeping millions of flow file objects in memory.
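Here is a small Python sketch of that swapping behavior. The threshold and file layout are illustrative (NiFi's default swap threshold is 20,000 FlowFiles per queue, configurable via `nifi.queue.swap.threshold`): only a bounded number of lightweight records stay in memory, and overflow is serialized to swap files on disk until it is needed again.

```python
import json
import os
import tempfile
from collections import deque

class SwappingQueue:
    """Illustrative FIFO queue that swaps overflow records to disk."""

    def __init__(self, swap_threshold=3, swap_dir=None):
        self.swap_threshold = swap_threshold
        self.swap_dir = swap_dir or tempfile.mkdtemp()
        self.in_memory = deque()   # head of the queue, kept in memory
        self.swap_files = []       # oldest-first list of swap file paths

    def enqueue(self, record):
        self.in_memory.append(record)
        if len(self.in_memory) > self.swap_threshold:
            # Swap the newest records out; the head of the queue stays in memory.
            overflow = len(self.in_memory) - self.swap_threshold
            batch = [self.in_memory.pop() for _ in range(overflow)]
            path = os.path.join(self.swap_dir, f"swap-{len(self.swap_files)}.json")
            with open(path, "w") as f:
                json.dump(batch, f)
            self.swap_files.append(path)

    def dequeue(self):
        if not self.in_memory and self.swap_files:
            # Memory is drained: swap the oldest batch back in.
            path = self.swap_files.pop(0)
            with open(path) as f:
                self.in_memory.extend(json.load(f))
            os.remove(path)
        return self.in_memory.popleft() if self.in_memory else None

# Enqueue 5 records with an in-memory limit of 3; records 3 and 4
# are swapped to disk and later read back in FIFO order.
q = SwappingQueue(swap_threshold=3)
for i in range(5):
    q.enqueue({"id": i})
drained = [q.dequeue()["id"] for _ in range(5)]
```

Real swapping in NiFi is more involved (batched writes, crash recovery via the FlowFile Repository), but the principle is the same: queue depth is bounded on disk, not by JVM heap.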
There is also a concept of back-pressure, which caps a queue at a maximum size based on the number of flow files or the total size of all flow files in the queue.
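Back-pressure can be sketched like this, assuming NiFi's per-connection defaults of 10,000 objects and 1 GB of data: when either threshold is reached, the connection reports itself full and the upstream processor is not scheduled to run until the downstream side catches up.

```python
class Connection:
    """Illustrative back-pressure check on a queue between two processors."""

    def __init__(self, max_objects=10_000, max_bytes=1024 ** 3):
        self.max_objects = max_objects   # "Back Pressure Object Threshold"
        self.max_bytes = max_bytes       # "Back Pressure Data Size Threshold"
        self.queued_sizes = []           # sizes of queued flowfiles, in bytes
        self.queued_bytes = 0

    def is_full(self):
        # Back-pressure applies when either threshold is reached.
        return (len(self.queued_sizes) >= self.max_objects
                or self.queued_bytes >= self.max_bytes)

    def offer(self, flowfile_size):
        if self.is_full():
            return False   # upstream processor is not scheduled
        self.queued_sizes.append(flowfile_size)
        self.queued_bytes += flowfile_size
        return True

# Tiny thresholds for illustration: 2 objects or 100 bytes.
conn = Connection(max_objects=2, max_bytes=100)
accepted = [conn.offer(40), conn.offer(40), conn.offer(10)]
```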