
MergeContent with nifi - inconsistent length

Tags:

apache-nifi

I am attempting to write a file to disk with the MergeContent processor, but the merged files vary widely in size, anywhere from one line to 806 lines. I've repeated the process many times while working out the newline demarcator, as addressed in "Apache NIFi MergeContent processor - set demarcator as new line", and the resulting file sizes look essentially random.

What parameters do I need to set to adhere to the following logic?

  1. Establish a single bin
  2. Route all flowfiles into bin
  3. If len(bin)>X or the age of the bin is greater than Max Bin Age, release the bin
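Spelled out as code, the release rule I'm after would look roughly like the sketch below. The function and parameter names are made up for illustration and are not NiFi properties or APIs; `x` stands for the X threshold in step 3.

```python
def should_release(bin_size, bin_age_sec, x, max_bin_age_sec):
    """Return True when the single bin should be merged and written out.

    Hypothetical helper mirroring step 3 above; not part of NiFi.
    """
    # Release once the bin holds more than X flowfiles ...
    if bin_size > x:
        return True
    # ... or once the bin has been open longer than Max Bin Age.
    return bin_age_sec > max_bin_age_sec
```

With X = 5000 and a 10 second limit, a bin of 806 flowfiles that is 3 seconds old should stay open, which is not what I'm seeing.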

To fully document, I currently have the following attributes defined:

[Image: Merge Content Processor settings]

As you can see, I've set "Max Bin Age" to "10 sec", following the syntax in https://github.com/apache/nifi/blob/31fba6b3332978ca2f6a1d693f6053d719fb9daa/nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/test/java/org/apache/nifi/processors/standard/TestMergeContent.java#L219. That is the only place I've managed to find an example of this value; the documentation seems incomplete on this parameter.

I've set "Maximum Number of Entries" to 5000 and "Maximum number of Bins" to 1.

What do I need to do to aggregate my records following the logic above? I also tried using the "Correlation Attribute Name" parameter with an attribute guaranteed to be identical on all documents reaching this point, and saw the same inconsistent file sizes.

Josh Harrison asked Jan 23 '16 00:01



1 Answer

The most important setting here is actually Minimum Number of Entries. The binning algorithm is lenient about the number of items: once a bin holds at least the minimum number of entries, it is eligible to be merged, even if it is still far below the maximum.

For your specific logic, you would want to leave things as they stand and:

  • Set Minimum Number of Entries to 5000
  • Optionally, increase Maximum Number of Entries. Leaving it as configured will generate bins of exactly 5000 entries, except when Max Bin Age elapses before the bin fills
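To see why the minimum matters, here is a toy model of a single bin. It is a simplified sketch, not NiFi's actual implementation: it assumes one scheduler tick per arriving batch, measures bin age in ticks rather than seconds, and uses the hypothetical name `simulate_single_bin`. The batch sizes are invented, apart from 1 and 806, which are the extremes mentioned in the question.

```python
def simulate_single_bin(arrivals, min_entries, max_entries, max_bin_age_ticks):
    """Return the sizes of the merged bundles for a single bin.

    arrivals: number of flowfiles arriving on each tick (toy assumption).
    """
    merged_sizes = []
    bin_size = 0
    bin_age = 0
    for batch in arrivals:
        remaining = batch
        while remaining > 0:
            # Never put more than max_entries into the bin.
            take = min(max_entries - bin_size, remaining)
            bin_size += take
            remaining -= take
            if bin_size == max_entries:
                # Bin is full: merge it and start a fresh one.
                merged_sizes.append(bin_size)
                bin_size, bin_age = 0, 0
        if bin_size > 0:
            bin_age += 1
            # A non-full bin is merged once it meets the minimum
            # or once it has been open for too long.
            if bin_size >= min_entries or bin_age >= max_bin_age_ticks:
                merged_sizes.append(bin_size)
                bin_size, bin_age = 0, 0
    return merged_sizes


arrivals = [1, 806, 2500, 1693, 5000, 5000, 5000]  # 20000 flowfiles in total

# Minimum of 1: whatever is in the bin at each tick gets merged.
print(simulate_single_bin(arrivals, 1, 5000, 10))
# [1, 806, 2500, 1693, 5000, 5000, 5000]

# Minimum of 5000: the bin waits until it is full (or too old).
print(simulate_single_bin(arrivals, 5000, 5000, 10))
# [5000, 5000, 5000, 5000]
```

In this model a minimum of 1 reproduces the randomly sized files from the question, while a minimum of 5000 yields full bins, with Max Bin Age acting only as a fallback for a bin that never fills.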

Below is an image of the configuration above, where the minimum and maximum bin sizes are both 5000 and only 1 bin is handled at a time. In this case you'll see that exactly 20000 files have been merged into 4 files.

[Image: Sample execution for a min and max bin size of 5000]

apiri answered Nov 09 '22 05:11
