Say I'm building a UDF class called StaticLookupUDF that has to load some static data from a local file during construction.
I want to avoid repeating work unnecessarily: I don't want to reload the static data on every call to the evaluate() method.
Clearly each mapper uses its own instance of the UDF, but does a new instance get created for each record processed?
For example, a mapper is going to process 3 rows. Does it create a single StaticLookupUDF and call evaluate() 3 times, or does it create a new StaticLookupUDF for each record, and call evaluate only once per instance?
If the second case is true, how should I structure this instead?
I couldn't find this anywhere in the docs. I'm going to look through the code, but figured I'd ask the smart people here at the same time.
I'm still not totally sure about this, but I got around it by using a static lazy value that loads the data the first time it is needed.
This way you have one instance of the static value per mapper, no matter how many UDF instances get created. So if you're reading in a dataset and you have 6 map tasks, you'll read in the data 6 times. Not ideal, but better than once per record.
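For anyone who wants to see the shape of it, here is a minimal Java sketch of the static lazy value idea, using the initialization-on-demand holder idiom. It is an illustration under assumptions, not the exact code I ran: the Holder class, loadLookup(), LOAD_COUNT and the placeholder map entries are all hypothetical names, the file read is replaced by hard-coded data, and the class deliberately does not extend Hive's UDF base class so that it compiles on its own.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch only: a real Hive UDF would be declared public and extend Hive's UDF base class.
class StaticLookupUDF {

    // Hypothetical counter, here only to show how many times the data is loaded.
    static final AtomicInteger LOAD_COUNT = new AtomicInteger();

    // The JVM initializes Holder once, on first access to Holder.LOOKUP,
    // and guarantees that class initialization is thread-safe.
    private static final class Holder {
        static final Map<String, String> LOOKUP = loadLookup();
    }

    // Stand-in for reading the local file (assumption: the real version parses a file here).
    private static Map<String, String> loadLookup() {
        LOAD_COUNT.incrementAndGet();
        Map<String, String> data = new HashMap<>();
        data.put("a", "alpha");
        data.put("b", "beta");
        return data;
    }

    // Called once per record; only reads the shared, already-loaded map.
    public String evaluate(String key) {
        if (key == null) {
            return null;
        }
        return Holder.LOOKUP.get(key);
    }
}
```

Because the map lives in a static field, it does not matter whether a new StaticLookupUDF is created per record or per mapper: every instance in the same JVM shares one copy, and the load runs once per JVM.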