
Does hive instantiate a new UDF object for each record?

Tags: hadoop, hive

Say I'm building a UDF class called StaticLookupUDF that has to load some static data from a local file during construction.

In this case I want to avoid repeating work unnecessarily: I don't want to reload the static data on every call to the evaluate() method.

Clearly each mapper uses its own instantiation of the UDF, but does a new instance get generated for each record processed?

For example, a mapper is going to process 3 rows. Does it create a single StaticLookupUDF and call evaluate() 3 times, or does it create a new StaticLookupUDF for each record and call evaluate() only once per instance?

If the second case is true, how else should I structure this?

I couldn't find this anywhere in the docs. I'm going to look through the code, but figured I'd ask the smart people here at the same time.

Matthew Rathbone asked Nov 05 '22 12:11



1 Answer

Still not totally sure about this, but I got around it by having a static lazy value that loaded data as needed.

This way you have one instance of the static value per mapper. So if you're reading in a dataset and you have 6 map tasks, you'll read in the data 6 times. Not ideal, but better than once per record.
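As a rough illustration of the "static lazy value" idea, here is a minimal Java sketch. It uses the initialization-on-demand holder idiom, which is one way to get a lazily loaded static and not necessarily what I originally wrote. The class is a plain stand-in: a real Hive UDF would be public and extend Hive's UDF base class, and the loader here fills in a couple of made-up entries instead of reading the local file. The names Holder, TABLE, loadTable and LOAD_COUNT are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

// Stand-in for the UDF; assumption: the real class would extend Hive's UDF base class.
final class StaticLookupUDF {

    // Counts how often the data is loaded. For demonstration only.
    static final AtomicInteger LOAD_COUNT = new AtomicInteger();

    // The JVM initializes this nested class once, on first access,
    // so TABLE is built at most once per JVM (per class loader).
    private static final class Holder {
        static final Map<String, String> TABLE = loadTable();
    }

    // Hypothetical loader: a real one would parse the local file here.
    private static Map<String, String> loadTable() {
        LOAD_COUNT.incrementAndGet();
        Map<String, String> table = new HashMap<>();
        table.put("us", "United States");
        table.put("de", "Germany");
        return table;
    }

    // Called once per record; the first call triggers the load,
    // later calls (from any instance) reuse the same map.
    public String evaluate(String key) {
        if (key == null) {
            return null;
        }
        return Holder.TABLE.get(key);
    }
}
```

With this shape it doesn't matter whether the framework creates one instance or one per record: the table belongs to the class, not the instance, so the load happens once per JVM. Statics are not shared across JVMs, so the once-per-map-task cost described above still applies.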

Matthew Rathbone answered Nov 15 '22 07:11
