 

Unable to use an existing Hive permanent UDF from Spark SQL


I have previously registered a UDF with Hive. It is permanent, not TEMPORARY. It works in beeline.

CREATE FUNCTION normaliseURL AS 'com.example.hive.udfs.NormaliseURL' USING JAR 'hdfs://udfs/hive-udfs.jar'; 

I have Spark configured to use the Hive metastore. The config is working, as I can query Hive tables. I can see the UDF:

In [9]: spark.sql('describe function normaliseURL').show(truncate=False)
+-------------------------------------------+
|function_desc                              |
+-------------------------------------------+
|Function: default.normaliseURL             |
|Class: com.example.hive.udfs.NormaliseURL  |
|Usage: N/A.                                |
+-------------------------------------------+

However, I cannot use the UDF in a SQL statement:

spark.sql('SELECT normaliseURL("value")')
AnalysisException: "Undefined function: 'default.normaliseURL'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 1 pos 7"

If I attempt to register the UDF with Spark (bypassing the metastore), the registration fails, suggesting that the function does already exist.

In [12]: spark.sql("create function normaliseURL as 'com.example.hive.udfs.NormaliseURL'")
AnalysisException: "Function 'default.normaliseURL' already exists in database 'default';"

I'm using Spark 2.0 and Hive metastore 1.1.0. The UDF is written in Scala; my Spark driver code is Python.

I'm stumped.

  • Am I correct in my assumption that Spark can utilise metastore-defined permanent UDFs?
  • Am I creating the function correctly in hive?
asked Aug 18 '16 by Rob Cowie


People also ask

Can we use Hive UDF in spark?

Spark SQL supports integration of Hive UDFs, UDAFs and UDTFs. Similar to Spark UDFs and UDAFs, Hive UDFs work on a single row as input and generate a single row as output, while Hive UDAFs operate on multiple rows and return a single aggregated row as a result.

Why we should not use UDF in spark?

It is well known that the use of UDFs (User Defined Functions) in Apache Spark, and especially in the Python API, can compromise application performance. For this reason, at Damavis we try to avoid their use as much as possible, in favour of native functions or SQL.
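To illustrate the point above, here is a minimal sketch contrasting a Python UDF with a native function. The `strip_www` logic and the `host` column are illustrative examples, not anything from the question; only `strip_www` itself runs without a Spark cluster.

```python
def strip_www(host: str) -> str:
    # Pure-Python body we would wrap as a UDF; testable without Spark.
    return host[4:] if host.startswith("www.") else host

def with_python_udf(df):
    # Anti-pattern sketch: every row is serialised out to a Python worker
    # process and back, which is the performance cost described above.
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType
    udf = F.udf(strip_www, StringType())
    return df.withColumn("host", udf("host"))

def with_native_function(df):
    # Preferred: regexp_replace runs entirely inside the JVM, no round-trip.
    from pyspark.sql import functions as F
    return df.withColumn("host", F.regexp_replace("host", r"^www\.", ""))
```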

How do I create a permanent function in hive?

If Hive is not in local mode, then the resource location must be a non-local URI such as an HDFS location. The function will be added to the database specified, or to the current database at the time that the function was created. The function can be referenced by fully qualifying the function name (db_name.function_name).

What is the use of spark UDF Register ()?

User-Defined Functions (UDFs) are user-programmable routines that act on one row. This documentation lists the classes that are required for creating and registering UDFs. It also contains examples that demonstrate how to define and register UDFs and invoke them in Spark SQL.
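As a sketch of `spark.udf.register()` in PySpark: assuming an active SparkSession, a plain Python function can be registered and then invoked from SQL by name. The normalisation logic here is hypothetical, not the asker's actual `NormaliseURL` implementation.

```python
from urllib.parse import urlsplit, urlunsplit

def normalise_url(url: str) -> str:
    # Hypothetical normalisation: lowercase scheme and host, drop a
    # trailing slash (empty paths become "/").
    scheme, netloc, path, query, fragment = urlsplit(url)
    return urlunsplit(
        (scheme.lower(), netloc.lower(), path.rstrip("/") or "/", query, fragment)
    )

def register(spark):
    # Needs a live SparkSession; after this, the function is callable
    # from SQL as normaliseURL, e.g.
    # spark.sql("SELECT normaliseURL('HTTP://Example.COM/path/')")
    from pyspark.sql.types import StringType
    spark.udf.register("normaliseURL", normalise_url, StringType())
```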


1 Answer

The issue is that Spark 2.0 is not able to execute functions whose JARs are located on HDFS.

Spark SQL: Thriftserver unable to run a registered Hive UDTF

One workaround is to define the function as a temporary function in the Spark job, with the JAR path pointing to a local edge-node path, and then call the function in the same Spark job.

CREATE TEMPORARY FUNCTION functionName as 'com.test.HiveUDF' USING JAR '/user/home/dir1/functions.jar' 
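From the Python driver, this workaround could be wrapped as a small helper; the function name, class, and JAR path below simply mirror the answer's placeholder example.

```python
def temp_function_ddl(name: str, class_name: str, local_jar: str) -> str:
    # Build the CREATE TEMPORARY FUNCTION statement. The JAR must sit on a
    # local (edge-node) path, not HDFS, to avoid the Spark 2.0 issue above.
    return (f"CREATE TEMPORARY FUNCTION {name} "
            f"AS '{class_name}' USING JAR '{local_jar}'")

def register_temporary(spark):
    # Needs an active SparkSession; values mirror the answer's example.
    spark.sql(temp_function_ddl(
        "functionName", "com.test.HiveUDF", "/user/home/dir1/functions.jar"))
```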
answered Sep 19 '22 by Manmohan