
Possible to use Spark Pandas UDF in pure Spark SQL?

This works:

from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf
import pandas as pd

spark = SparkSession.builder.getOrCreate()

@pandas_udf(returnType="long")
def add_one(v: pd.Series) -> pd.Series:
    return v.add(1)

spark.udf.register("add_one", add_one)

spark.sql("select add_one(1)").show()

However, I'm wondering if/how I can make the following work:

$ spark-sql -e 'select add_one(1)'
Asked Oct 18 '21 by Neil McGuigan


People also ask

Why UDF are not recommended in Spark?

It is well known that the use of UDFs (User Defined Functions) in Apache Spark, and especially in the Python API, can compromise application performance. For this reason, at Damavis we try to avoid their use as much as possible in favour of native functions or SQL.

What is difference between UDF and UDAF in Spark SQL?

Spark SQL supports integration of Hive UDFs, UDAFs and UDTFs. Similar to Spark UDFs and UDAFs, Hive UDFs work on a single row as input and generate a single row as output, while Hive UDAFs operate on multiple rows and return a single aggregated row as a result.
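
For illustration, here is a minimal PySpark sketch (not from any of the answers) contrasting the two shapes: a row-wise UDF that returns one value per input value, and an aggregate (UDAF-style) pandas UDF that reduces a group of rows to a single value. The column names id and v are placeholders.

import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 1.0), (1, 2.0), (2, 3.0)], ("id", "v"))

# Row-wise (UDF-like): one output value per input value
@pandas_udf("double")
def plus_one(v: pd.Series) -> pd.Series:
    return v + 1

# Aggregate (UDAF-like): many input rows reduced to one value per group
@pandas_udf("double")
def mean_v(v: pd.Series) -> float:
    return v.mean()

df.select(plus_one("v")).show()
df.groupBy("id").agg(mean_v("v")).show()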

Can we use pandas in Spark?

The pandas API on Spark is useful not only for pandas users but also for PySpark users, because it supports many tasks that are difficult to do with PySpark, for example plotting data directly from a PySpark DataFrame.
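
As a hedged sketch (assuming Spark 3.2+, where the pandas API on Spark ships as pyspark.pandas; the names below are illustrative only):

import pyspark.pandas as ps

# Build a pandas-on-Spark DataFrame and use familiar pandas-style calls
psdf = ps.DataFrame({"x": [1, 2, 3, 4], "y": [2.0, 4.0, 8.0, 16.0]})
print(psdf.describe())

# An existing PySpark DataFrame can be converted with sdf.pandas_api();
# plotting (e.g. psdf.plot.line()) additionally requires a plotting backend such as plotly.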

What is UDF in Spark SQL?

Description. User-Defined Functions (UDFs) are user-programmable routines that act on one row. This documentation lists the classes that are required for creating and registering UDFs. It also contains examples that demonstrate how to define and register UDFs and invoke them in Spark SQL.
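
As a small, hedged sketch of that define/register/invoke flow (the function name times_two is just an example, assuming an existing SparkSession named spark):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Define and register a row-at-a-time UDF under a SQL name, then invoke it in SQL
spark.udf.register("times_two", lambda x: x * 2, "long")
spark.sql("SELECT times_two(21)").show()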

What is pandas UDF in spark?

Pandas UDFs are user defined functions that are executed by Spark using Arrow to transfer data and Pandas to work with the data, which allows vectorized operations. A Pandas UDF is defined using the pandas_udf as a decorator or to wrap the function, and no additional configuration is required.
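
To make the "decorator or wrap" point concrete, a minimal sketch (the names here are illustrative):

import pandas as pd
from pyspark.sql.functions import pandas_udf

# Decorator form, as in the question
@pandas_udf("long")
def add_one(v: pd.Series) -> pd.Series:
    return v + 1

# Wrapper form: pandas_udf wraps an existing function
def _add_one(v: pd.Series) -> pd.Series:
    return v + 1

add_one_wrapped = pandas_udf(_add_one, returnType="long")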

What is UDF (user defined function) in spark?

A Spark SQL UDF (a.k.a. User Defined Function) is the most useful feature of Spark SQL & DataFrame, as it extends Spark's built-in capabilities. In this article, I will explain what a UDF is, why we need it, and how to create and use it on DataFrames and in SQL, using a Scala example.

What is the difference between pandas UDF and python function?

A Pandas UDF is defined using pandas_udf as a decorator or to wrap the function, and no additional configuration is required. A Pandas UDF behaves like a regular PySpark function API in general. New in version 2.3.0. pandas_udf takes the user-defined function (a plain Python function, which can also be used standalone) and the return type of the user-defined function.
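
A minimal sketch of the contrast (illustrative names; both functions compute the same thing):

import pandas as pd
from pyspark.sql.functions import udf, pandas_udf

# Plain Python UDF: invoked once per row, scalar in / scalar out
@udf("long")
def inc_plain(x):
    return x + 1

# Pandas UDF: invoked once per batch, pd.Series in / pd.Series out (vectorized via Arrow)
@pandas_udf("long")
def inc_vectorized(v: pd.Series) -> pd.Series:
    return v + 1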

What is the best way to optimize UDFs in spark?

UDFs are a black box to Spark, hence it can't apply optimizations to them, and you lose all the optimization Spark does on DataFrames/Datasets. When possible you should use Spark SQL built-in functions, as these functions provide optimization.
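
A PySpark sketch of the same advice (the original snippet was Scala; this version is only illustrative):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.functions import udf

spark = SparkSession.builder.getOrCreate()
df = spark.range(5).withColumnRenamed("id", "v")

# UDF version: a black box the optimizer cannot see into
inc_udf = udf(lambda x: x + 1, "long")
df.select(inc_udf("v")).show()

# Built-in column expression: Catalyst can optimize this
df.select((F.col("v") + 1).alias("v_plus_one")).show()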


3 Answers

It would indeed be very nice if one could do that.

I'm afraid this is currently not possible. The funny thing is that nobody actually mentions it.

The information is actually "hidden" in the Apache Spark documentation, in a small note:

Note that the Spark SQL CLI cannot talk to the Thrift JDBC server.

As you can probably see from the implications, this means you can't call such UDFs from the spark-sql CLI. Here is the link to the documentation.

One can double-check the bin/spark-sql source code on GitHub to see what it actually does:

if [ -z "${SPARK_HOME}" ]; then
  source "$(dirname "$0")"/find-spark-home
fi

export _SPARK_CMD_USAGE="Usage: ./bin/spark-sql [options] [cli option]"
exec "${SPARK_HOME}"/bin/spark-submit --class org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver "$@"

That confirms it again: spark-sql simply runs spark-submit with the SparkSQLCLIDriver class from the thriftserver package, so you can't use the UDF from the spark-sql CLI.

Answered Oct 21 '22 by tukan


Pandas UDFs are vectorized UDFs meant to avoid row-by-row iteration inside PySpark. Once these UDFs are registered, they behave like PySpark function APIs. They reside and run inside the Python workers.
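
For example, a minimal sketch reusing the add_one UDF from the question (assuming an existing SparkSession named spark):

import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.getOrCreate()

@pandas_udf("long")
def add_one(v: pd.Series) -> pd.Series:
    return v + 1

# Once defined, the pandas UDF is called like any other PySpark column function
df = spark.range(3)
df.select(add_one(df["id"]).alias("id_plus_one")).show()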

As @tukan mentioned, the Spark SQL CLI cannot talk to the Thrift JDBC server, so Spark doesn't natively support this.

However, you could make a custom RPC call to invoke it directly, but that's neither as easy as, nor the same as, what you want to do in the first place.

Answered Oct 21 '22 by Ashvjit Singh


It's not possible to use a Python UDF in the way you want at this moment. But the option is available for Scala/Java UDFs, so if you're open to using Scala/Java, this is one way to do it. Note: I'm implementing a Hive UDF, as Spark supports Hive UDFs.

The first thing you need to do is create a Java project with the following sample structure:

root
| - pom.xml
| - src/main/java/com/test/udf/SimpleConcatUDF.java

pom.xml

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>

  <groupId>com.test.udf</groupId>
  <artifactId>simple-concat-udf</artifactId>
  <version>1.0-SNAPSHOT</version>
  <packaging>jar</packaging>

  <properties>
    <hive.version>3.1.2</hive.version>
  </properties>

  <repositories>
    <repository>
      <id>hortonworks</id>
      <url>http://repo.hortonworks.com/content/groups/public</url>
      <snapshots>
        <enabled>true</enabled>
      </snapshots>
    </repository>
  </repositories>

  <dependencies>
    <dependency>
      <groupId>org.apache.hive</groupId>
      <artifactId>hive-exec</artifactId>
      <version>${hive.version}</version>
    </dependency>
  </dependencies>

  <build>
    <pluginManagement>
      <plugins>
        <plugin>
          <groupId>org.apache.maven.plugins</groupId>
          <artifactId>maven-compiler-plugin</artifactId>
          <version>2.3.2</version>
          <configuration>
            <source>1.6</source>
            <target>1.6</target>
          </configuration>
        </plugin>
        <plugin>
          <groupId>org.apache.maven.plugins</groupId>
          <artifactId>maven-eclipse-plugin</artifactId>
          <version>2.9</version>
          <configuration>
            <useProjectReferences>false</useProjectReferences>
          </configuration>
        </plugin>
        <plugin>
          <artifactId>maven-assembly-plugin</artifactId>
          <configuration>
            <appendAssemblyId>false</appendAssemblyId>
            <descriptorRefs>
              <descriptorRef>jar-with-dependencies</descriptorRef>
            </descriptorRefs>
            <archive>
              <manifest>
                <mainClass>com.test.udf.SimpleConcatUDF</mainClass>
              </manifest>
            </archive>
          </configuration>
        </plugin>
      </plugins>
    </pluginManagement>
  </build>

</project>

SimpleConcatUDF.java

package com.test.udf;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class SimpleConcatUDF extends UDF {

  public String evaluate(final Text text) {
    return text.toString() + "_from_udf";
  }

}

The next thing you'd want to do is compile and package it. I'm using Maven, so the standard commands are:

cd <project-root-path>/
mvn clean install
# output jar file is located at <project-root-path>/target/simple-concat-udf-1.0-SNAPSHOT.jar

Finally, you'd need to register it using CREATE FUNCTION. This only needs to be done once if you register the function as permanent; otherwise, you can register it as temporary.

spark-sql> create function simple_concat AS 'com.test.udf.SimpleConcatUDF' using jar '<project-root-path>/target/simple-concat-udf-1.0-SNAPSHOT.jar';
spark-sql> show user functions;
default.simple_concat
Time taken: 1.868 seconds, Fetched 1 row(s)
spark-sql> select simple_concat('a');
a_from_udf
Time taken: 0.079 seconds, Fetched 1 row(s)

NOTE: If you have HDFS in your system, you'd want to copy the jar file to HDFS and create the function using that HDFS path instead of a local path like above.

Answered Oct 21 '22 by pltc