I followed a post on StackOverflow about returning the maximum of a column grouped by another column, and got an unexpected Java exception.
Here is the test data:
import pyspark.sql.functions as f
data = [('a', 5), ('a', 8), ('a', 7), ('b', 1), ('b', 3)]
df = spark.createDataFrame(data, ["A", "B"])
df.show()
+---+---+
|  A|  B|
+---+---+
|  a|  5|
|  a|  8|
|  a|  7|
|  b|  1|
|  b|  3|
+---+---+
Here is the solution that reportedly works for other users:
from pyspark.sql import Window
w = Window.partitionBy('A')
df.withColumn('maxB', f.max('B').over(w))\
    .where(f.col('B') == f.col('maxB'))\
    .drop('maxB')\
    .show()
which should produce this output:
#+---+---+
#|  A|  B|
#+---+---+
#|  a|  8|
#|  b|  3|
#+---+---+
Instead, I get:
java.lang.UnsupportedOperationException: Cannot evaluate expression: max(input[2, bigint, false]) windowspecdefinition(input[0, string, true], specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$()))
I have only tried this on Spark 2.4 on Databricks. I tried the equivalent SQL syntax and got the same error.
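For reference, the SQL I tried was along these lines (paraphrased; the temp view name t is just for illustration):

# Register a temp view so the same query can be expressed in SQL
df.createOrReplaceTempView("t")
spark.sql("""
    SELECT A, B
    FROM (SELECT A, B, max(B) OVER (PARTITION BY A) AS maxB FROM t) tmp
    WHERE B = maxB
""").show()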
Databricks Support was able to reproduce the issue on Spark 2.4 but not on earlier versions. Apparently, it arises from a difference in the way the physical plan is formulated (I can post their response if requested). A fix is planned.
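If you want to compare the plans yourself, calling explain() prints the physical plan for the query (a sketch; note that on 2.4 this may raise the same exception, since the failure appears to happen while the plan is being formulated):

# Print the physical plan for the windowed query; comparing this output
# across Spark versions shows how the plan formulation differs.
df.withColumn('maxB', f.max('B').over(w))\
    .where(f.col('B') == f.col('maxB'))\
    .explain()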
Meanwhile, here is one alternative solution to the original problem that does not fall prey to the version 2.4 issue:
df.withColumn("maxB", f.max('B').over(w)).drop('B').distinct().show()
+---+----+
|  A|maxB|
+---+----+
|  b|   3|
|  a|   8|
+---+----+
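For completeness: if you only need the per-group maximum, as in the output above, a plain groupBy aggregation avoids window functions entirely, and joining the aggregate back recovers the full matching rows. This is my own sketch, not the workaround from Databricks Support:

# Aggregate directly; no window function involved
df.groupBy('A').agg(f.max('B').alias('maxB')).show()

# To recover the complete winning rows, join the per-group maxima back
maxes = df.groupBy('A').agg(f.max('B').alias('B'))
df.join(maxes, on=['A', 'B'], how='inner').show()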