I followed a post on StackOverflow about returning the maximum of a column grouped by another column, and got an unexpected Java exception.
Here is the test data:
import pyspark.sql.functions as f
data = [('a', 5), ('a', 8), ('a', 7), ('b', 1), ('b', 3)]
df = spark.createDataFrame(data, ["A", "B"])
df.show()
+---+---+
|  A|  B|
+---+---+
|  a|  5|
|  a|  8|
|  a|  7|
|  b|  1|
|  b|  3|
+---+---+
Here is the solution that reportedly works for other users:
from pyspark.sql import Window
w = Window.partitionBy('A')
df.withColumn('maxB', f.max('B').over(w))\
    .where(f.col('B') == f.col('maxB'))\
    .drop('maxB')\
    .show()
which should produce this output:
#+---+---+
#|  A|  B|
#+---+---+
#|  a|  8|
#|  b|  3|
#+---+---+
Instead, I get:
java.lang.UnsupportedOperationException: Cannot evaluate expression: max(input[2, bigint, false]) windowspecdefinition(input[0, string, true], specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$()))
I have only tried this on Spark 2.4 on Databricks. I tried the equivalent SQL syntax and got the same error.
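For reference, the SQL I tried was along these lines (paraphrased; the temp view name t is just for illustration):

# Register a temp view so the same query can be expressed in SQL
df.createOrReplaceTempView("t")
spark.sql("""
    SELECT A, B
    FROM (SELECT A, B, max(B) OVER (PARTITION BY A) AS maxB FROM t) tmp
    WHERE B = maxB
""").show()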
Databricks Support was able to reproduce the issue on Spark 2.4 but not on earlier versions. Apparently, it arises from a difference in the way the physical plan is formulated (I can post their response if requested). A fix is planned.
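If you want to compare the plans yourself, calling explain() prints the physical plan for the query (a sketch; note that on 2.4 this may raise the same exception, since the failure appears to happen while the plan is being formulated):

# Print the physical plan for the windowed query; comparing this output
# across Spark versions shows how the plan formulation differs.
df.withColumn('maxB', f.max('B').over(w))\
    .where(f.col('B') == f.col('maxB'))\
    .explain()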
Meanwhile, here is one alternative solution to the original problem that does not fall prey to the version 2.4 issue:
df.withColumn("maxB", f.max('B').over(w)).drop('B').distinct().show()
+---+----+
|  A|maxB|
+---+----+
|  b|   3|
|  a|   8|
+---+----+
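For completeness: if you only need the per-group maximum, as in the output above, a plain groupBy aggregation avoids window functions entirely, and joining the aggregate back recovers the full matching rows. This is my own sketch, not the workaround from Databricks Support:

# Aggregate directly; no window function involved
df.groupBy('A').agg(f.max('B').alias('maxB')).show()

# To recover the complete winning rows, join the per-group maxima back
maxes = df.groupBy('A').agg(f.max('B').alias('B'))
df.join(maxes, on=['A', 'B'], how='inner').show()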