I'm trying to use Spark 1.4 window functions in pyspark 1.4.1
but getting mostly errors or unexpected results. Here is a very simple example that I think should work:
from pyspark.sql.window import Window
import pyspark.sql.functions as func
l = [(1,101),(2,202),(3,303),(4,404),(5,505)]
df = sqlContext.createDataFrame(l,["a","b"])
wSpec = Window.orderBy(df.a).rowsBetween(-1,1)
df.select(df.a, func.rank().over(wSpec).alias("rank"))
==> Failure org.apache.spark.sql.AnalysisException: Window function rank does not take a frame specification.
df.select(df.a, func.lag(df.b,1).over(wSpec).alias("prev"), df.b, func.lead(df.b,1).over(wSpec).alias("next"))
===> org.apache.spark.sql.AnalysisException: Window function lag does not take a frame specification.;
wSpec = Window.orderBy(df.a)
df.select(df.a, func.rank().over(wSpec).alias("rank"))
===> org.apache.hadoop.hive.ql.exec.UDFArgumentTypeException: One or more arguments are expected.
df.select(df.a, func.lag(df.b,1).over(wSpec).alias("prev"), df.b, func.lead(df.b,1).over(wSpec).alias("next")).collect()
[Row(a=1, prev=None, b=101, next=None), Row(a=2, prev=None, b=202, next=None), Row(a=3, prev=None, b=303, next=None)]
As you can see, if I add the rowsBetween frame specification, neither rank() nor lag()/lead() accepts it: "Window function does not take a frame specification". If I omit the rowsBetween frame specification, at least lag()/lead() do not throw exceptions, but they return unexpected (for me) results: always None. And rank() still fails, with a different exception.
Can anybody help me to get my window functions right?
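To make "unexpected (for me)" concrete, here is a plain-Python sketch (no Spark involved) of the results I expect the window functions to produce over this sample data, ordered by "a":

```python
# Sample data from the example above
l = [(1, 101), (2, 202), (3, 303), (4, 404), (5, 505)]
rows = sorted(l)  # ORDER BY a

# rank(): 1-based position within the ordering (no ties here)
rank = [(a, i + 1) for i, (a, b) in enumerate(rows)]
# [(1, 1), (2, 2), (3, 3), (4, 4), (5, 5)]

# lag(b, 1): the previous row's b, or None on the first row
prev = [rows[i - 1][1] if i > 0 else None for i in range(len(rows))]
# [None, 101, 202, 303, 404]

# lead(b, 1): the next row's b, or None on the last row
nxt = [rows[i + 1][1] if i < len(rows) - 1 else None for i in range(len(rows))]
# [202, 303, 404, 505, None]
```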
UPDATE
All right, this is starting to look like a pyspark bug. I have prepared the same test in pure Spark (Scala, spark-shell):
import sqlContext.implicits._
import org.apache.spark.sql._
import org.apache.spark.sql.types._
val l: List[Tuple2[Int,Int]] = List((1,101),(2,202),(3,303),(4,404),(5,505))
val rdd = sc.parallelize(l).map(i => Row(i._1,i._2))
val schemaString = "a b"
val schema = StructType(schemaString.split(" ").map(fieldName => StructField(fieldName, IntegerType, true)))
val df = sqlContext.createDataFrame(rdd, schema)
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val wSpec = Window.orderBy("a").rowsBetween(-1,1)
df.select(df("a"), rank().over(wSpec).alias("rank"))
==> org.apache.spark.sql.AnalysisException: Window function rank does not take a frame specification.;
df.select(df("a"), lag(df("b"),1).over(wSpec).alias("prev"), df("b"), lead(df("b"),1).over(wSpec).alias("next"))
===> org.apache.spark.sql.AnalysisException: Window function lag does not take a frame specification.;
val wSpec = Window.orderBy("a")
df.select(df("a"), rank().over(wSpec).alias("rank")).collect()
====> res10: Array[org.apache.spark.sql.Row] = Array([1,1], [2,2], [3,3], [4,4], [5,5])
df.select(df("a"), lag(df("b"),1).over(wSpec).alias("prev"), df("b"), lead(df("b"),1).over(wSpec).alias("next")).collect()
====> res12: Array[org.apache.spark.sql.Row] = Array([1,null,101,202], [2,101,202,303], [3,202,303,404], [4,303,404,505], [5,404,505,null])
Even though the rowsBetween frame specification cannot be applied in Scala either, both rank() and lag()/lead() work as I expect when rowsBetween is omitted.
As far as I can tell there are two different problems. A window frame definition is simply not supported by the Hive GenericUDAFRank, GenericUDAFLag and GenericUDAFLead functions, so the errors you see are expected behavior.
Regarding the issue with the following PySpark code
wSpec = Window.orderBy(df.a)
df.select(df.a, func.rank().over(wSpec).alias("rank"))
it looks like it is related to my question https://stackoverflow.com/q/31948194/1560062 and should be addressed by SPARK-9978. For now, you can make it work by changing the window definition to this:
wSpec = Window.partitionBy().orderBy(df.a)