 

How to use first and last function in pyspark?

I used the first and last functions to get the first and last values of one column, but neither function works the way I expected. I referred to the answer by @zero323, but I am still confused by both of them. My code looks like:

from pyspark.sql import Window
from pyspark.sql import functions as F

df = spark.sparkContext.parallelize([
    ("a", None), ("a", 1), ("a", -1), ("b", 3), ("b", 1)
]).toDF(["k", "v"])
w = Window.partitionBy("k").orderBy("k", "v")

df.select(F.col("k"), F.last("v", True).over(w).alias("v")).show()

the result:

+---+----+
|  k|   v|
+---+----+
|  b|   1|
|  b|   3|
|  a|null|
|  a|  -1|
|  a|   1|
+---+----+

I supposed it should be like:

+---+----+
|  k|   v|
+---+----+
|  b|   3|
|  b|   3|
|  a|   1|
|  a|   1|
|  a|   1|
+---+----+

because this is what df looks like when ordered by 'k' and 'v':

df.orderBy('k','v').show()
+---+----+
|  k|   v|
+---+----+
|  a|null|
|  a|  -1|
|  a|   1|
|  b|   1|
|  b|   3|
+---+----+

Additionally, I tried another way to test this kind of problem:

df.orderBy('k','v').groupBy('k').agg(F.first('v')).show()

I found that its results can differ each time I run it. Has anyone had the same experience? I would like to use both functions in my project, but these results seem inconclusive.

Asked Mar 30 '17 by Ivan Lee

1 Answer

Have a look at Question 47130030.
The issue is not with the last() function but with the window frame. When an orderBy is specified, the default frame runs from the start of the partition only up to the current row, so last() sees a different frame for each row. Using an explicit unbounded frame

w = Window.partitionBy("k").orderBy("k", "v").rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

will yield the expected results for first() and last().

Answered Oct 14 '22 by Elke