I am using PySpark 1.5, getting my data from Hive tables and trying to use windowing functions.
According to this, there exists an analytic function called first_value that will give me the first non-null value for a given window. I know this exists in Hive, but I cannot find it in PySpark anywhere.
Is there a way to implement this, given that PySpark won't allow UserDefinedAggregateFunctions (UDAFs)?
PySpark window functions perform operations such as rank, row number, etc. over a group, frame, or collection of rows and return a result for each row individually. They are also increasingly popular for data transformations.
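For illustration, here is a minimal sketch of ranking functions evaluated over a window; the column names and toy data are invented for the example, and a pyspark shell where sc is already defined is assumed:

from pyspark.sql import Window
from pyspark.sql.functions import rank, row_number

# toy data: one value column "v" per key "k"
df = sc.parallelize([("a", 1), ("a", 2), ("b", 5)]).toDF(["k", "v"])

w = Window.partitionBy("k").orderBy("v")

# rank() and row_number() produce one result per row within its partition
df.select("k", "v",
          rank().over(w).alias("rnk"),
          row_number().over(w).alias("rn")).show()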
UNBOUNDED PRECEDING indicates that the window starts at the first row of the partition; offset PRECEDING indicates that the window starts a number of rows equivalent to the value of offset before the current row. UNBOUNDED PRECEDING is the default. CURRENT ROW indicates the window begins or ends at the current row.
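As a sketch of how these boundaries map onto the DataFrame API (the Window.unboundedPreceding and Window.currentRow constants exist in newer PySpark releases; on older versions the same frame can be written in SQL as ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW):

from pyspark.sql import Window
from pyspark.sql.functions import sum as sum_

df = sc.parallelize([("a", 1), ("a", 2), ("a", 4), ("b", 5)]).toDF(["k", "v"])

# running total per key: the frame runs from the first row of the partition
# (UNBOUNDED PRECEDING) up to the current row (CURRENT ROW)
w = (Window.partitionBy("k").orderBy("v")
     .rowsBetween(Window.unboundedPreceding, Window.currentRow))

df.select("k", "v", sum_("v").over(w).alias("running_sum")).show()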
Window function: returns the value that is offset rows after the current row, and default if there are fewer than offset rows after the current row. For example, an offset of one will return the next row at any given point in the window partition. This is equivalent to the LEAD function in SQL and is exposed as lead in PySpark.
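A minimal sketch of lead over an ordered window (toy data invented for the example):

from pyspark.sql import Window
from pyspark.sql.functions import lead

df = sc.parallelize([("a", 1), ("a", 2), ("a", 4), ("b", 5)]).toDF(["k", "v"])

w = Window.partitionBy("k").orderBy("v")

# "next_v" holds the value of "v" from the following row in the partition,
# or null when there is no following row
df.select("k", "v", lead("v", 1).over(w).alias("next_v")).show()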
ntile(n) is a window function that returns the ntile group id (from 1 to n inclusive) in an ordered window partition. For example, if n is 4, the first quarter of the rows will get value 1, the second quarter will get 2, the third quarter will get 3, and the last quarter will get 4.
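And a similar sketch for ntile, again with invented toy data:

from pyspark.sql import Window
from pyspark.sql.functions import ntile

# eight rows under a single key, so ntile(4) yields four buckets of two rows
df = sc.parallelize([("a", i) for i in range(1, 9)]).toDF(["k", "v"])

w = Window.partitionBy("k").orderBy("v")

df.select("k", "v", ntile(4).over(w).alias("quartile")).show()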
Spark >= 2.0:
first takes an optional ignorenulls argument which can mimic the behavior of first_value:
df.select(col("k"), first("v", True).over(w).alias("fv"))
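For context, a self-contained sketch of that Spark >= 2.0 one-liner; it borrows the sample data and window definition from the Spark < 2.0 example below and adds the imports the snippet assumes:

from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import col, first

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("a", None), ("a", 1), ("a", -1), ("b", 3)], ["k", "v"])

w = Window.partitionBy("k").orderBy("v")

# ignorenulls=True makes first() skip nulls, so "fv" is the first non-null
# "v" within each row's window frame
df.select(col("k"), first("v", ignorenulls=True).over(w).alias("fv")).show()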
Spark < 2.0:
The available function is called first and can be used as follows:
from pyspark.sql import Window
from pyspark.sql.functions import col, first

df = sc.parallelize([
    ("a", None), ("a", 1), ("a", -1), ("b", 3)
]).toDF(["k", "v"])
w = Window().partitionBy("k").orderBy("v")
df.select(col("k"), first("v").over(w).alias("fv"))
but if you want to ignore nulls you'll have to use Hive UDFs directly:
df.registerTempTable("df")

sqlContext.sql("""
    SELECT k, first_value(v, TRUE) OVER (PARTITION BY k ORDER BY v)
    FROM df""")