Consider the following DataFrame:
#+------+---+
#|letter|rpt|
#+------+---+
#|     X|  3|
#|     Y|  1|
#|     Z|  2|
#+------+---+
which can be created using the following code:
df = spark.createDataFrame([("X", 3), ("Y", 1), ("Z", 2)], ["letter", "rpt"])
Suppose I wanted to repeat each row the number of times specified in the column rpt, just like in this question.
One way would be to replicate my solution to that question using the following pyspark-sql query:
query = """
SELECT *
FROM
(SELECT DISTINCT *,
posexplode(split(repeat(",", rpt), ",")) AS (index, col)
FROM df) AS a
WHERE index > 0
"""
query = query.replace("\n", " ") # replace newlines with spaces, avoid EOF error
spark.sql(query).drop("col").sort('letter', 'index').show()
#+------+---+-----+
#|letter|rpt|index|
#+------+---+-----+
#|     X|  3|    1|
#|     X|  3|    2|
#|     X|  3|    3|
#|     Y|  1|    1|
#|     Z|  2|    1|
#|     Z|  2|    2|
#+------+---+-----+
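To see why the WHERE index > 0 filter is needed: repeat(",", rpt) builds a string of rpt commas, and splitting that string on "," yields rpt + 1 empty strings, so posexplode emits indices 0 through rpt, one row too many. Running just the inner posexplode for a single letter (a quick sketch against the same temp view) should make this visible:
spark.sql(
    'SELECT letter, posexplode(split(repeat(",", rpt), ",")) AS (index, col) FROM df'
).where("letter = 'Y'").show()
#+------+-----+---+
#|letter|index|col|
#+------+-----+---+
#|     Y|    0|   |
#|     Y|    1|   |
#+------+-----+---+
Dropping the index = 0 row leaves exactly rpt copies of each input row.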
This works and produces the correct answer. However, I am unable to replicate this behavior using the DataFrame API functions.
I tried:
import pyspark.sql.functions as f
df.select(
    f.posexplode(f.split(f.repeat(",", f.col("rpt")), ",")).alias("index", "col")
).show()
But this results in:
TypeError: 'Column' object is not callable
Why am I able to pass the column as an input to repeat within the query, but not from the API? Is there a way to replicate this behavior using the Spark DataFrame functions?
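For reference, the failure is specific to the second argument: the DataFrame API's repeat wraps a Scala function whose count parameter is a plain integer, so (at least in the Spark version used here) it accepts a Column as the string to repeat, but not as the count. A minimal sketch, reusing the df defined above:
import pyspark.sql.functions as f

# Works: the count is a literal Python int
df.select(f.repeat(f.col("letter"), 3).alias("r")).show()
#+---+
#|  r|
#+---+
#|XXX|
#|YYY|
#|ZZZ|
#+---+

# Fails with TypeError: the count is a Column
df.select(f.repeat(",", f.col("rpt"))).show()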
One option is to use pyspark.sql.functions.expr, which allows you to use column values as inputs to Spark SQL functions: the string is handed to the SQL parser, which resolves rpt as a column reference rather than requiring a Python literal.
Based on @user8371915's comment, I have found that the following works:
from pyspark.sql.functions import expr

df.select(
    '*',
    expr('posexplode(split(repeat(",", rpt), ","))').alias("index", "col")
).where('index > 0').drop("col").sort('letter', 'index').show()
#+------+---+-----+
#|letter|rpt|index|
#+------+---+-----+
#|     X|  3|    1|
#|     X|  3|    2|
#|     X|  3|    3|
#|     Y|  1|    1|
#|     Z|  2|    1|
#|     Z|  2|    2|
#+------+---+-----+
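Note that expr is only needed for the one function that refuses a Column: split and posexplode both accept Column arguments directly, so the same query can be written almost entirely with API functions, wrapping just the repeat call in expr. A sketch, assuming the same df:
import pyspark.sql.functions as f

df.select(
    '*',
    f.posexplode(f.split(f.expr('repeat(",", rpt)'), ',')).alias('index', 'col')
).where(f.col('index') > 0).drop('col').sort('letter', 'index').show()
This should produce the same output as the expr-only version above.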